PubMed Central provides its Open Access articles in BioC JSON format (see API and Bulk Download). I downloaded one part of the archive with
wget https://ftp.ncbi.nlm.nih.gov/pub/wilbur/BioC-PMC/PMC095XXXXX_json_ascii.tar.gz
and want to apply a filter document by document, because I need to save memory. I tried the following code:
import gzip
import io
import json

from tqdm import tqdm

keyword = 'diabetes'
my_doi_list = []
path_file_PMC = '/content/PMC095XXXXX_json_ascii.tar.gz'
path_file_PMC_filtered = '/content/result'

# Open the archive with gzip only and treat every line as one JSON record
with gzip.open(path_file_PMC, 'rb') as gz, open(path_file_PMC_filtered, 'wb') as f_out:
    f = io.BufferedReader(gz)
    for line in tqdm(f.readlines()):
        record = json.loads(line)  # this is the call that raises the error
        # doi = record['documents'][0]['passages'][0]['infons']['article-id_doi']
        if keyword in record['documents'][0]['passages'][0]['text']:
            # my_doi_list.append(doi)
            f_out.write(line)
But I get an error: JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```py
0%| | 0/95046 [00:00, ?it/s]
---------------------------------------------------------------------------
JSONDecodeError Traceback (most recent call last)
[](https://localhost:8080/#) in
19 # f = gz
20 for line in tqdm(f.readlines()):
---> 21 record = json.loads(line)
22 doi = record['documents'][0]['passages'][0]['infons']['article-id_doi']
23 if keyword in record['documents'][0]['passages'][0]['text']: # TODO: <<< change this to your filter
2 frames
[/usr/lib/python3.7/json/__init__.py](https://localhost:8080/#) in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
346 parse_int is None and parse_float is None and
347 parse_constant is None and object_pairs_hook is None and not kw):
--> 348 return _default_decoder.decode(s)
349 if cls is None:
350 cls = JSONDecoder
[/usr/lib/python3.7/json/decoder.py](https://localhost:8080/#) in decode(self, s, _w)
335
336 """
--> 337 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
338 end = _w(s, end).end()
339 if end != len(s):
[/usr/lib/python3.7/json/decoder.py](https://localhost:8080/#) in raw_decode(self, s, idx)
353 obj, end = self.scan_once(s, idx)
354 except StopIteration as err:
--> 355 raise JSONDecodeError("Expecting value", s, err.value) from None
356 return obj, end
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```
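My current guess at the cause, which I have not confirmed against the real archive: `gzip.open` only removes the gzip layer, so what I iterate over is still a tar archive, and its first bytes are a tar header (starting with a member name) rather than JSON. The sketch below reproduces that situation with a tiny in-memory archive; the helper names and the member name `PMC0000001.json` are made up for illustration.

```python
import gzip
import io
import json
import tarfile


def make_tar_gz(docs):
    """Build a small in-memory .tar.gz with one JSON file per (name, dict) pair."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, doc in docs:
            data = json.dumps(doc).encode("ascii")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def first_line_parses_as_json(tar_gz_bytes):
    """Mimic my failing loop: gunzip only, then json.loads on the first line."""
    with gzip.open(io.BytesIO(tar_gz_bytes), "rb") as gz:
        first_line = gz.readline()  # starts with the tar header, not with JSON
    try:
        json.loads(first_line)
        return True
    except ValueError:  # JSONDecodeError is a subclass of ValueError
        return False


# Hypothetical member name and content, only for the demonstration
archive = make_tar_gz([("PMC0000001.json", {"documents": []})])
print(first_line_parses_as_json(archive))
```

If this guess is right, the tar layer has to be handled as well, not only the gzip layer.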
Then I found your Python package and thought I could use one of the code examples provided here: https://bioc.readthedocs.io/en/latest/biocjson.html
However, I would like to avoid unzipping or untarring the archive, and from your code examples it is not clear which format the file at `filename` is expected to have. Is it possible to use the functions of your package with the `.tar.gz` file as input, or do I need to unzip it first (without untarring)?