bionlplab / bioc

Data structures and code to read/write BioC XML and Json.
MIT License

Incrementally decoding the BioC Json of `.tar.gz`-collection #20

Closed raven44099 closed 1 year ago

raven44099 commented 1 year ago

PubMed Central provides its Open Access articles in BioC JSON format (see API and Bulk Download). I downloaded one portion with `wget https://ftp.ncbi.nlm.nih.gov/pub/wilbur/BioC-PMC/PMC095XXXXX_json_ascii.tar.gz` and want to apply a filter document by document (I need to save memory). I tried the following code:

from tqdm import tqdm
import gzip
import io
import json

keyword = 'diabetes'
my_doi_list = []
path_file_PMC = '/content/PMC095XXXXX_json_ascii.tar.gz'
path_file_PMC_filtered = '/content/result'

with gzip.open(path_file_PMC, 'rb') as gz, open(path_file_PMC_filtered, 'wb') as f_out:
    f = io.BufferedReader(gz)
    for line in tqdm(f.readlines()):
        record = json.loads(line)
        # doi = record['documents'][0]['passages'][0]['infons']['article-id_doi']
        if keyword in record['documents'][0]['passages'][0]['text']: 
            # my_doi_list.append(doi)
            f_out.write(line)

But I get an error:

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

```py
  0%|          | 0/95046 [00:00
     19     # f = gz
     20     for line in tqdm(f.readlines()):
---> 21         record = json.loads(line)
     22         doi = record['documents'][0]['passages'][0]['infons']['article-id_doi']
     23         if keyword in record['documents'][0]['passages'][0]['text']:  # TODO: <<< change this to your filter

2 frames
/usr/lib/python3.7/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    346             parse_int is None and parse_float is None and
    347             parse_constant is None and object_pairs_hook is None and not kw):
--> 348         return _default_decoder.decode(s)
    349     if cls is None:
    350         cls = JSONDecoder

/usr/lib/python3.7/json/decoder.py in decode(self, s, _w)
    335
    336         """
--> 337         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338         end = _w(s, end).end()
    339         if end != len(s):

/usr/lib/python3.7/json/decoder.py in raw_decode(self, s, idx)
    353             obj, end = self.scan_once(s, idx)
    354         except StopIteration as err:
--> 355             raise JSONDecodeError("Expecting value", s, err.value) from None
    356         return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
```

Then I found your Python package and thought I could use one of the examples provided at https://bioc.readthedocs.io/en/latest/biocjson.html but I don't want to unzip or untar the archive, and the examples don't make clear which format the file at `filename` must have. Is it possible to use the functions of your package with a `.tar.gz` file as input, or do I need to unzip it first (without untarring)?
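For reference, here is a minimal sketch of what I am trying to achieve, using only the standard library. `gzip.open` removes only the gzip layer, so the stream it returns is still a tar archive, and its first bytes are a tar header rather than JSON, which would explain the `Expecting value` error above. The sketch reads the archive with `tarfile` in streaming mode instead. The function names (`iter_bioc_collections`, `mentions`, `filter_archive`) are hypothetical, and it assumes that every regular file in the archive holds one BioC JSON collection with a `documents` list; I have not verified this against the real PMC archive.

```python
import json
import tarfile


def iter_bioc_collections(tar_gz_path):
    """Yield (member_name, parsed_json) for each regular file in a .tar.gz."""
    # Mode "r|gz" reads the archive as a forward-only stream, so nothing is
    # extracted to disk and only one member is held in memory at a time.
    with tarfile.open(tar_gz_path, mode="r|gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, etc.
            fileobj = tar.extractfile(member)
            if fileobj is None:
                continue
            # Assumption: each member is a single JSON document.
            yield member.name, json.load(fileobj)


def mentions(collection, keyword):
    """Return True if any passage text in the collection contains keyword."""
    # Assumption: the layout is collection["documents"][i]["passages"][j]["text"].
    return any(
        keyword in passage.get("text", "")
        for document in collection.get("documents", [])
        for passage in document.get("passages", [])
    )


def filter_archive(tar_gz_path, out_path, keyword):
    """Write matching collections to out_path, one JSON object per line."""
    kept = 0
    with open(out_path, "w", encoding="utf-8") as f_out:
        for _name, collection in iter_bioc_collections(tar_gz_path):
            if mentions(collection, keyword):
                f_out.write(json.dumps(collection) + "\n")
                kept += 1
    return kept
```

With my paths this would be called as `filter_archive(path_file_PMC, path_file_PMC_filtered, keyword)`. Unlike my first attempt, it checks every passage rather than only the first one. If the package's JSON reader accepts an open file object, it could presumably replace the `json.load` call, but I have not checked that.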