Hey @Mayar2009 yes, I recommend doing something like this:
import gzip
import io
import json
import os
import subprocess

from tqdm import tqdm


def process_batch(batch: dict):
    # download both the metadata & full-text files for a particular shard
    cmd = ["wget", "-O", batch['input_metadata_path'], batch['input_metadata_url']]
    subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=False)
    cmd = ["wget", "-O", batch['input_pdf_parses_path'], batch['input_pdf_parses_url']]
    subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=False)

    # first, filter the metadata JSONL to only papers with a particular field of study.
    # we also remember which paper IDs to keep, so that we can get their full text later.
    paper_ids_to_keep = set()
    with gzip.open(batch['input_metadata_path'], 'rb') as gz, open(batch['output_metadata_path'], 'wb') as f_out:
        f = io.BufferedReader(gz)
        for line in tqdm(f.readlines()):
            metadata_dict = json.loads(line)
            paper_id = metadata_dict['paper_id']
            mag_field_of_study = metadata_dict['mag_field_of_study']
            if mag_field_of_study and 'Medicine' in mag_field_of_study:  # TODO: <<< change this to your filter
                paper_ids_to_keep.add(paper_id)
                f_out.write(line)

    # now, get those papers' full text
    with gzip.open(batch['input_pdf_parses_path'], 'rb') as gz, open(batch['output_pdf_parses_path'], 'wb') as f_out:
        f = io.BufferedReader(gz)
        for line in tqdm(f.readlines()):
            metadata_dict = json.loads(line)
            paper_id = metadata_dict['paper_id']
            if paper_id in paper_ids_to_keep:
                f_out.write(line)

    # finally, delete the raw files to free up space for other shards
    os.remove(batch['input_metadata_path'])
    os.remove(batch['input_pdf_parses_path'])
and running this on each shard of the dataset (i.e. incrementally downloading & keeping only the parts that you want)
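For reference, a minimal driver for the function above might look like the sketch below. The URL lists, shard numbering, and the raw/filtered directory names are illustrative placeholders rather than the actual S2ORC release links, so fill them in from the download links you were given.

# illustrative driver sketch; METADATA_URLS / PDF_PARSES_URLS and the directory layout
# are placeholders to be filled in from the download links for your S2ORC release
import os

METADATA_URLS = []     # one metadata shard URL per entry (placeholder)
PDF_PARSES_URLS = []   # the matching pdf_parses shard URL per entry (placeholder)

def make_batches():
    batches = []
    for shard_id, (meta_url, pdf_url) in enumerate(zip(METADATA_URLS, PDF_PARSES_URLS)):
        batches.append({
            'input_metadata_url': meta_url,
            'input_metadata_path': f'raw/metadata_{shard_id}.jsonl.gz',
            'output_metadata_path': f'filtered/metadata_{shard_id}.jsonl',
            'input_pdf_parses_url': pdf_url,
            'input_pdf_parses_path': f'raw/pdf_parses_{shard_id}.jsonl.gz',
            'output_pdf_parses_path': f'filtered/pdf_parses_{shard_id}.jsonl',
        })
    return batches

if __name__ == '__main__':
    os.makedirs('raw', exist_ok=True)        # wget needs the target directories to exist
    os.makedirs('filtered', exist_ok=True)
    for batch in make_batches():             # sequential; swap in multiprocessing.Pool to parallelize
        process_batch(batch)

Because each shard's raw files are deleted as soon as it is processed, peak disk usage stays at roughly one raw shard plus whatever filtered output you keep.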
Sorry, I ran this code but encountered the error "FileNotFoundError: [Errno 2] No such file or directory: '/20200705v1/metadata/raw/metadata_0.jsonl.gz'". It seems the subprocess didn't work properly. Do you know the reason?
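One possible explanation, not confirmed here: wget cannot create metadata_0.jsonl.gz because its parent directory does not exist, and since the snippet runs wget with check=False and captures its output, the failure stays silent until gzip.open looks for the file. A small sketch that creates the directory up front and lets the download fail loudly (the download helper itself is hypothetical):

import os
import subprocess

def download(url: str, out_path: str):
    # hypothetical helper: make sure the target directory exists, then surface wget errors
    os.makedirs(os.path.dirname(out_path), exist_ok=True)
    subprocess.run(["wget", "-O", out_path, url], check=True)  # raises CalledProcessError if wget fails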
How do we implement this function? My Colab session crashes every time I just try to read the URL response.
Hi @smcgrogan! This downloading function is out of date, as we've migrated to releasing S2ORC through the Semantic Scholar API. The README has a more up-to-date example of how to do this; please check that out:
https://github.com/allenai/s2orc?tab=readme-ov-file#download-instructions
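For anyone landing here later, the README flow boils down to roughly the sketch below. It assumes you have a Semantic Scholar API key and that the Datasets API response exposes per-shard download URLs under a "files" field; confirm both against the README and the API docs before relying on it.

# rough sketch of the Datasets API flow described in the README; the "files" field name
# and response shape should be double-checked against the current API documentation
import os
import requests

S2_API_KEY = os.environ["S2_API_KEY"]  # your own Semantic Scholar API key

resp = requests.get(
    "https://api.semanticscholar.org/datasets/v1/release/latest/dataset/s2orc",
    headers={"x-api-key": S2_API_KEY},
)
resp.raise_for_status()
info = resp.json()

# "files" is assumed to be a list of pre-signed, time-limited shard download URLs
for i, url in enumerate(info["files"]):
    with requests.get(url, stream=True) as r:  # stream so a shard never sits fully in memory
        r.raise_for_status()
        with open(f"s2orc_shard_{i}.jsonl.gz", "wb") as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)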
Can I get just part of this dataset, because I do not have enough space on my disk?