Closed tomleung1996 closed 4 years ago
Hey @tomleung1996, thanks for the interest!
The whole dataset uncompressed is less than 1TB, so you should be fine on storage. The other specs all look fine to me; the only concern is how many CPU cores you can multiprocess over.
S2ORC is essentially a raw corpus of individual papers, so the easiest way to work with it is to define some sort of processing function that converts a single paper into the format you'd like for your downstream application. For example, some people who want to work on language modeling will convert each paper into newline-separated paragraph blocks and remove all the metadata. Others who want to work on citation contexts will run a spacy sentencizer over paragraph blocks and only keep sentences with citation mentions in them.
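To make the first conversion concrete, here is a minimal sketch of such a processing function. It assumes the parsed full text sits under a `body_text` list of paragraph dicts with a `text` field, as in the pdf_parse records; check this against your release's schema before relying on it. The example record is fabricated, just to show the shape:

```python
import json


def paper_to_paragraph_blocks(paper: dict) -> str:
    """Flatten one parsed paper into newline-separated paragraph text,
    dropping all metadata. Assumes a `body_text` list of {'text': ...}
    dicts; adjust the field names to your release's schema."""
    paragraphs = [p['text'] for p in paper.get('body_text', []) if p.get('text')]
    return '\n'.join(paragraphs)


# fabricated example record, only to illustrate the expected shape
example = {'paper_id': '12345',
           'body_text': [{'text': 'First paragraph.'},
                         {'text': 'Second paragraph.'}]}
print(paper_to_paragraph_blocks(example))
```

You would call a function like this on each `json.loads`-ed line and write the result to your output file.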
Regardless of what you do, it's just faster if you can multiprocess that function across a lot of the S2ORC blobs; otherwise, it can take a while to go through 81M rows one by one. (Some people have still done it that way when there was no other option.)
As for MongoDB, sorry I'm unfamiliar with this.
As for the software solution, what we've been doing is entirely in Python:
```python
# sketch: fill in the per-paper processing before running
import json
import multiprocessing


def process_s2orc_papers(batch: dict):
    infile = batch['infile']
    outfile = batch['outfile']
    with open(infile) as f_in, open(outfile, 'w') as f_out:
        for line in f_in:
            paper_dict = json.loads(line)
            # ...do something with this paper...
            # f_out.write(...output...)


batches = [
    {
        'infile': fname,
        'outfile': ...,  # or whatever output format you want
    }
    for fname in ['0.jsonl', '1.jsonl', ..., '9999.jsonl']  # one batch per blob
]

if __name__ == '__main__':
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as p:
        p.map(process_s2orc_papers, batches)
```
Thanks! I want to store the metadata, citation context, and the citation relationship in the database.
I thought that your team might have performed some bibliometric analyses on this corpus (e.g. constructing a citation network), so I was wondering what software solution you were using, not just how to preprocess the raw data with Python.
The intention behind this dataset is a general-purpose raw corpus that preserves structure in the full-text. We're expecting people to perform their own processing of the dataset in whatever manner they choose to derive their own respective task dataset.
For example, we have projects that use S2ORC as an upstream resource, and each of them has its own processing code.
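As one sketch of the kind of downstream processing a bibliometrics project might do, here is a citation-graph builder over the metadata rows. It assumes each row carries a `paper_id` and an `outbound_citations` list of cited paper ids; the field names may differ in your release, and the two sample rows are fabricated just to show the shape:

```python
import json
from collections import defaultdict


def build_citation_graph(jsonl_lines):
    """Build a paper_id -> set-of-cited-paper_ids adjacency map from
    metadata rows. Assumes 'paper_id' and 'outbound_citations' fields;
    check the names against your release's schema."""
    graph = defaultdict(set)
    for line in jsonl_lines:
        row = json.loads(line)
        for cited in row.get('outbound_citations', []):
            graph[row['paper_id']].add(cited)
    return graph


# fabricated two-row sample, only to illustrate the expected shape
rows = ['{"paper_id": "1", "outbound_citations": ["2", "3"]}',
        '{"paper_id": "2", "outbound_citations": ["3"]}']
graph = build_citation_graph(rows)
```

An adjacency map like this can then be handed to whatever graph library you prefer.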
Thank you @kyleclo, you are very helpful!
Will S2ORC be updated on a regular basis, like every 6 months or 1 year? If so, can I use the `aws s3 sync` command to update incrementally? That would be awesome.
We want to do regular updates, but it's a tad expensive to perform over this many papers. Still thinking through how to do this.
I think long-term, we might want something like incremental exports. But the first set of updates of the corpus export will likely be full batch dumps.
What sort of application are you thinking about that requires timely syncs?
I am considering S2ORC as a good alternative to Web of Science or Scopus for bibliometric analysis (WoS and Scopus don't provide citation spans or the full text). It doesn't require timely syncs; once every year or two is fast enough.
I think once your team finalizes the structure of this corpus, it can be updated incrementally, since the references of a published paper usually don't change.
Thank you!
Thank you for your generosity in sharing this great corpus! However, I have some questions about using this dataset.
1) I don't have a powerful cluster. What I have is a PC with an i7-8700K, 64GB RAM, and 10TB of storage. Is it feasible to perform analysis on this corpus with my PC? It would be awesome if you could share recommended hardware, as I have no experience dealing with such a large amount of data.
2) I want to parse the `jsonl` files and put them into a MongoDB database, but I don't know if that is good practice, since it requires additional time and space to parse and store. Can you also share the software solution employed by your team to deal with this dataset? Thanks!