allenai / s2orc

S2ORC: The Semantic Scholar Open Research Corpus: https://www.aclweb.org/anthology/2020.acl-main.447/

S3 Metadata Archive Format #5

Closed daniel-chou closed 4 years ago

daniel-chou commented 4 years ago

The metadata file at the S3 path (s3://ai2-s2-gorc-release/20190928/metadata.tar.gz) appears to be in bzip2 format. Would you consider renaming the file? Perhaps metadata.tar.bz2 would be a possibility?
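For anyone who wants to check a downloaded archive themselves, here is a minimal sketch that guesses the compression from the file's leading "magic" bytes instead of trusting the extension. The helper names (`sniff_compression`, `sniff_file`, `open_tar_any`) are made up for illustration and are not part of any S2ORC tooling.

```python
import bz2
import gzip
import tarfile

# Leading "magic" bytes that identify each format.
GZIP_MAGIC = b"\x1f\x8b"
BZIP2_MAGIC = b"BZh"


def sniff_compression(head: bytes) -> str:
    """Guess the compression format from the first bytes of a file."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(BZIP2_MAGIC):
        return "bzip2"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first few bytes of `path` and guess its compression."""
    with open(path, "rb") as f:
        return sniff_compression(f.read(3))


def open_tar_any(path: str) -> tarfile.TarFile:
    """Open a tar archive, letting tarfile detect the compression itself."""
    return tarfile.open(path, mode="r:*")


if __name__ == "__main__":
    print(sniff_compression(bz2.compress(b"demo")))   # bzip2
    print(sniff_compression(gzip.compress(b"demo")))  # gzip
```

Because `tarfile.open(..., mode="r:*")` detects the compression transparently, a mislabeled archive like this one can usually still be read in Python without renaming it.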

kyleclo commented 4 years ago

Hey @daniel-chou, sorry for the delay.

While looking into the metadata issue, we decided to drop the single tarball altogether. Instead, we've sharded the metadata into multiple files, one per batch of papers.

They can be found at ai2-s2-gorc-release/20190928/metadata/, named 0.tsv, 1.tsv, ..., 10000.tsv.
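As a rough sketch of how the shards might be consumed once downloaded, the snippet below walks a local directory of `<number>.tsv` files in numeric order and yields their rows. It assumes the shards are already on local disk and are plain tab-separated UTF-8 text; it makes no assumption about the column layout or whether a header row is present. The function names are hypothetical.

```python
import csv
from pathlib import Path
from typing import Iterator, List


def shard_paths(directory: str) -> List[Path]:
    """Return the <number>.tsv files in `directory`, sorted numerically."""
    paths = [p for p in Path(directory).glob("*.tsv") if p.stem.isdigit()]
    # Sort by the integer value so that 2.tsv comes before 10.tsv.
    return sorted(paths, key=lambda p: int(p.stem))


def iter_rows(directory: str) -> Iterator[List[str]]:
    """Yield every row of every shard as a list of tab-separated fields."""
    for path in shard_paths(directory):
        with open(path, newline="", encoding="utf-8") as f:
            # QUOTE_NONE treats quote characters as ordinary text (an assumption
            # about the data, chosen so stray quotes cannot merge rows).
            yield from csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
```

Sorting on the integer stem rather than the file name avoids the lexicographic ordering in which `10.tsv` would come before `2.tsv`.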

Hopefully this makes things easier!

daniel-chou commented 4 years ago

@kyleclo Thanks for sharding the metadata file into multiple TSV files! This works well for me. 👍

Just to confirm: the .tsv files range from 0.tsv to 9998.tsv; neither 9999.tsv nor 10000.tsv exists. Is this correct?
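A gap check like this one can be sketched in a few lines. The example below compares a listing of file names against an expected numeric range; the listing here is simulated, and `missing_shards` is a hypothetical helper, not an existing tool.

```python
from typing import Iterable, List


def missing_shards(present: Iterable[str], first: int, last: int) -> List[str]:
    """Return the expected '<n>.tsv' names in [first, last] absent from `present`."""
    have = set(present)
    return [f"{n}.tsv" for n in range(first, last + 1) if f"{n}.tsv" not in have]


if __name__ == "__main__":
    # Simulated listing matching what was observed: 0.tsv through 9998.tsv.
    listing = [f"{n}.tsv" for n in range(0, 9999)]
    print(missing_shards(listing, 0, 10000))  # ['9999.tsv', '10000.tsv']
```

In practice the `present` names would come from whatever listing of the bucket or download directory is at hand.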

kyleclo commented 4 years ago

Ah, nice catch! Looks like we forgot to copy 9999.tsv over. It's added now. There is no 10000.tsv. Thanks!

daniel-chou commented 4 years ago

Great! I downloaded 9999.tsv just now. Thank you for resolving this issue. 👍