allenai / specter

SPECTER: Document-level Representation Learning using Citation-informed Transformers
Apache License 2.0

Create preprocessed training files: metadata.json is missing ids in the train.txt, test.txt and val.txt #13

Open shauryr opened 4 years ago

shauryr commented 4 years ago

When I run the following:

python specter/data_utils/create_training_files.py \
--data-dir data/training \
--metadata data/training/metadata.json \
--outdir data/preprocessed/

I get the log message "done getting triplets, success rate:0.00%"

and my data-metrics.json looks like this:

{
  "train": 0,
  "val": 0,
  "test": 0
}

I debugged the code and found that a KeyError is raised when self.metadata is looked up. It looks like the ids in train.txt, val.txt and test.txt are not in the metadata.json file.

Please help and share the correct metadata.json file.
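One way to confirm the diagnosis above is to list which ids in each split file have no entry in metadata.json. The following is a minimal sketch, not code from the SPECTER repository. It assumes that train.txt, val.txt and test.txt hold one paper id per line and that metadata.json is a JSON object keyed by paper id. The helper names load_ids, find_missing_ids and report are hypothetical.

```python
import json
from pathlib import Path


def load_ids(path):
    """Read ids from a text file, assuming one id per line (blank lines skipped)."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def find_missing_ids(ids, metadata):
    """Return the ids, in order, that have no key in the metadata dict."""
    return [pid for pid in ids if pid not in metadata]


def report(data_dir, metadata_path):
    """Map each split name to the list of its ids missing from metadata.json."""
    # Assumption: metadata.json is a JSON object keyed by paper id.
    metadata = json.loads(Path(metadata_path).read_text())
    result = {}
    for split in ("train", "val", "test"):
        ids = load_ids(Path(data_dir) / f"{split}.txt")
        result[split] = find_missing_ids(ids, metadata)
    return result
```

Calling report("data/training", "data/training/metadata.json") with the paths from the command above would return a dict of missing ids per split. If every list is as long as its split file, the two files do not share any ids.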

chashimo commented 4 years ago

I got the same problem. It seems that metadata.json requires 'paper_id' in addition to 'title' and 'abstract'.
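A quick check for the fields mentioned above could look like the sketch below. It assumes metadata.json has already been loaded into a dict whose values are per-paper dicts. The name entries_missing_fields is hypothetical, and the field list only reflects what this comment reports, not a confirmed schema.

```python
# Fields named in this thread; treat the list as an assumption, not a spec.
REQUIRED_FIELDS = ("paper_id", "title", "abstract")


def entries_missing_fields(metadata, required=REQUIRED_FIELDS):
    """Map each metadata key to the required fields its entry lacks."""
    problems = {}
    for key, entry in metadata.items():
        missing = [field for field in required if field not in entry]
        if missing:
            problems[key] = missing
    return problems
```

An empty result means every entry carries all three fields. Otherwise the returned dict shows which entries need to be completed.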

armancohan commented 3 years ago

The sample metadata file was updated and this should be fixed now. Let us know if you still have issues.

malteos commented 3 years ago

I still have the same problem. Apparently, most paper_ids do not match. For example:

2020-10-27 11:38:16,851,851 ERROR [create_training_files.py:358] '1a090df137014acab572aa5dc23449b270db64b4'
2020-10-27 11:38:16,852,852 INFO [create_training_files.py:362] done getting triplets, success rate:0.00%,total: 15

sergeyf commented 3 years ago

@armancohan any updates here?

yrrah commented 3 years ago

The data.json contains many ids that don't exist in metadata.json. I made a new data.json that works: data.txt
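For anyone who prefers to filter their own data.json rather than use the attachment, the idea can be sketched as follows. This is not necessarily how the attached file was produced. It assumes data.json maps each query paper id to a dict of candidate paper ids, and that metadata.json is keyed by paper id. The function name filter_data is hypothetical.

```python
def filter_data(data, metadata):
    """Keep only queries and candidates whose ids appear in metadata.

    Assumed layout: data = {query_id: {candidate_id: info, ...}, ...}.
    """
    filtered = {}
    for query_id, candidates in data.items():
        if query_id not in metadata:
            continue  # the query paper itself has no metadata
        kept = {cid: info for cid, info in candidates.items() if cid in metadata}
        if kept:  # drop queries left with no usable candidates
            filtered[query_id] = kept
    return filtered
```

The result can be written back with json.dump and used in place of the original data.json. The ids in train.txt, val.txt and test.txt would need the same filtering so that they only reference queries that remain.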

ryangawei commented 3 years ago

@yrrah thanks for the solution. It works for me!