allenai / specter

SPECTER: Document-level Representation Learning using Citation-informed Transformers
Apache License 2.0

Use custom dataset without hard negatives #34

Open ryangawei opened 3 years ago

ryangawei commented 3 years ago

Hi,

I'm trying to create preprocessed training files from my own data. My data doesn't include any hard negatives, and when I run your script create_training_files.py, the log shows that no triplets are constructed:

2021-08-20 14:30:58,836,836 INFO [create_training_files.py:453] loading metadata: ../../data/specter/metadata.json
2021-08-20 14:30:58,907,907 INFO [create_training_files.py:457] loading data file: ../../data/specter/data.json
2021-08-20 14:30:59,040,40 INFO [create_training_files.py:466] getting instances for `data` and `train` set
2021-08-20 14:30:59,041,41 INFO [create_training_files.py:468] writing output ../../data/specter/preprocessed/data-train.p
2021-08-20 14:30:59,101,101 INFO [create_training_files.py:303] Generating triplets ...
100%|██████████| 85452/85452 [01:00<00:00, 1404.12it/s]
INFO:/home/guoao/anaconda3/envs/specter/lib/python3.7/site-packages/specter-0.0.1-py3.7.egg/specter/data_utils/triplet_sampling.py:Done generating triplets, #successful queries: 0,#skipped queries: 85452
2021-08-20 14:32:01,745,745 INFO [create_training_files.py:365] done getting triplets, success rate:0.00%,total: 0
2021-08-20 14:32:01,746,746 INFO [create_training_files.py:407] converting raw instances to allennlp instances:
0it [00:00, ?it/s]

I then dug into specter/data_utils/triplet_sampling.py and ran TripletGenerator directly to see what happens (I can't use breakpoints in a multiprocess program). It turns out that because there are no hard negatives, the margin here becomes 0.0, which leaves candidates_pos as an empty list.

If I change the line to `if candidates[j][1] >= margin + candidates[-1][1]:`, the function works. I don't really understand what margin means, and I'm not sure whether changing this line affects the generated triplets. Is it safe to do so?
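
To make the behaviour concrete, here is a minimal sketch of what seems to be happening. It is not the code from triplet_sampling.py. It assumes that candidates is a list of (paper_id, count) pairs sorted by count in descending order, that the margin is some fraction of the gap between the highest and lowest count, and that the original comparison is a strict `>`. The function select_positives, the sample data, and the 0.5 fraction are all hypothetical.

```python
from typing import List, Tuple

Candidate = Tuple[str, int]  # (paper_id, count); layout is an assumption


def select_positives(candidates: List[Candidate], margin: float,
                     inclusive: bool) -> List[Candidate]:
    """Keep candidates whose count clears the lowest count plus the margin.

    inclusive=False mimics the assumed original strict `>` comparison;
    inclusive=True mimics the `>=` change described above.
    """
    lowest = candidates[-1][1]  # list is assumed sorted, so last is lowest
    positives = []
    for j in range(len(candidates)):
        score = candidates[j][1]
        if inclusive:
            passes = score >= margin + lowest
        else:
            passes = score > margin + lowest
        if passes:
            positives.append(candidates[j])
    return positives


MARGIN_FRACTION = 0.5  # hypothetical value, for illustration only

# No hard negatives: every candidate has the same count.
candidates = [("p1", 5), ("p2", 5), ("p3", 5)]
spread = candidates[0][1] - candidates[-1][1]  # 0
margin = MARGIN_FRACTION * spread              # 0.0

strict = select_positives(candidates, margin, inclusive=False)
relaxed = select_positives(candidates, margin, inclusive=True)

# Mixed counts: the margin is non-zero, so both comparisons agree here.
mixed = [("p1", 5), ("p2", 5), ("p3", 1)]
mixed_margin = MARGIN_FRACTION * (mixed[0][1] - mixed[-1][1])  # 2.0
mixed_strict = select_positives(mixed, mixed_margin, inclusive=False)
mixed_relaxed = select_positives(mixed, mixed_margin, inclusive=True)
```

Under these assumptions, the strict comparison selects nothing when all counts are equal, while `>=` selects every candidate, including the lowest one. When the counts differ, the two versions only disagree for a candidate that sits exactly on the threshold.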

Thanks!