allenai / scifact

Data and models for the SciFact verification task.

Assertion error while predicting labels #16

Closed MHDBST closed 3 years ago

MHDBST commented 3 years ago

Hi, I'm using scifact to do fact checking on a personal dataset. First I create a corpus, and then claims, in the format suggested by the code. I run the following code to generate predictions on my own claims:

In the middle of running, after reading a number of lines and retrieving abstracts, I get this error:

Predicting labels.
claim_and_rationale
Using device "cuda"
 53%|██████████████████████████████████████████████████▉                                              | 14037/26431 [02:24<02:07, 96.94it/s]
Traceback (most recent call last):
  File "verisci/inference/label_prediction/transformer.py", line 68, in <module>
    encoded_dict = encode([evidence], [claim])
  File "verisci/inference/label_prediction/transformer.py", line 50, in encode
    return_tensors='pt'
  File "/home/mbastan/context_home/anaconda2/envs/scifact/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 1239, in batch_encode_plus
    return_special_tokens_mask=return_special_tokens_masks,
  File "/home/mbastan/context_home/anaconda2/envs/scifact/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 1371, in prepare_for_model
    stride=stride,
  File "/home/mbastan/context_home/anaconda2/envs/scifact/lib/python3.7/site-packages/transformers/tokenization_utils.py", line 1510, in truncate_sequences
    assert len(ids) > num_tokens_to_remove
AssertionError

The script then tries to skip this step, but it cannot continue because the merged_predictions.jsonl file was never created. Why does this error occur, and how can I solve it?

dwadden commented 3 years ago

You're running the model on your own dataset, is that correct?

First, can you confirm that you've set up your virtualenv as described in the README? Just want to rule out software dependency issues.

If that's not it, it looks like something's going wrong in the tokenizer. I think the first thing to do is to isolate the example that's causing the problem and figure out why it's breaking the tokenizer. My guess is that you're passing an input sequence longer than the 512-token max for RoBERTa, but that's just a guess.
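As a starting point, something like the sketch below would flag the offending pairs. The `"id"`, `"claim"`, and `"evidence_sentences"` field names are guesses about your claims file layout, so adjust them to match; `encode` should behave like `tokenizer.encode(evidence, claim)` from the transformers library.

```python
import json

MAX_TOKENS = 512  # RoBERTa's maximum sequence length


def find_overlong_pairs(claims_path, encode):
    """Scan a claims JSONL file and return (claim_id, token_count) for every
    evidence/claim pair whose encoded length exceeds MAX_TOKENS.

    `encode` is a callable like `tokenizer.encode(evidence, claim)`; the
    field names below are assumptions about the custom dataset's layout.
    """
    offenders = []
    with open(claims_path) as f:
        for line in f:
            entry = json.loads(line)
            for sentence in entry.get("evidence_sentences", []):
                n_tokens = len(encode(sentence, entry["claim"]))
                if n_tokens > MAX_TOKENS:
                    offenders.append((entry["id"], n_tokens))
    return offenders
```

Running this over your claims file should tell you whether any pair exceeds the limit, and if so, which claim IDs to look at.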

If you can't get it working, feel free to post a minimal code example with the exact string that's causing the tokenization error, and I can try to help debug further.
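For what it's worth, here is a simplified pure-Python sketch of what I believe the failing truncation step does. This is my reading of the traceback, not the actual transformers code: in the truncation strategy it appears to be using, all overflow tokens are removed from the first sequence, so the assertion fires whenever that sequence is shorter than the overflow, e.g. when the other sequence alone nearly fills the 512-token budget.

```python
def truncate_only_first(ids, pair_ids, num_tokens_to_remove):
    """Sketch of a truncation strategy that drops all overflow tokens from
    the FIRST sequence only. If the first sequence (the evidence, in this
    pipeline) is shorter than the overflow, the assertion below raises,
    matching the AssertionError in the traceback."""
    if num_tokens_to_remove <= 0:
        return ids, pair_ids
    assert len(ids) > num_tokens_to_remove  # the line that raises
    return ids[:-num_tokens_to_remove], pair_ids
```

So one concrete thing to check is whether any claim is long enough that the paired evidence sentence can't absorb the overflow on its own.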

dwadden commented 3 years ago

Closing due to lack of activity.