allenai / scifact

Data and models for the SciFact verification task.

Different results for label prediction #26

Closed: msadat3 closed this issue 1 year ago

msadat3 commented 2 years ago

Hi,

I tried to reproduce the label prediction results reported in Table 3 by using the following command:

./script/label-prediction.sh roberta_base scifact dev

I get the following output:

Accuracy           0.6168
Macro F1:          0.5478
Macro F1 w/o NEI:  0.4449

                   [C      N      S     ]
F1:                [0.2703 0.7536 0.6196]
Precision:         [0.375  0.6341 0.6752]
Recall:            [0.2113 0.9286 0.5725]

Confusion Matrix:
[[ 15  24  32]
 [  2 104   6]
 [ 23  36  79]]
As you can see, the accuracy (61.68%) is lower than the 62.9% reported in the paper. I am also getting lower accuracy with roberta_large. Is this expected? Thanks in advance.
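For reference, the numbers above can be recomputed directly from the confusion matrix. A minimal sketch (just a sanity check, not part of the repo's scripts), assuming rows are gold labels and columns are predictions in the C / N / S order of the printout:

import numpy as np

# Confusion matrix from the output above: rows = gold, columns = predicted.
cm = np.array([[ 15,  24,  32],
               [  2, 104,   6],
               [ 23,  36,  79]])

accuracy = np.trace(cm) / cm.sum()            # 0.6168
precision = np.diag(cm) / cm.sum(axis=0)      # per predicted class
recall = np.diag(cm) / cm.sum(axis=1)         # per gold class
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy          {accuracy:.4f}")
print(f"Macro F1:         {f1.mean():.4f}")
print(f"Macro F1 w/o NEI: {f1[[0, 2]].mean():.4f}")   # drop the N (NEI) class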

dwadden commented 2 years ago

Hi,

I'll try to take a look later in the week, probably Friday.

Dave

msadat3 commented 2 years ago

Sounds good. Thanks for your response.

dwadden commented 1 year ago

Hi,

I ran the command you posted and the output I get is consistent with the results in the table. Full console output below. Maybe it's a versioning issue? Can you confirm that you've installed dependencies exactly as specified in the README? If that's not the issue, I can send you the intermediate outputs I get from label-prediction.sh so that we can track down where things are different.

Dave

(scifact) $ ./script/label-prediction.sh roberta_base scifact dev
Running pipeline on dev set.
Data directory already exists. Skip download.
Model label_roberta_base_scifact already exists. Skip download.

Retrieving oracle abstracts.

Selecting oracle rationales.

Predicting labels.
claim_and_rationale
Using device "cuda"
100%|███████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:04<00:00, 71.67it/s]

Evaluating.
Accuracy           0.6293
Macro F1:          0.5635
Macro F1 w/o NEI:  0.4498

                   [C      N      S     ]
F1:                [0.2712 0.7909 0.6284]
Precision:         [0.3404 0.6887 0.6667]
Recall:            [0.2254 0.9286 0.5942]

Confusion Matrix:
[[ 16  20  35]
 [  2 104   6]
 [ 29  27  82]]
msadat3 commented 1 year ago

Hello,

I did look at the instructions in the README file and tried installing all the dependencies. When I ran the command "pip install -r requirements.txt", I received the following error:

ERROR: Cannot install -r requirements.txt (line 2) and urllib3==1.26.5 because these package versions have conflicting dependencies.

The conflict is caused by:
    The user requested urllib3==1.26.5
    botocore 1.15.36 depends on urllib3<1.26 and >=1.20; python_version != "3.4"

To fix this you could try to:
1. loosen the range of package versions you've specified
2. remove package versions to allow pip attempt to solve the dependency conflict

I worked around this by changing "urllib3==1.26.5" to just "urllib3" in the "requirements.txt" file. That is probably not ideal, but all other dependencies installed successfully after this change.
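For completeness, here is a quick way to confirm what actually ended up installed in the environment (the package list below is just the ones mentioned in this thread, not the full requirements.txt):

from importlib.metadata import version, PackageNotFoundError

# Compare these against the pins in requirements.txt.
for pkg in ["urllib3", "botocore", "transformers", "torch"]:
    try:
        print(f"{pkg}=={version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg} is not installed")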

Unfortunately, when I ran "label-prediction.sh" in this new conda environment (with all dependencies installed), the program got stuck and, after a while, exited with the following error:

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

Evaluating.
Traceback (most recent call last):
  File "verisci/evaluate/label_prediction.py", line 40, in <module>
    print(f'Accuracy           {round(sum([pred_labels[i] == true_labels[i] for i in range(len(pred_labels))]) / len(pred_labels), 4)}')
ZeroDivisionError: division by zero

So I had to run the program in my existing environment instead. I assumed that since I was using the same model weights and it was able to make predictions, the results would be the same.

I hope the details I provided will help in pinpointing the issue. Thank you so much for your time.
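As far as I can tell, the ZeroDivisionError is just a downstream symptom: the CUDA failure means no predictions get produced, so the evaluation script ends up dividing by an empty list. A minimal isolation test (plain PyTorch, nothing SciFact-specific) that should show whether cuBLAS matrix multiplies work at all on this machine:

import torch

device = torch.device("cuda")
a = torch.randn(512, 512, device=device)
b = torch.randn(512, 512, device=device)
c = a @ b                    # float32 matmul goes through cublasSgemm
torch.cuda.synchronize()     # force the kernel to actually execute
print("matmul OK:", c.shape)

If this snippet fails with the same CUBLAS error, the problem is the PyTorch/CUDA setup rather than anything in the SciFact code.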

dwadden commented 1 year ago

Ah, I see. That's weird about the dependencies; I'm pretty sure I didn't have this problem when I created the repo. To fix it, I altered requirements.txt to pin urllib3==1.25.11. Everything installed OK, and prediction worked the same as before.

Can you confirm that this works for you? If so, I'll update the requirements.txt file accordingly.

msadat3 commented 1 year ago

Thanks for looking into it.

Yes, that resolves the dependency issue, but the program still gets stuck and exits with the same error as before.

I updated this line to the following:

model = AutoModelForSequenceClassification.from_pretrained(args.model, config=config).eval()
print('model loaded')            # this prints, so loading the checkpoint works
model = model.to(device)
print('model moved to cuda')     # this never prints; the hang is in .to(device)

It looks like the issue is with moving the model to CUDA, since it gets stuck after printing 'model loaded'. My CUDA version is 11.4. Maybe I need an older version?
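One thing worth checking before chasing an older driver is which CUDA toolkit the installed PyTorch wheel was actually built against (standard torch introspection calls; nothing here is SciFact-specific):

import torch

print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)     # toolkit the wheel was compiled against
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))

If the wheel was built against a toolkit that doesn't match what the shared server provides, that mismatch would be the first thing to look at.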

dwadden commented 1 year ago

That might be it. The setup I'm using has CUDA 11.2. Is there any way for you to check whether it works with an older version?

msadat3 commented 1 year ago

I don't think so, since mine is a shared server. Any pointers on where the newer version of transformers (compatible with CUDA 11.4) could be doing things differently and producing a different score?

dwadden commented 1 year ago

Unfortunately, I don't have any ideas other than identifying specific examples where the prediction changes and then trying to debug the forward pass of the model (i.e., looking at the activations at each layer). This would likely be painful. If you're able to localize the change to the version of transformers being used, you might be able to get help by posting an issue on the Huggingface website. Other than that, I'm sorry to say I can't offer much more help if you're using different software than I did. I'll update the README to specify the version of CUDA I used to get my results. One final idea: try running inference on CPU and see if you can match my results that way.
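In case it's useful, the "look at activations at each layer" idea can be done with forward hooks. A rough sketch, assuming the same AutoModelForSequenceClassification checkpoint the pipeline loads; the model path and example inputs below are placeholders, not values from the repo:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "path/to/label_roberta_base_scifact"   # placeholder path
tokenizer = AutoTokenizer.from_pretrained(model_name)

def run_with_hooks(device):
    model = AutoModelForSequenceClassification.from_pretrained(model_name).eval().to(device)
    acts = {}
    def make_hook(name):
        def hook(module, inputs, output):
            out = output[0] if isinstance(output, tuple) else output
            if torch.is_tensor(out):
                acts[name] = out.detach().float().cpu()
        return hook
    handles = [m.register_forward_hook(make_hook(n)) for n, m in model.named_modules() if n]
    enc = tokenizer("example claim", "example rationale", return_tensors="pt").to(device)
    with torch.no_grad():
        model(**enc)
    for h in handles:
        h.remove()
    return acts

cpu_acts = run_with_hooks(torch.device("cpu"))
gpu_acts = run_with_hooks(torch.device("cuda"))

# Report the modules whose outputs diverge beyond normal fp32 noise.
for name, cpu_out in cpu_acts.items():
    gpu_out = gpu_acts.get(name)
    if gpu_out is not None and gpu_out.shape == cpu_out.shape:
        diff = (cpu_out - gpu_out).abs().max().item()
        if diff > 1e-3:    # loose tolerance; CPU vs GPU fp32 will differ slightly anyway
            print(f"{name}: max abs diff {diff:.4g}")

In the environment where GPU inference runs but produces different scores, this should at least narrow down where the CPU and GPU paths first disagree.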

msadat3 commented 1 year ago

I completely understand. Thanks for all your help.

Yes, I was able to reproduce your results when I ran on CPU.
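For anyone else hitting this, one low-touch way to force CPU inference (assuming the script falls back to CPU when no GPU is visible) is to hide the GPUs from PyTorch before it initializes CUDA; the same variable can also be exported in the shell before running label-prediction.sh:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""   # must be set before torch touches CUDA

import torch
print(torch.cuda.is_available())          # now prints False, so code falls back to CPU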

dwadden commented 1 year ago

OK, I'm glad the CPU results work at least. Sorry we couldn't get to the bottom of it... GPUs are mysterious.

msadat3 commented 1 year ago

No problem at all. I will let you know if I can figure out a fix. Thanks again.