acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

[Bug report] Hash mismatch error for file #2047

Open niranjanaunnithan opened 1 year ago

niranjanaunnithan commented 1 year ago

I am encountering the error "ERROR Hash mismatch for file /acl-anthology/bin/../build/anthology-files/pdf/emoji/2022.emoji-1.0.pdf, downloaded from https://aclanthology.org/2022.emoji-1.0.pdf. was f30c68cc should be 36bb53fe" for several PDF files while generating the anthology via the make mirror command.

xinru1414 commented 1 year ago

@niranjanaunnithan Can you elaborate? I don't fully understand the bug. The xml file has the correct hash for pdf file https://aclanthology.org/2022.emoji-1.0.pdf

niranjanaunnithan commented 1 year ago

Hi, we were trying to download the pdf files from acl-anthology by using the make mirror command. On examining the log file, we came across the following error multiple times.

ERROR Hash mismatch for file /acl-anthology/bin/../build/anthology-files/pdf/emoji/2022.emoji-1.0.pdf, downloaded from https://aclanthology.org/2022.emoji-1.0.pdf. was f30c68cc should be 36bb53fe
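When the same error appears many times, it can help to pull the affected files out of the log programmatically. The following is a minimal sketch, not part of the anthology tooling: the regular expression is modelled only on the single log line quoted above, and the names `MISMATCH_RE` and `parse_mismatches` are hypothetical.

```python
import re

# Pattern inferred from the one log line quoted in this issue; the real
# log format may differ in other cases (assumption).
MISMATCH_RE = re.compile(
    r"Hash mismatch for file (?P<path>\S+), "
    r"downloaded from (?P<url>\S+)\. "
    r"was (?P<actual>[0-9a-f]{8}) should be (?P<expected>[0-9a-f]{8})"
)


def parse_mismatches(lines):
    """Return one dict per hash-mismatch line, skipping all other lines."""
    found = []
    for line in lines:
        match = MISMATCH_RE.search(line)
        if match:
            found.append(match.groupdict())
    return found


if __name__ == "__main__":
    sample = (
        "ERROR Hash mismatch for file "
        "/acl-anthology/bin/../build/anthology-files/pdf/emoji/2022.emoji-1.0.pdf, "
        "downloaded from https://aclanthology.org/2022.emoji-1.0.pdf. "
        "was f30c68cc should be 36bb53fe"
    )
    for entry in parse_mismatches([sample]):
        print(entry["url"], entry["actual"], entry["expected"])
```

Here `actual` is the hash of the downloaded file and `expected` is the value the local checkout asked for, following the "was ... should be ..." wording of the message.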

We also observed that only 74234 pdf files were downloaded. Please find a screenshot of a portion of the logs.

[image: screenshot of a portion of the logs]

akoehn commented 1 year ago

Regarding this:

> We also observed that only 74234 pdf files were downloaded

Not all papers have PDFs that can be downloaded, so this is totally fine.

Regarding the hash mismatches: this is due to either an out-of-date checkout of the git repository (the hashes may have changed in the meantime), a network problem (is this reproducible when you run it again? The script should only retry the files that failed before), or a problem on the server side that needs to be addressed.

I checked the emoji one and anthology_utils.compute_hash_from_file('/tmp/2022.emoji-1.0.pdf') returns 'f30c68cc' (as in your comment) but the XML file also has this hash: https://github.com/acl-org/acl-anthology/blob/7e309c89b81af82cc47194a48b63d31487c69766/data/xml/2022.emoji.xml#L17

So, my guess is that your local repository is outdated -- the hash in our data was changed 11 days ago because the PDFs were updated.
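The check described above can be reproduced locally with a short sketch. It is a stand-in for `anthology_utils.compute_hash_from_file`, not its actual implementation: it assumes the hash is a CRC32 checksum printed as eight lowercase hex digits (consistent with the eight-character values in the log, but not confirmed in this thread), and it assumes the XML stores the value in a `hash` attribute on a `url` element. The function names and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET
import zlib


def crc32_hex(data: bytes) -> str:
    """CRC32 of the data as eight lowercase hex digits (assumed hash format)."""
    return f"{zlib.crc32(data) & 0xFFFFFFFF:08x}"


def file_hash(path: str) -> str:
    """Hash the full contents of a local file."""
    with open(path, "rb") as handle:
        return crc32_hex(handle.read())


def recorded_hash(xml_text: str, file_id: str):
    """Return the hash attribute of the <url> whose text equals file_id, or None."""
    root = ET.fromstring(xml_text)
    for url in root.iter("url"):
        if (url.text or "").strip() == file_id:
            return url.get("hash")
    return None


# Illustrative fragment only; the real file layout is an assumption.
SAMPLE_XML = """
<collection id="2022.emoji">
  <volume id="1">
    <frontmatter>
      <url hash="f30c68cc">2022.emoji-1.0</url>
    </frontmatter>
  </volume>
</collection>
"""

if __name__ == "__main__":
    print(recorded_hash(SAMPLE_XML, "2022.emoji-1.0"))
```

Comparing `file_hash` of the downloaded PDF with `recorded_hash` from the local XML shows which side is stale: if the file's hash matches the XML on GitHub but not the local XML, the checkout is the outdated part.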

niranjanaunnithan commented 1 year ago

Hi. I followed the steps as suggested and ran the script again after a git pull (this was done on July 21). I am still seeing the hash mismatch error in the logs. Here is a snippet of the logs:

Files that could not be downloaded

https://aclanthology.org/P19-2050v1.pdf

Files with checksum mismatch

https://aclanthology.org/1991.tc-1.1.pdf
https://aclanthology.org/2006.amta-panels.0.pdf
https://aclanthology.org/2006.amta-panels.1.pdf
https://aclanthology.org/2006.amta-panels.2.pdf
https://aclanthology.org/2006.amta-panels.3.pdf
https://aclanthology.org/2006.amta-panels.4.pdf
https://aclanthology.org/2006.amta-panels.5.pdf
https://aclanthology.org/2017.iwslt-1.0.pdf
https://aclanthology.org/2021.acl-long.79.pdf
https://aclanthology.org/2021.acl-srw.16.pdf
https://aclanthology.org/2021.acl-long.79v2.pdf
https://aclanthology.org/2021.americasnlp-1.pdf
https://aclanthology.org/2021.autosimtrans-1.pdf
https://aclanthology.org/2021.calcs-1.pdf
https://aclanthology.org/2021.clpsych-1.pdf
https://aclanthology.org/2021.cmcl-1.pdf
https://aclanthology.org/2021.dash-1.pdf
https://aclanthology.org/2021.deelio-1.pdf
https://aclanthology.org/2021.emnlp-main.300.pdf
https://aclanthology.org/2021.emnlp-main.409.pdf
https://aclanthology.org/2021.emnlp-main.824.pdf
https://aclanthology.org/2021.motra-1.0.pdf
https://aclanthology.org/2021.mrl-1.5v1.pdf
https://aclanthology.org/2021.mtsummit-up.pdf
https://aclanthology.org/2021.naacl-demos.pdf
https://aclanthology.org/2021.naacl-srw.pdf
https://aclanthology.org/2021.naacl-tutorials.pdf
https://aclanthology.org/2021.naacl-industry.pdf
https://aclanthology.org/2021.naacl-main.189.pdf
https://aclanthology.org/2021.nlp4if-1.pdf
https://aclanthology.org/2021.nlpmc-1.pdf
https://aclanthology.org/2021.privatenlp-1.pdf
https://aclanthology.org/2021.sdp-1.pdf
https://aclanthology.org/2021.smm4h-1.pdf
https://aclanthology.org/2021.socialnlp-1.pdf
https://aclanthology.org/2021.splurobonlp-1.pdf
https://aclanthology.org/2021.teachingnlp-1.pdf
https://aclanthology.org/2021.textgraphs-1.pdf
https://aclanthology.org/2021.trustnlp-1.pdf
https://aclanthology.org/2021.vigil-1.pdf
https://aclanthology.org/2021.wmt-1.73.pdf
https://aclanthology.org/2022.acl-long.52.pdf
https://aclanthology.org/2022.iwslt-1.9.pdf
https://aclanthology.org/2022.repl4nlp-1.pdf

niranjanaunnithan commented 1 year ago

Hi. Is there any update on this issue? I am still experiencing this even with a fresh clone of the latest master branch.