MicheleCotrufo / pdf2bib

A Python library and command-line tool that quickly and automatically generates BibTeX data from the PDF file of a scientific publication.

Incorrect bibliographic information extracted from OpenReview PDFs #17

Open yutojubako opened 1 month ago

yutojubako commented 1 month ago

I've encountered a problem where pdf2bib is extracting incorrect bibliographic information from PDFs obtained from OpenReview. In some cases, the extracted BibTeX entries correspond to entirely different papers.

Steps to reproduce

  1. Download a PDF from an OpenReview forum (e.g., https://openreview.net/forum?id=C0jJAbMMub)
  2. Use pdf2bib to extract bibliographic information from the downloaded PDF
  3. Observe that the resulting BibTeX entry does not match the paper's actual information
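To make the final check (does the BibTeX entry match the paper?) repeatable across several PDFs, a rough title comparison like the one below could help. This is only a sketch: `bibtex_title`, `titles_match`, the 0.8 threshold, and the sample entry are names and values I made up for illustration. None of them are part of pdf2bib, and the BibTeX string would need to be replaced with whatever the tool actually returns.

```python
import re
from difflib import SequenceMatcher
from typing import Optional


def bibtex_title(bibtex: str) -> Optional[str]:
    """Return the title field of a BibTeX entry, or None if there is none."""
    # \b keeps "booktitle" from matching; [{"]+ tolerates double braces.
    match = re.search(
        r'\btitle\s*=\s*[{"]+(.+?)[}"]+\s*,?\s*$',
        bibtex,
        re.IGNORECASE | re.MULTILINE,
    )
    return match.group(1).strip() if match else None


def _normalize(text: str) -> str:
    """Lowercase and collapse everything except letters and digits."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def titles_match(expected: str, bibtex: str, threshold: float = 0.8) -> bool:
    """True if the BibTeX title is close enough to the expected title."""
    found = bibtex_title(bibtex)
    if found is None:
        return False
    ratio = SequenceMatcher(None, _normalize(expected), _normalize(found)).ratio()
    return ratio >= threshold


# Hypothetical, trimmed-down entry standing in for the tool's output.
sample_entry = """@article{hypothetical_key,
  title = {Segment Anything},
  year = {2023}
}"""

# Placeholder for the title shown on the OpenReview page of the downloaded PDF.
expected_title = "Title of the paper that was actually downloaded"

if not titles_match(expected_title, sample_entry):
    print("Mismatch: got", repr(bibtex_title(sample_entry)))
```

A fuzzy comparison is used instead of strict equality because BibTeX titles often differ from the PDF title in case, braces, or punctuation.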

Expected behavior

The extracted BibTeX entry should correspond to the paper from which the PDF was obtained.

Actual behavior

The extracted BibTeX entry corresponds to a different paper. For example, when processing a PDF from the OpenReview forum mentioned above, the tool returns bibliographic information for the "Segment Anything" paper (https://arxiv.org/abs/2304.02643) instead.

Additional information

This issue appears to occur with multiple PDFs from OpenReview, not just a single instance.

Possible causes

  1. Incorrect metadata in the PDFs from OpenReview
  2. An issue with pdf2bib's parsing logic for OpenReview PDFs
  3. A problem with the online database or API that pdf2bib might be using for verification

Suggested next steps

  1. Investigate the metadata of affected PDFs to check for anomalies
  2. Review pdf2bib's parsing logic for OpenReview documents
  3. Check if there are any issues with external APIs or databases used by pdf2bib
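As a starting point for the first item, the script below scans the raw bytes of a PDF for arXiv IDs and DOIs. It is a minimal sketch using only the standard library. The function names and regular expressions are my own assumptions and do not reflect how pdf2bib works internally. It also only sees uncompressed parts of the file (typically the metadata and link annotations), so an empty result does not prove that the PDF contains no identifiers.

```python
import re
from pathlib import Path
from typing import Dict, List

# Matches "arXiv:2304.02643" and "arxiv.org/abs/2304.02643" style references.
ARXIV_RE = re.compile(
    rb"arxiv(?:\.org/(?:abs|pdf)/|:)\s*(\d{4}\.\d{4,5})", re.IGNORECASE
)
# Loose DOI pattern: "10.", a registrant code, a slash, then a suffix.
DOI_RE = re.compile(rb'\b(10\.\d{4,9}/[^\s"<>()]+)')


def find_identifiers(data: bytes) -> Dict[str, List[str]]:
    """Return the distinct arXiv IDs and DOIs found in a byte string."""
    arxiv_ids = {m.decode("ascii") for m in ARXIV_RE.findall(data)}
    dois = {m.decode("ascii", errors="replace") for m in DOI_RE.findall(data)}
    return {"arxiv": sorted(arxiv_ids), "doi": sorted(dois)}


def find_identifiers_in_file(path: str) -> Dict[str, List[str]]:
    """Read a PDF from disk and scan its raw bytes for identifiers."""
    return find_identifiers(Path(path).read_bytes())


# Synthetic bytes for illustration only; the DOI is a made-up placeholder.
sample = b"/URI (https://arxiv.org/abs/2304.02643) doi:10.1234/example.doi "
print(find_identifiers(sample))
```

If an affected file turns out to contain the identifier of a paper it merely cites or links to, that would be consistent with the wrong entry being returned, though it would not confirm the cause on its own.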

I'm happy to provide more information or specific examples if needed. Thank you for your attention to this issue.