CrossRef / pdfextract

MOVED TO https://gitlab.com/crossref/pdfextract
MIT License

Dupes with extract-bibs #16

Open paulusm opened 9 years ago

paulusm commented 9 years ago

The extracted BibTeX files often contain exact duplicate entries, which causes problems when I try to parse them.

jdherman commented 9 years ago

Yeah, this is a pain, but I don't think there is an easy fix. It can happen for two reasons:

  1. the PDF parser incorrectly splits a single reference into two, both of which then resolve to the same DOI, or
  2. the web API incorrectly resolves two different references (say, with similar authors) to the same DOI.

In either case it would be tough to guarantee no duplicates. I usually use a BibTeX manager like JabRef or BibDesk to clean things up and remove duplicates before merging into my main .bib file. I wouldn't trust the extract-bibs output to compile as-is without cleaning it up first.
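
If a reference manager isn't an option, a rough pre-parse cleanup can be scripted. The following is only a minimal sketch and not part of pdfextract: the helper names (`split_entries`, `entry_key`, `dedupe`) are hypothetical, and it assumes each entry starts with `@` at the beginning of a line and stores its DOI in a `doi = {...}` field. It keeps the first entry seen for each DOI and falls back to whitespace-normalized text when no DOI is present.

```python
import re


def split_entries(bibtex_text):
    """Split BibTeX text into entries.

    Assumption: every entry begins with '@' at the start of a line.
    """
    parts = re.split(r"\n\s*(?=@)", bibtex_text.strip())
    return [p.strip() for p in parts if p.strip().startswith("@")]


def entry_key(entry):
    """Return an identity key: the lowercased DOI if present,
    otherwise the entry text with whitespace collapsed."""
    m = re.search(r'\bdoi\s*=\s*[{"]([^}"]+)[}"]', entry, flags=re.IGNORECASE)
    if m:
        return ("doi", m.group(1).strip().lower())
    return ("text", " ".join(entry.split()))


def dedupe(bibtex_text):
    """Keep the first entry for each key, preserving the original order."""
    seen = set()
    kept = []
    for entry in split_entries(bibtex_text):
        key = entry_key(entry)
        if key not in seen:
            seen.add(key)
            kept.append(entry)
    return kept


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    sample = """@article{a1,
  title = {First paper},
  doi = {10.1000/abc}
}
@article{a1,
  title = {First paper},
  doi = {10.1000/ABC}
}
@article{b2,
  title = {Second paper},
  doi = {10.1000/xyz}
}"""
    print("\n\n".join(dedupe(sample)))
```

Note that this only hides the symptom. In the second case above (two different references resolved to the same DOI), dropping the repeat silently discards a reference that was mis-resolved, so the output still needs a manual check.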