MicheleCotrufo / pdf2doi

A python library/command-line tool to extract the DOI or other identifiers of a scientific paper from a pdf file.

Add file DOI check to URL paths #19

Closed by DJRHails 2 years ago

DJRHails commented 2 years ago

Often a search can surface DOI descriptors in the URL path alone, for instance:

[pdf2doi]: Performing google search with key "The Experimental Generation of Interpersonal Closeness: A Procedure and Some Preliminary Findings"
[pdf2doi]: and looking at the first 6 results...
[pdf2doi]: Looking for a valid identifier in the search result #1 : https://journals.sagepub.com/doi/pdf/10.1177/0146167297234003
[pdf2doi]: Looking for a valid identifier in the search result #2 : https://journals.sagepub.com/doi/abs/10.1177/0146167297234003

Checking the URL itself would give quicker identifications, and would also cover cases such as this one, where the DOI can't be extracted from the page content.

https://doi.org/10.1177/0146167297234003
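A rough sketch of what such a URL check could look like, not pdf2doi's actual code. The names `DOI_PATTERN` and `doi_from_url` are hypothetical, and the regular expression is an assumption based on the commonly used pattern for modern DOIs (prefix `10.` plus 4 to 9 digits, a slash, then a suffix); it will not match every DOI ever registered.

```python
import re
from urllib.parse import urlparse, unquote

# Assumed pattern for modern DOIs: "10." + 4-9 digits + "/" + suffix.
# Hypothetical name; not part of pdf2doi.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def doi_from_url(url):
    """Return the first DOI-like string in the URL path, or None.

    Hypothetical helper: only the path is inspected, so query strings
    and fragments are ignored.
    """
    # Decode percent-escapes such as %2F before matching.
    path = unquote(urlparse(url).path)
    match = DOI_PATTERN.search(path)
    if match is None:
        return None
    # Drop trailing dots or slashes that are unlikely to belong to the DOI.
    return match.group(0).rstrip("./")


if __name__ == "__main__":
    for result in (
        "https://journals.sagepub.com/doi/pdf/10.1177/0146167297234003",
        "https://journals.sagepub.com/doi/abs/10.1177/0146167297234003",
    ):
        print(result, "->", doi_from_url(result))
```

A match found this way is only a candidate: a path can carry extra segments after the DOI (for example a trailing `/full`), so the string would still need to be validated before being accepted.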

MicheleCotrufo commented 2 years ago

That's an interesting suggestion, thanks! I will implement it in the next version!

DJRHails commented 2 years ago

I ended up implementing this along with a few other improvements on a fork.

Happy to create a PR to merge upstream if there is interest.

MicheleCotrufo commented 2 years ago

Thanks a lot for your efforts! Yes, can you create the PR? I will check it in detail and approve it within a few days.

DJRHails commented 2 years ago

See https://github.com/MicheleCotrufo/pdf2doi/pull/20