[Suggestion] Look into PDF text-annotations for valid DOIs

alexpiti commented 1 year ago

First of all, thanks for the awesome tool! It saved me lots of time during my bibliography/SOTA runs, or by batch-renaming 100s of PDF files for easier indexing.

Now, to the point:

a) Some background: I disabled Google-searching (Methods #4 and #5) as they rarely worked on old/no-DOI papers in my field (I am an electromagnetics engineer, working with journals from IEEE, OSA/Optica, AIP, APS, etc.). It's faster for me to open the PDF file w/ Chrome, select the title, R-click it to google-search and get the DOI. Now, to pass this DOI to PDF2DOI, I presently rename the file using the DOI as a name-string (replacing slashes with dashes), and then R-click it with PDF_renamer, done. So, it works with Method#2.

b) The Suggestion: I sometimes also copy the DOI (as URL or plain DOI, with slashes etc) into the top of the first page, for easier reference, as a text-annotation ("typewriter tool") or inside a bubble/note/comment annotation. Could PDF2DOI be made to look into these first-page annotations for the DOI, e.g., during Method#3? It would be really handy (for me)...
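A minimal sketch of the two manual steps described in (a) and (b): pulling a DOI out of a pasted string (URL or plain form) and turning it into a file-name-safe string. The regular expression is a simplified assumption, and `find_doi` and `doi_to_filename` are illustrative names, not pdf2doi functions:

```python
import re

# Simplified DOI pattern (an assumption for illustration, not pdf2doi's regex):
# "10.", a 4-9 digit registrant code, a slash, then a suffix without whitespace.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+", re.IGNORECASE)


def find_doi(text):
    """Return the first DOI-like string in `text`, or None.

    Works for a plain DOI and for a DOI embedded in a URL.
    """
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Drop punctuation that often trails a DOI pasted into a sentence.
    return match.group(0).rstrip(".,;)")


def doi_to_filename(doi):
    """Make a DOI usable as a file name by replacing slashes with dashes."""
    return doi.replace("/", "-") + ".pdf"


print(find_doi("https://doi.org/10.1364/OE.14.001957"))
print(doi_to_filename("10.1364/OE.14.001957"))
```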

Thanks for your time!

MicheleCotrufo commented 1 year ago

Hi Alexandros, nice to meet you! It also looks like we work in closely related fields :) I am very glad to see that my scripts are useful to the community.

a) Yes, indeed methods #4 and #5 are more of a "last desperate attempt". In my (few) tests, they seemed to work well for not-so-old papers, but it is very random. Can you send me a few examples of papers for which they don't work? It's always useful to have some examples of hard-to-crack pdf files, to improve the script.

b) This sounds like an interesting suggestion, and it might be an easy addition. Can you send me a few examples of pdf files with annotations?

alexpiti commented 1 year ago

Hi Michele. Wow, ultra-fast response ;-) I think that I first came across your scientific research and then fell on your GitHub community service.

a) I found that Optics Express papers (and Biomedical OpEx, BOE) are the most typical examples where, even when they have a DOI, the algo resorts to Google-searching, with limited chances of success. I am attaching two files: The OpEx paper (test1) fails entirely, whereas the BOE paper (test2) succeeds at Method#4, at the 2nd or 3rd Google result.

b) Both attached papers have the DOI in red text (typewriter) on the front page, top-left, for your tests.

And since we're here, two more thingies:

c) Can the ".google-cookie" file be auto-deleted after the program terminates? I am on Windows7 (!) :)

d) Method#4 works most of the time (if the title is successfully extracted) but Method#5 is really random and sometimes leads to wrong results (e.g. paper gets random/wrong DOI). Can I selectively disable Method#5 (but not Method#4) by setting N_characters_in_pdf = 0 in settings.ini ?
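For (c), one possible approach is a cleanup hook that deletes the cookie file when the Python process exits. The sketch below uses only the standard library; it assumes the file is named `.google-cookie` and sits in the current working directory, and `remove_cookie_file` is a hypothetical helper, not part of pdf2doi:

```python
import atexit
from pathlib import Path


def remove_cookie_file(directory=".", name=".google-cookie"):
    """Delete the cookie file if present; return True if a file was removed."""
    cookie = Path(directory) / name
    if cookie.is_file():
        cookie.unlink()
        return True
    return False


# Run the cleanup automatically when the Python process terminates normally.
atexit.register(remove_cookie_file)
```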

MicheleCotrufo commented 1 year ago

a) and b) I ran some quick tests on your files. Interestingly, it was able to find the DOI for both of them. In both cases, it finds the DOI in the text of the pdf. So I think pdf2doi can already "see" the text of your annotations. This is the log I get when I analyze those two files:

pdf2doi examples_hard_files -v -nostore
[pdf2doi]: Looking for pdf files in the folder examples_hard_files...
[pdf2doi]: Found 2 pdf files.
[pdf2doi]: ................
[pdf2doi]: Trying to retrieve a DOI/identifier for the file: examples_hard_files\test1.-.Optics.Express.no.DOI.pdf
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: An error occurred when retrieving the pdf info with PyPDF2: file has not been decrypted
[pdf2doi]: Could not find a valid identifier in the document info.
[pdf2doi]: Method #2: Looking for a valid identifier in the file name...
[pdf2doi]: Could not find a valid identifier in the file name.
[pdf2doi]: Method #3: Looking for a valid identifier in the document text...
[pdf2doi]: Extracting text with the library PyPdf...
[pdf2doi]: Text extracted succesfully. Looking for an identifier in the text...
[pdf2doi]: Could not find a valid identifier in the document text extracted by PyPdf.
[pdf2doi]: Extracting text with the library textract...
[pdf2doi]: Text extracted succesfully. Looking for an identifier in the text...
[pdf2doi]: Standardised DOI: 10.1364/OE.14.001957 -> 10.1364/oe.14.001957
[pdf2doi]: Validating the possible DOI 10.1364/oe.14.001957 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1364/oe.14.001957 is validated by dx.doi.org.
[pdf2doi]: Standardised DOI: 10.1364/OE.14.001957 -> 10.1364/oe.14.001957
[pdf2doi]: A valid DOI was found in the document text.
[pdf2doi]: 10.1364/oe.14.001957
[pdf2doi]: ................
[pdf2doi]: Trying to retrieve a DOI/identifier for the file: examples_hard_files\test2.-.BOE.DOI.is.in.the.first.page.footer.pdf
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Could not find a valid identifier in the document info.
[pdf2doi]: Method #2: Looking for a valid identifier in the file name...
[pdf2doi]: Could not find a valid identifier in the file name.
[pdf2doi]: Method #3: Looking for a valid identifier in the document text...
[pdf2doi]: Extracting text with the library PyPdf...
[pdf2doi]: Text extracted succesfully. Looking for an identifier in the text...
[pdf2doi]: Could not find a valid identifier in the document text extracted by PyPdf.
[pdf2doi]: Extracting text with the library textract...
[pdf2doi]: Text extracted succesfully. Looking for an identifier in the text...
[pdf2doi]: Standardised DOI: 10.1364/BOE.8.005594 -> 10.1364/boe.8.005594
[pdf2doi]: Validating the possible DOI 10.1364/boe.8.005594 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1364/boe.8.005594 is validated by dx.doi.org.
[pdf2doi]: Standardised DOI: 10.1364/BOE.8.005594 -> 10.1364/boe.8.005594
[pdf2doi]: A valid DOI was found in the document text.
[pdf2doi]: 10.1364/boe.8.005594
[pdf2doi]: ................
DOI             10.1364/oe.14.001957                     examples_hard_files\test1.-.Optics.Express.no.DOI.pdf
DOI             10.1364/boe.8.005594                     examples_hard_files\test2.-.BOE.DOI.is.in.the.first.page.footer.pdf
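The "Standardised DOI" and "Validating" steps visible in this log can be approximated as follows. This is a sketch under assumptions, not pdf2doi's implementation: it lower-cases the DOI as the log lines suggest, and it queries the doi.org handle API (`https://doi.org/api/handles/<DOI>`) rather than dx.doi.org, treating a `responseCode` of 1 as "the DOI exists". Both function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def standardise_doi(doi):
    """Strip whitespace and lower-case the DOI (DOIs are case-insensitive)."""
    return doi.strip().lower()


def doi_exists(doi, timeout=10):
    """Return True if doi.org knows the DOI (requires network access)."""
    url = "https://doi.org/api/handles/" + urllib.parse.quote(standardise_doi(doi))
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            payload = json.load(response)
    except (OSError, ValueError):
        # Covers HTTP errors (e.g. 404), network failures, timeouts, bad JSON.
        return False
    return payload.get("responseCode") == 1
```

Calling `doi_exists("10.1364/OE.14.001957")` would perform one HTTP request; the standardisation step works offline.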

As a second test, I removed the annotation from file 1, and ran pdf2doi on it again. In this case, indeed, it can't find the DOI in the text, but it can find it via the google search:

pdf2doi test3.-.Optics.Express.no.DOI.pdf -nostore -v
[pdf2doi]: Trying to retrieve a DOI/identifier for the file: test3.-.Optics.Express.no.DOI.pdf
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: An error occurred when retrieving the pdf info with PyPDF2: file has not been decrypted
[pdf2doi]: Could not find a valid identifier in the document info.
[pdf2doi]: Method #2: Looking for a valid identifier in the file name...
[pdf2doi]: Could not find a valid identifier in the file name.
[pdf2doi]: Method #3: Looking for a valid identifier in the document text...
[pdf2doi]: Extracting text with the library PyPdf...
[pdf2doi]: Text extracted succesfully. Looking for an identifier in the text...
[pdf2doi]: Could not find a valid identifier in the document text extracted by PyPdf.
[pdf2doi]: Extracting text with the library textract...
[pdf2doi]: Text extracted succesfully. Looking for an identifier in the text...
[pdf2doi]: Could not find a valid identifier in the document text extracted by textract.
[pdf2doi]: Could not find a valid identifier in the document text.
[pdf2doi]: Method #4: Looking for possible publication titles...
[pdf2doi]: An error occurred when retrieving the pdf info with PyPDF2: file has not been decrypted
[pdf2doi]: It was not possible to find a title for this file.
[pdf2doi]: Method #5: Trying to do a google search with the first 1000 characters of this pdf file...
[pdf2doi]: Trying to extract the first 1000 characters from the pdf file by using the library PyPdf...
[pdf2doi]: Doing a google search, looking at the first 6 results...
[pdf2doi]: Performing google search with key "Plasmonic eldenhancementandSERS intheeffectivemodevolumepicture StefanA. Maier Centre for Photonics  ...[query too long, the remaining part is suppressed in the logging]"
[pdf2doi]: and looking at the first 6 results...
[pdf2doi]: Trying to extract the first 1000 characters from the pdf file by using the library textract...
[pdf2doi]: Doing a google search, looking at the first 6 results...
[pdf2doi]: Performing google search with key "Plasmonic field enhancement and SERS  in the effective mode volume picture  Stefan A. Maier  Centre  ...[query too long, the remaining part is suppressed in the logging]"
[pdf2doi]: and looking at the first 6 results...
[pdf2doi]: Looking for a valid identifier in the search result #1 : https://opg.optica.org/abstract.cfm?uri=oe-14-5-1957
[pdf2doi]: Looking for a valid identifier in the search result #2 : https://researchportal.bath.ac.uk/en/publications/plasmonic-field-enhancement-and-sers-in-the-effective-mode-volume
[pdf2doi]: Standardised DOI: 10.1364/OE.14.001957 -> 10.1364/oe.14.001957
[pdf2doi]: Validating the possible DOI 10.1364/oe.14.001957 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1364/oe.14.001957 is validated by dx.doi.org.
[pdf2doi]: Standardised DOI: 10.1364/OE.14.001957 -> 10.1364/oe.14.001957
[pdf2doi]: A valid DOI was found with this google search.
DOI             10.1364/oe.14.001957                     test3.-.Optics.Express.no.DOI.pdf

Can you send me the logs you get when you try to analyze this file, with and without the annotation? Which version of pdf2doi do you have?

c) Indeed, the .google-cookie file is very annoying. I'll look into how it can be removed.

d) If I remember correctly, even if you set N_characters_in_pdf = 0 it will still try Method #5. It should be an easy change to allow the user to disable some methods selectively. I'll work on it soon.
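Selective disabling could look roughly like the sketch below, which reads per-method switches from ini-style text with `configparser`. The section name `[methods]` and the keys `method4_title_search` and `method5_text_search` are hypothetical and do not correspond to existing pdf2doi settings:

```python
import configparser

# Hypothetical settings.ini content; these keys are not real pdf2doi options.
EXAMPLE_INI = """
[methods]
method4_title_search = true
method5_text_search = false
"""


def enabled_methods(ini_text):
    """Return the names of the methods switched on in the [methods] section."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["methods"]
    return [key for key in section if section.getboolean(key)]


print(enabled_methods(EXAMPLE_INI))
```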

alexpiti commented 1 year ago

I installed everything fresh/clean on a new computer (Windows 10/x64, and latest Python 3.11). The version of pdf2doi is 1.04, i.e., the latest.

Ran some tests myself, on the same files, and we have some discrepancies:

For the first PDF (the OpEx), Method#1 fails with/without the annotation, with the same output (note that I have disabled WebSearch):

[pdf-renamer]: File: test1.pdf
[pdf-renamer]: Calling the pdf2bib library to retrieve the bibtex info of this file.
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: test1.pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: An error occurred when retrieving the pdf info with PyPDF2: file has not been decrypted
[pdf2doi]: Could not find a valid identifier in the document info.
[pdf2doi]: Method #2: Looking for a valid identifier in the file name...
[pdf2doi]: Could not find a valid identifier in the file name.
[pdf2doi]: Method #3: Looking for a valid identifier in the document text...
[pdf2doi]: Extracting text with the library PyPdf...
[pdf2doi]: Text extracted succesfully. Looking for an identifier in the text...
[pdf2doi]: Could not find a valid identifier in the document text extracted by PyPdf.
[pdf2doi]: Extracting text with the library textract...
[pdf2doi]: The command `pdftotext test1.pdf -` failed with exit code 127
------------- stdout -------------
------------- stderr -------------

[pdf2doi]: An error occured while loading the document text with textract. The pdf version might be not supported.
[pdf2doi]: Error from textract: The command `pdftotext test1.pdf -` failed with exit code 127
------------- stdout -------------
------------- stderr -------------

[pdf2doi]: Could not find a valid identifier in the document text.
[pdf2doi]: Method #4: Looking for possible publication titles...
[pdf2doi]: An error occurred when retrieving the pdf info with PyPDF2: file has not been decrypted
[pdf2doi]: It was not possible to find a title for this file.
[pdf2doi]: Method #5: Trying to do a google search with the first 1000 characters of this pdf file...
[pdf2doi]: NOTE: Web-search methods are currently disabled by the user. Enable it in order to use this method.
[pdf2bib]: It was not possible to find a valid identifier for this file.
[pdf-renamer]: The pdf2doi library was not able to find an identifier for this pdf file.
Summaries of changes done:
No file has been renamed.
The following pdf files could not be renamed because it was not possile to automatically find the publication identifier (DOI or arXiv ID). Try to manually add a valid identifier to each file via the command "pdf2doi 'filename.pdf' -id 'valid_identifier'" and then run again pdf-renamer.
test1.pdf

For the second PDF (the BOE), it works with Method#1 with/without the text annotation.

[pdf-renamer]: File: test2.pdf
[pdf-renamer]: Calling the pdf2bib library to retrieve the bibtex info of this file.
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: test2.pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Validating the possible DOI 10.1364/boe.8.005594 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1364/boe.8.005594 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document info labelled '/identifier'.
[pdf2bib]: pdf2doi found a valid identifier for this paper. Trying to parse the data obtained by pdf2doi into valid BibTeX data..
[pdf2bib]: A valid BibTeX entry was generated.
[pdf-renamer]: Found bibtex data and an identifier for this file: 10.1364/boe.8.005594 (DOI).
[pdf-renamer]: Found the following data:
        title = "Detection of optical activity with diode-integrated hyperbolic metasurfaces"
        volume = "8"
        issue = "12"
        page = "5594"
        publisher = "The Optical Society"
        url = "http://dx.doi.org/10.1364/boe.8.005594"
        doi = "10.1364/boe.8.005594"
        journal = "Biomedical Optics Express"
        year = "2017"
        month = "11"
        author = "[{'ORCID': 'http://orcid.org/0000-0002-9193-7805', 'authenticated-orcid': True, 'given': 'Joseph S. T.', 'family': 'Smalley', 'sequence': 'first', 'affiliation': []}, {'given': 'Felipe', 'family': 'Vallini', 'sequence': 'additional', 'affiliation': []}, {'given': 'Yeshaiahu', 'family': 'Fainman', 'sequence': 'additional', 'affiliation': []}]"
[pdf-renamer]: The new file name is .\2017_Smalley @ [Biomed. Opt. Express] Detection of optical activity with diode-integrated hyperbolic metasurfaces.pdf
[pdf-renamer]: File renamed correctly.
Summaries of changes done:
test2.pdf
---> 2017_Smalley @ [Biomed. Opt. Express] Detection of optical activity with diode-integrated hyperbolic metasurfaces.pdf
1 file has been renamed.

And now comes the madness: When the test2.pdf file (renamed the downloaded file) was on the windows desktop, method#1 did not work! When I put it in a folder (e.g. called test, on the desktop), it worked, and that's where the log above came from... I double checked this weird discrepancy a couple of times and it was consistent. Then, 10 minutes later, trying to repro it again as I type this, the pdf2doi fails (same log as test1.pdf above).... x_x

Anyway, I don't want to waste your time. Just let me know if you need more tests.

MicheleCotrufo commented 1 year ago

I assume you meant version 1.4 and not 1.04, right?

Based on your logs, it looks like an error arises when the library textract tries to read the pdf text:


[pdf2doi]: Extracting text with the library textract...
[pdf2doi]: The command `pdftotext test1.pdf -` failed with exit code 127
------------- stdout -------------
------------- stderr -------------

[pdf2doi]: An error occured while loading the document text with textract. The pdf version might be not supported.
[pdf2doi]: Error from textract: The command `pdftotext test1.pdf -` failed with exit code 127
------------- stdout -------------
------------- stderr -------------

`textract` is a powerful library to extract text from pdf files, but it often gave me issues because of some incompatibility between the versions of the libraries required by `textract` and the version of the same libraries required by other scripts.
Can you check which version of `textract` you have installed? In fact, would you mind posting the output of your `pip list` command?

- I can probably clarify the 'madness' :)  From your logs, you can see that when you analyzed the file test2.pdf the DOI was found inside a pdf metadata tag called '/identifier':

[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Validating the possible DOI 10.1364/boe.8.005594 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.1364/boe.8.005594 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found in the document info labelled '/identifier'.



Whenever `pdf2doi` associates a DOI to a pdf file (either because it finds it with one of the methods, or because the user manually adds it), the DOI is also stored in the pdf metadata, in order to speed up future lookups. So, I believe in this case you experienced a sort of memory effect.

You can "clean up" the stored metadata via

`pdf2doi filename.pdf -id ''`

This will store an empty string ('') in the tag '/identifier'.

Can you try to clean up the file test2.pdf and run the test again? If you add the `-nostore` option when you run `pdf2doi`, it will never save the DOI in the metadata.
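The "memory effect" amounts to checking a stored tag before running any other method. A minimal sketch, assuming the document info is available as a dictionary-like object (for example, the metadata mapping a PDF library exposes); `stored_identifier` is a hypothetical helper, and treating an empty string as "nothing stored" is an assumption based on the clean-up command above:

```python
def stored_identifier(document_info, tag="/identifier"):
    """Return the identifier cached in the PDF metadata, or None.

    `document_info` is any mapping of metadata tags to values. An empty
    string counts as "no identifier stored".
    """
    if not document_info:
        return None
    value = document_info.get(tag)
    return str(value) if value else None


print(stored_identifier({"/identifier": "10.1364/boe.8.005594"}))
print(stored_identifier({"/identifier": ""}))
```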

alexpiti commented 1 year ago

Hello again, Michele. Thanks for your support!

Yup, my PDF2DOI version is 1.4.post1 (not 1.04). Sorry! :)

Here's the pip list output on my old (Windows7-like server) computer:


Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\Administrator>pip list
Package Version
anyio 3.6.1
argcomplete 1.10.3
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
asttokens 2.0.5
attrs 21.4.0
Babel 2.10.3
backcall 0.2.0
backports.zoneinfo 0.2.1
beautifulsoup4 4.8.2
bibtexparser 1.3.0
bleach 5.0.1
certifi 2022.6.15
cffi 1.15.1
chardet 3.0.4
charset-normalizer 2.1.0
colorama 0.4.5
compressed-rtf 1.0.6
cycler 0.11.0
debugpy 1.6.2
decorator 5.1.1
defusedxml 0.7.1
docx2txt 0.8
easygui 0.98.3
ebcdic 1.1.1
EbookLib 0.17.1
entrypoints 0.4
executing 0.8.3
extract-msg 0.28.7
fastjsonschema 2.16.1
feedparser 6.0.10
fonttools 4.34.4
google 3.0.0
h5py 3.7.0
idna 3.3
IMAPClient 2.1.0
importlib-metadata 4.12.0
importlib-resources 5.8.0
ipdb 0.13.9
ipykernel 6.15.1
ipython 8.4.0
ipython-genutils 0.2.0
jedi 0.18.1
Jinja2 3.1.2
json5 0.9.8
jsonschema 4.7.2
jupyter-client 7.3.4
jupyter-core 4.11.1
jupyter-server 1.18.1
jupyterlab 3.4.3
jupyterlab-pygments 0.2.2
jupyterlab-server 2.15.0
kiwisolver 1.4.4
lxml 4.9.1
MarkupSafe 2.1.1
matplotlib 3.5.2
matplotlib-inline 0.1.3
mistune 0.8.4
nbclassic 0.4.3
nbclient 0.6.6
nbconvert 6.5.0
nbformat 5.4.0
nest-asyncio 1.5.5
notebook 6.4.12
notebook-shim 0.1.0
numpy 1.23.1
olefile 0.46
packaging 21.3
pandocfilters 1.5.0
parso 0.8.3
pdf-renamer 1.0rc6
pdf2bib 1.0.3
pdf2doi 1.4.post1
pdfminer.six 20191110
pdftitle 0.5
pickleshare 0.7.5
Pillow 9.2.0
pip 22.1.2
plotly 5.9.0
prettytable 3.3.0
prometheus-client 0.14.1
prompt-toolkit 3.0.30
psutil 5.9.1
pure-eval 0.2.2
pycparser 2.21
pycryptodome 3.15.0
Pygments 2.12.0
pyLLE 3.0.1
pyparsing 3.0.9
PyPDF2 2.0.0
pyperclip 1.8.2
pyrsistent 0.18.1
python-dateutil 2.8.2
python-pptx 0.6.21
pytz 2022.1
pytz-deprecation-shim 0.1.0.post0
pywin32 304
pywinpty 2.0.6
pyzmq 23.2.0
requests 2.28.1
scipy 1.8.1
Send2Trash 1.8.0
setuptools 49.2.1
sgmllib3k 1.0.0
six 1.12.0
sniffio 1.2.0
sortedcontainers 2.4.0
soupsieve 2.3.2.post1
SpeechRecognition 3.8.1
stack-data 0.3.0
tenacity 8.0.1
terminado 0.15.0
textract 1.6.4
tinycss2 1.1.1
toml 0.10.2
tornado 6.2
traitlets 5.3.0
typing_extensions 4.3.0
tzdata 2022.2
tzlocal 4.2
Unidecode 1.3.4
urllib3 1.26.10
wcwidth 0.2.5
webencodings 0.5.1
websocket-client 1.3.3
xlrd 1.2.0
XlsxWriter 3.0.3
zipp 3.8.1

[notice] A new release of pip available: 22.1.2 -> 22.3.1



* Thanks for the clarification on the '/identifier' tag, sneakily added. I recall seeing this in your readme, and now I know what it means :) It's indeed really helpful to me, as I often do "re-sweeps" of my PDF files, to update the naming string format, or add a new journal abbreviation. Specifically, I am now testing an option to display the authors string as: "First (Last)" if the paper has N>=3 authors. For N=2, it shows as First & Second.

*  So, I ran PDFrenamer on the two files again, cleanly re-downloaded (with the DOI in text annotation), on this older system (the one with the ```pip list``` above). Alas, it failed on both, again on textract, just like on my other system. Textract version is 1.6.4.
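The author-string rule mentioned in the first point above can be sketched as follows, using family names only. Reading "First (Last)" as "first author, with the last author in parentheses" is an assumption, as is the handling of a single author, since the description covers only N=2 and N>=3; `format_authors` is an illustrative name:

```python
def format_authors(family_names):
    """Build the author part of a file name from a list of family names."""
    if not family_names:
        return ""
    if len(family_names) == 1:
        return family_names[0]  # assumption: a lone author is shown as-is
    if len(family_names) == 2:
        return f"{family_names[0]} & {family_names[1]}"
    # Three or more authors: first author, with the last author in parentheses.
    return f"{family_names[0]} ({family_names[-1]})"


print(format_authors(["Smalley", "Vallini", "Fainman"]))
```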

MicheleCotrufo commented 1 year ago

Thanks for providing all these details. I have been doing some digging around, to understand why textract gives you the 'failed with exit code 127' error. I found this https://stackoverflow.com/questions/63359767/textract-failed-with-exit-code-127-pdftotext-on-windows-10 and I wonder if it's the problem you are having.

If you type pdftotext in your command prompt and press enter, what do you get?

MicheleCotrufo commented 1 year ago

Actually, nevermind. I think I found some good workaround.

1) I realized that there was a bug in the part of the code that looks for the title of the publication (Method #4) and then googles it. Due to this bug, pdf2doi was not correctly finding the titles of those pdf files. Now it can.

2) I was able to make some changes to the code so that the PyPDF2 library can also see the text annotations (previously it couldn't). In this way, you don't need to rely on the other library, textract.

I should be able to release a new tentative version of pdf2doi with these fixes in a couple of days.
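For illustration, reading annotation text from the first page could look like the sketch below. It is not the change made in pdf2doi; it assumes the `pypdf` package (the successor of PyPDF2) is installed and that the annotation text is stored in the `/Contents` entry, which is where the PDF format keeps the text of note and free-text ("typewriter") annotations. `collect_contents` and `first_page_annotation_texts` are hypothetical helper names.

```python
def collect_contents(annotations):
    """Return the non-empty /Contents strings of a list of annotation dicts."""
    texts = []
    for annotation in annotations:
        contents = annotation.get("/Contents")
        if contents:
            texts.append(str(contents))
    return texts


def first_page_annotation_texts(pdf_path):
    """Return the annotation texts found on the first page of a PDF."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    if reader.is_encrypted:
        # Many publisher PDFs are "encrypted" with an empty user password.
        reader.decrypt("")
    page = reader.pages[0]
    if "/Annots" not in page:
        return []
    # Entries of /Annots are usually indirect references; resolve them first.
    return collect_contents([ref.get_object() for ref in page["/Annots"]])
```

The returned strings could then be searched for a DOI in the same way as the regular page text.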

alexpiti commented 1 year ago

Great news! Waiting for next version of PDF2DOI then :) Thanks again!

For the record, calling pdftotext through windows cmd returns: 'pdftotext' is not recognized as an internal or external command, operable program or batch file. So, I suppose that it's missing... From the SO link you posted, there's this little bit in the comments: "The missing program is pdftotext.exe which is part of Poppler. However latest poppler release does not include binaries for Windows, but you can use the package released in msys2 or poppler-windows." Anyway, not doing anything for the moment.
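Exit code 127 is the conventional status for "command not found", which is consistent with the message above. A quick way to check from Python whether an external program such as `pdftotext` is reachable on the PATH is `shutil.which`; the helper name below is illustrative:

```python
import shutil


def has_executable(name):
    """Return True if `name` can be found on the PATH of this process."""
    return shutil.which(name) is not None


if not has_executable("pdftotext"):
    print("pdftotext was not found on PATH; text extraction through it will fail.")
```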

MicheleCotrufo commented 1 year ago

Hey, sorry for the long wait. Can you try to install the new version 1.5rc5 of pdf2doi inside a fresh python installation (or also a virtual environment) ?

pip install pdf2doi==1.5rc5

This version should NOT install textract automatically. Instead, it relies on a different library (pdfminer), which should be as powerful as textract but without the additional problems and the heavier load of dependencies.
This version should also be able to find the text in the annotations inside your pdf files.
I also improved the part of the script that looks for possible titles.

Can you run some test on your side? It would be great if you can run it on as many pdf files as possible, and let me know if any error comes up.

Thanks!

alexpiti commented 1 year ago

Hi Michele. No worries and thanks for the update!

So, I did a fresh "over-install" of 1.5rc5 with pip as you said (on an old python 3.8.9 installation), so that it kept my old PDF2DOI settings (no websearch). Then,

(1) With websearch=off, it found the DOIs in the text-annotations I put in the two PDFs (BOE and OpEx). That's great!

(2) Still with websearch=off, when I stripped the PDF from comments, it could not find either of the two. Note that the BOE has its DOI in the first page footer... So we have a miss here :(

(3) With websearch=on and 6 results, it now found both DOIs. The BOE on 2nd result, the OpEx on 3rd-4th (maybe because "Fi" in "Field" [2nd word in its title] produces an unrecognized glyph [?] which was used in its google-searching). So, all good here, too.

I will keep testing and report back on its behaviour, when/if I stumble on something. But, as it is now, it serves me 100% 🥇

MicheleCotrufo commented 1 year ago

Hey, sorry again for the late answer - December is always kind of crazy :) I took a look into this. It seems that, for some reason, there was an issue when I uploaded version 1.5rc5 to PyPI: it did not truly upload the latest version of the files. Sorry about that.

I uploaded another version, 1.5rc6, which should have the latest version of the files:

pip install pdf2doi==1.5rc6

I installed it on a fresh version of python and I ran some tests:

With websearch=off, it can find the DOIs of the BOE and OpEx files with the comments.
With websearch=off, and with both pdf files stripped of the comments, it can still find the DOI of the BOE paper, but not the one of the OpEx paper. This is expected since the OpEx paper does not contain the DOI anywhere, so the web search is needed.
With websearch=on, it can find the DOI for both papers, with and without comments.

Let me know if you get the same results! I will probably release the v1.5 some time in the next days.

alexpiti commented 1 year ago

Hey there. December crazy here as well! Typical end-of-year/closure-of-deliverables season for academia...

So, I installed 1.5rc6 and ran tests on the two files with the three settings/combinations you did, and got the exact same behaviour. It took google-searching several efforts for the OpEx, but it eventually succeeded. As I said, the problem is due to the weird glyph "fi" in the PDF title (where the "f" and the "i" are merged) that can't be properly extracted, and turns out as an unnamed character "?". Not typical, but not improbable.

Thanks again for the support and wishing you a merry holiday season and a happier new year :)

MicheleCotrufo commented 1 year ago

Nice! Thanks for checking this. Yes, also on my computer it "merges" the "f" and "i" when identifying the title of the paper. It might be due to some old standard used by OSA to craft pdf files... In my case, it can find the DOI in the 4th result of the google search.
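If the extracted title contains the single ligature character "ﬁ" (U+FB01), Unicode compatibility normalisation splits it back into "f" and "i". This sketch only helps in that case; it cannot recover letters that the extraction dropped entirely or replaced with "?". The function name is illustrative:

```python
import unicodedata


def expand_ligatures(text):
    """Replace compatibility characters such as the 'fi' ligature (U+FB01)."""
    return unicodedata.normalize("NFKC", text)


title = "Plasmonic \ufb01eld enhancement and SERS"
print(expand_ligatures(title))
```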

Thanks again!

MicheleCotrufo / pdf2doi

[Suggestion] Look into PDF text-annotations for valid DOIs #22