MicheleCotrufo / pdf-renamer

A python tool to automatically rename the pdf files of scientific publications by looking up the publication metadata on the web.

Error occurs when twice renaming #10

Closed yuriever closed 1 year ago

yuriever commented 1 year ago

Hi! Thanks for this very useful tool!

I installed the following version:

pip install pdf-renamer==1.0rc9

and met the following three issues:

Error occurs when twice renaming

I tried to rename the same pdf twice with different formats

pdfrenamer 'test' -f '{YYYY}{MM} {T}, {A3etal}' -fr
pdfrenamer 'test' -f '{YYYY} {T}, {A3etal}' -fr 

and the second run fails. The full output is

Last login: Tue Jan  3 21:50:12 on ttys002
myname@myname-MacBook-Air Downloads % pdfrenamer 'test' -f '{YYYY}{MM} {T}, {A3etal}' -fr
[pdf-renamer]: Looking for pdf files and subfolders in the folder test...
[pdf-renamer]: Found 1 pdf file(s).
[pdf-renamer]: ................
[pdf-renamer]: File: test/2208.14233 Top-down holography in an asymptotically flat spacetime, Kevin Costello.pdf
[pdf-renamer]: Calling the pdf2bib library to retrieve the bibtex info of this file.
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: test/2208.14233 Top-down holography in an asymptotically flat spacetime, Kevin Costello.pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Trying to retrieve a DOI/identifier for the file: test/2208.14233 Top-down holography in an asymptotically flat spacetime, Kevin Costello.pdf
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Could not find a valid identifier in the document info.
[pdf2doi]: Method #2: Looking for a valid identifier in the file name...
[pdf2doi]: Could not find a valid identifier in the file name.
[pdf2doi]: Method #3: Looking for a valid identifier in the document text...
[pdf2doi]: Extracting text with the library PyPdf...
[pdf2doi]: Text extracted succesfully. Looking for an identifier in the text...
[pdf2doi]: Validating the possible arxiv ID 2208.14233 via a query to export.arxiv.org...
[pdf2doi]: The arXiv ID 2208.14233 is validated by export.arxiv.org
[pdf2doi]: A valid arxiv ID was found in the document text.
[pdf2doi]: The arXiv ID will be replaced by the arXiv DOI 10.48550/arXiv.2208.14233. If you prefer to keep the arXiv ID, use the command -no_arxiv2doi when invoking pdf2doi
[pdf2doi]: Trying to add the tag '/pdf2doi_identifier'-> '10.48550/arXiv.2208.14233' into the metadata of the file 'test/2208.14233 Top-down holography in an asymptotically flat spacetime, Kevin Costello.pdf'...
[pdf2doi]: The tag '/pdf2doi_identifier'-> '10.48550/arXiv.2208.14233' was added succesfully to the metadata of the file 'test/2208.14233 Top-down holography in an asymptotically flat spacetime, Kevin Costello.pdf'...
[pdf2bib]: pdf2doi found a valid identifier for this paper.
[pdf2bib]: Parsing the info returned by export.arxiv.org...
[pdf2bib]: A valid BibTeX entry was generated.
[pdf-renamer]: Found bibtex data and an identifier for this file: 10.48550/arXiv.2208.14233 (arxiv DOI).
[pdf-renamer]: Found the following data:
    title = "Top-down holography in an asymptotically flat spacetime"
    published = "2022-08-30T13:01:19Z"
    ejournal = "arXiv"
    ENTRYTYPE = "article"
    url = "http://arxiv.org/abs/2208.14233v1"
    doi = "None"
    year = "2022"
    month = "08"
    day = "30"
    author = "[{'given': 'Kevin', 'family': 'Costello'}, {'given': 'Natalie M.', 'family': 'Paquette'}, {'given': 'Atul', 'family': 'Sharma'}]"
[pdf-renamer]: The new file name is test/202208 Top-down holography in an asymptotically flat spacetime, Costello, Paquette, Sharma.pdf
[pdf-renamer]: File renamed correctly.
[pdf2doi]: Trying to add the tag '/pdfrenamer_nameformat'-> '{YYYY}{MM} {T}, {A3etal}' into the metadata of the file 'test/202208 Top-down holography in an asymptotically flat spacetime, Costello, Paquette, Sharma.pdf'...
[pdf2doi]: The tag '/pdfrenamer_nameformat'-> '{YYYY}{MM} {T}, {A3etal}' was added succesfully to the metadata of the file 'test/202208 Top-down holography in an asymptotically flat spacetime, Costello, Paquette, Sharma.pdf'...
[pdf-renamer]: ................
Summaries of changes done:
2208.14233 Top-down holography in an asymptotically flat spacetime, Kevin Costello.pdf
---> 202208 Top-down holography in an asymptotically flat spacetime, Costello, Paquette, Sharma.pdf
1 file has been renamed.
myname@myname-MacBook-Air Downloads % pdfrenamer 'test' -f '{YYYY} {T}, {A3etal}' -fr 
[pdf-renamer]: Looking for pdf files and subfolders in the folder test...
[pdf-renamer]: Found 1 pdf file(s).
[pdf-renamer]: ................
[pdf-renamer]: File: test/202208 Top-down holography in an asymptotically flat spacetime, Costello, Paquette, Sharma.pdf
[pdf-renamer]: Calling the pdf2bib library to retrieve the bibtex info of this file.
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: test/202208 Top-down holography in an asymptotically flat spacetime, Costello, Paquette, Sharma.pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Trying to retrieve a DOI/identifier for the file: test/202208 Top-down holography in an asymptotically flat spacetime, Costello, Paquette, Sharma.pdf
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Standardised DOI: 10.48550/arXiv.2208.14233 -> 10.48550/arxiv.2208.14233
[pdf2doi]: Validating the possible DOI 10.48550/arxiv.2208.14233 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.48550/arxiv.2208.14233 is validated by dx.doi.org.
[pdf2doi]: Standardised DOI: 10.48550/arXiv.2208.14233 -> 10.48550/arxiv.2208.14233
[pdf2doi]: A valid DOI was found in the document info labelled '/pdf2doi_identifier'.
[pdf2bib]: pdf2doi found a valid identifier for this paper.
[pdf2bib]: Parsing the info returned by dx.doi.org...
A problem occurred when trying to parse the text: Expecting value: line 1 column 1 (char 0)
[pdf2bib]: Some error occurred when parsing the raw BibTeX data.
[pdf-renamer]: The pdf2doi library was not able to find an identifier for this pdf file.
[pdf-renamer]: ................
Summaries of changes done:
No file has been renamed.
myname@myname-MacBook-Air Downloads % 

Old arxiv-id is not recognized

When renaming old arXiv papers, pdf2doi will try searching Google. Unfortunately, Google is blocked here; this relates to pdf2doi/issues/16

Adding arxiv-id in the file name

Could you provide a tag for the arXiv ID? I want to rename the pdfs like

{arxiv-id} {T}, {A3etal}

but found that this is not supported in pdfrenamer/filename_creators.py
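
To make the request concrete, here is a minimal sketch of how a format string with tags could be expanded, including a hypothetical `{arxiv-id}` tag. This is not pdf-renamer's actual code: the function names, the "et al." rule for `{A3etal}`, and the idea of reading the ID from the `url` field are all assumptions for illustration. The metadata is copied from the log above.

```python
import re

def authors_etal(authors, limit=3):
    """Join up to `limit` family names; append 'et al.' if there are more (assumed rule)."""
    joined = ", ".join(a["family"] for a in authors[:limit])
    return joined + " et al." if len(authors) > limit else joined

def arxiv_id_from_url(url):
    """Extract a new-style arXiv ID (e.g. 2208.14233) from an abs URL, dropping the version."""
    match = re.search(r"arxiv\.org/abs/(\d{4}\.\d{4,5})(?:v\d+)?", url or "")
    return match.group(1) if match else ""

def build_filename(name_format, info):
    """Replace every {tag} in name_format with the matching metadata value."""
    values = {
        "YYYY": info.get("year", ""),
        "MM": info.get("month", ""),
        "T": info.get("title", ""),
        "A3etal": authors_etal(info.get("author", []), 3),
        "arxiv-id": arxiv_id_from_url(info.get("url", "")),  # hypothetical tag
    }
    # Unknown tags are left as they are, so typos in the format stay visible.
    return re.sub(r"\{([^{}]+)\}", lambda m: values.get(m.group(1), m.group(0)), name_format)

# Metadata as printed by pdf-renamer in the log above.
info = {
    "title": "Top-down holography in an asymptotically flat spacetime",
    "year": "2022",
    "month": "08",
    "url": "http://arxiv.org/abs/2208.14233v1",
    "author": [
        {"given": "Kevin", "family": "Costello"},
        {"given": "Natalie M.", "family": "Paquette"},
        {"given": "Atul", "family": "Sharma"},
    ],
}

print(build_filename("{arxiv-id} {T}, {A3etal}", info))
```

Old-style arXiv IDs contain a slash, so a real implementation would also need to sanitise them before using them in a file name; the sketch only handles the new-style format.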

yuriever commented 1 year ago

The Google search can be disabled by changing the default settings.ini of pdf2doi:

websearch = False
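
For anyone who prefers to script this change, the following is a minimal sketch that flips the setting with Python's standard `configparser`. The location of settings.ini and the section name (`DEFAULT` here) are assumptions; check the file shipped with your pdf2doi installation before pointing the function at it. The demo below only touches a temporary file.

```python
import configparser
import tempfile
from pathlib import Path

def disable_websearch(ini_path, section="DEFAULT", key="websearch"):
    """Set `key = False` in the given ini file (section name is an assumption)."""
    parser = configparser.ConfigParser()
    parser.read(ini_path)
    if section != "DEFAULT" and not parser.has_section(section):
        parser.add_section(section)
    parser[section][key] = "False"
    # Note: configparser drops comments when it rewrites the file.
    with open(ini_path, "w") as handle:
        parser.write(handle)

# Demo on a throwaway file, not on the real pdf2doi settings.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "settings.ini"
    demo.write_text("[DEFAULT]\nwebsearch = True\n")
    disable_websearch(demo)
    print(demo.read_text())
```
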
MicheleCotrufo commented 1 year ago

Hey! Sorry it took me so long to answer this.

  • Yes, adding the arxiv-id tag is a great idea, I will add it to the next version
  • Yes, indeed you can disable the google search by tweaking the .ini file of pdf2doi. But this is a good point, I will try to add an option to be able to disable it directly from pdf-renamer
  • Regarding the other error you mentioned, it looks like the first time you tried to rename the file, pdfrenamer found the bibtex info from export.arxiv.org, and it was able to parse it correctly. The second time, it instead used dx.doi.org to validate the DOI and get the bibtex info. For some reason, this time the raw bibtex info (from dx.doi.org) was not parsed correctly. I am going to need to run some tests, can you send me the pdf file? Also, which version of pdfrenamer are you using?

yuriever commented 1 year ago

Thanks for the reply. The original file is 2208.14233 Top-down holography in an asymptotically flat spacetime, Kevin Costello.pdf, which can be found here. The renamed file is 2208 Top-down holography in an asymptotically flat spacetime, Costello, Paquette, Sharma.pdf. The version is pdf-renamer==1.0rc9.
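
The message in the second log, "Expecting value: line 1 column 1 (char 0)", is exactly what Python's `json` module reports when it is given an empty or non-JSON string, which suggests (as an assumption, not a confirmed diagnosis) that the response for the arXiv DOI did not contain JSON. The sketch below shows one possible workaround: if parsing fails and the DOI has the arXiv prefix `10.48550/arxiv.`, recover the arXiv ID so the metadata can be requested from arXiv instead. All function names are hypothetical and none of this is pdf2bib's actual code.

```python
import json
import re

# arXiv DOIs have the form 10.48550/arXiv.<id>; the match is case-insensitive
# because the log shows the DOI being lower-cased ("Standardised DOI").
ARXIV_DOI = re.compile(r"^10\.48550/arxiv\.(.+)$", re.IGNORECASE)

def arxiv_id_from_doi(doi):
    """Return the arXiv ID embedded in an arXiv DOI, or None for other DOIs."""
    match = ARXIV_DOI.match(doi.strip())
    return match.group(1) if match else None

def parse_metadata(raw_text):
    """Return the parsed JSON body, or None if the body is not valid JSON."""
    try:
        return json.loads(raw_text)
    except json.JSONDecodeError:
        return None

def choose_source(doi, raw_text):
    """Prefer the DOI response; fall back to the arXiv ID if it cannot be parsed."""
    data = parse_metadata(raw_text)
    if data is not None:
        return ("doi", data)
    arxiv_id = arxiv_id_from_doi(doi)
    if arxiv_id is not None:
        return ("arxiv", arxiv_id)
    return ("none", None)

# An empty body reproduces the situation from the log.
print(choose_source("10.48550/arxiv.2208.14233", ""))
```

With such a fallback, the second run would have an arXiv ID to query export.arxiv.org with, which is the path that worked in the first run.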
yuriever commented 1 year ago

I also found another issue, but I am not sure it is related to pdf-renamer. Before renaming, the original file has a table of contents and hyperlinks to references (the previous example does not have a TOC, so I chose another pdf here): 2201.01630v1 Chaos in Celestial CFT, Sabrina Pasterski.pdf. After renaming and other operations (e.g. sync tools, or being opened by another pdf reader), the table of contents and hyperlinks are broken: 2208 Chaos in celestial CFT, Pasterski, Verlinde.pdf. In the screenshot, the original file on the right has a TOC, while the renamed one on the left does not.

Screenshot 2023-01-31 at 01 46 56

I'm totally unfamiliar with the structure of pdfs. This issue may depend on my environment. Do you have any idea about this?