MicheleCotrufo / pdf-renamer

A python tool to automatically rename the pdf files of scientific publications by looking up the publication metadata on the web.

How to jump the errors and continue? #9

Closed SHEN-Cheng closed 5 months ago

SHEN-Cheng commented 1 year ago

I have a lot of subfolders, and some PDF papers can't be recognized and make the program crash.

How can I skip these errors and continue renaming the remaining files?

MicheleCotrufo commented 1 year ago

Hi, thanks for the feedback. Can you post the logs of a situation where these errors occur? I need to understand if these errors are generated (and caught) by pdfrenamer, or by python.

SHEN-Cheng commented 1 year ago

Thanks, I have put the error information below.

```
[pdf2bib]: Trying to extract data to generate the BibTeX entry for the file: /mnt/c/Users/shen1/OneDrive - University of Gothenburg/paper/AI/Nature-Deep learning for multi-year ENSO forecasts/机器学习简介及其在短临天气预警中的应用(20200325)(1).pdf
[pdf2bib]: Calling pdf2doi...
[pdf2doi]: Method #1: Looking for a valid identifier in the document infos...
[pdf2doi]: Could not find a valid identifier in the document info.
[pdf2doi]: Method #2: Looking for a valid identifier in the file name...
[pdf2doi]: Could not find a valid identifier in the file name.
[pdf2doi]: Method #3: Looking for a valid identifier in the document text...
[pdf2doi]: Extracting text with the library PyPdf...
[pdf2doi]: Text extracted succesfully. Looking for an identifier in the text...
[pdf2doi]: Could not find a valid identifier in the document text extracted by PyPdf.
[pdf2doi]: Extracting text with the library textract...
[pdf2doi]: Text extracted succesfully. Looking for an identifier in the text...
[pdf2doi]: Could not find a valid identifier in the document text extracted by textract.
[pdf2doi]: Could not find a valid identifier in the document text.
[pdf2doi]: Method #4: Looking for possible publication titles...
[pdf2doi]: Found 2 possible title(s).
[pdf2doi]: Trying possible title #1 '机器学习简介及其在短临天气预警中的应用(20200325)(1).pdf'
[pdf2doi]: Performing google search with key "机器学习简介及其在短临天气预警中的应用(20200325)(1).pdf"
[pdf2doi]: and looking at the first 6 results...
[pdf2doi]: Looking for a valid identifier in the search result #1 : https://cloud.tencent.com/developer/article/1618154
[pdf2doi]: Looking for a valid identifier in the search result #2 : https://posts.careerengine.us/p/615528aa87641118b581e2e7
[pdf2doi]: Looking for a valid identifier in the search result #3 : http://agbigdata.aiijournal.com/CN/article/downloadArticleFile.do?attachType=PDF&id=15151
[pdf2doi]: Looking for a valid identifier in the search result #4 : http://dqkxxb.cnjournals.org/dqkxxb/article/html/20220501
[pdf2doi]: Validating the possible DOI 10.13878/j.cnki.dqkxxb.20210623003 via a query to dx.doi.org...
[pdf2doi]: The DOI 10.13878/j.cnki.dqkxxb.20210623003 is validated by dx.doi.org.
[pdf2doi]: A valid DOI was found with this google search.
[pdf2bib]: pdf2doi found a valid identifier for this paper. Trying to parse the data obtained by pdf2doi into valid BibTeX data..
Traceback (most recent call last):
  File "/mnt/d/ubuntu/Conda_2022/envs/py1/lib/python3.7/site-packages/pdfrenamer/main.py", line 125, in rename
    result = pdf2bib.pdf2bib_singlefile(filename)
  File "/mnt/d/ubuntu/Conda_2022/envs/py1/lib/python3.7/site-packages/pdf2bib/main.py", line 139, in pdf2bib_singlefile
    metadata = bibtex_makers.parse_bib_from_dxdoiorg(result['validation_info'], method=pdf2doi.config.get('method_dxdoiorg'))
  File "/mnt/d/ubuntu/Conda_2022/envs/py1/lib/python3.7/site-packages/pdf2bib/bibtex_makers.py", line 46, in parse_bib_from_dxdoiorg
    json_dict = json.loads(text)
  File "/mnt/d/ubuntu/Conda_2022/envs/py1/lib/python3.7/json/__init__.py", line 338, in loads
    s, 0)
json.decoder.JSONDecodeError: Unexpected UTF-8 BOM (decode using utf-8-sig): line 1 column 1 (char 0)

<traceback object at 0x7fa316e8c140>
[pdf-renamer]: Some unexpected error occured while using pdf2bib to process this file: Unexpected UTF-8 BOM (decode using utf-8-sig): line 1 column 1 (char 0)
Traceback (most recent call last):
  File "/mnt/d/ubuntu/Conda_2022/envs/py1/lib/python3.7/site-packages/pdfrenamer/main.py", line 125, in rename
    result = pdf2bib.pdf2bib_singlefile(filename)
  File "/mnt/d/ubuntu/Conda_2022/envs/py1/lib/python3.7/site-packages/pdf2bib/main.py", line 139, in pdf2bib_singlefile
    metadata = bibtex_makers.parse_bib_from_dxdoiorg(result['validation_info'], method=pdf2doi.config.get('method_dxdoiorg'))
  File "/mnt/d/ubuntu/Conda_2022/envs/py1/lib/python3.7/site-packages/pdf2bib/bibtex_makers.py", line 46, in parse_bib_from_dxdoiorg
    json_dict = json.loads(text)
  File "/mnt/d/ubuntu/Conda_2022/envs/py1/lib/python3.7/json/__init__.py", line 338, in loads
    s, 0)
json.decoder.JSONDecodeError: Unexpected UTF-8 BOM (decode using utf-8-sig): line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/d/ubuntu/Conda_2022/envs/py1/bin/pdfrenamer", line 8, in <module>
    sys.exit(main())
  File "/mnt/d/ubuntu/Conda_2022/envs/py1/lib/python3.7/site-packages/pdfrenamer/main.py", line 346, in main
    results = rename(target=target)
  File "/mnt/d/ubuntu/Conda_2022/envs/py1/lib/python3.7/site-packages/pdfrenamer/main.py", line 103, in rename
    result = rename(subfolder, format=format,tags=tags)
  File "/mnt/d/ubuntu/Conda_2022/envs/py1/lib/python3.7/site-packages/pdfrenamer/main.py", line 103, in rename
    result = rename(subfolder, format=format,tags=tags)
  File "/mnt/d/ubuntu/Conda_2022/envs/py1/lib/python3.7/site-packages/pdfrenamer/main.py", line 92, in rename
    result = rename(file, format=format, tags=tags)
  File "/mnt/d/ubuntu/Conda_2022/envs/py1/lib/python3.7/site-packages/pdfrenamer/main.py", line 163, in rename
    result['path_new'] = None
UnboundLocalError: local variable 'result' referenced before assignment
```

Johny-Leo commented 1 year ago

Perhaps you can do something like this for each individual call to pdf2doi in a Python script:

```python
import pdf2doi

# Inside the loop over the files to rename:
doi_dict = pdf2doi.pdf2doi(file_name_original + '.pdf')
if doi_dict['identifier'] is None:  # pdf2doi could not find a valid identifier
    continue
```

This skips the files for which no identifier is found and moves on to the next one, so the renaming step is skipped for those files as well.

Regards!

MicheleCotrufo commented 1 year ago

It looks like the issue is not generated by pdf2doi (in fact, it does find the DOI). Instead, it is generated when pdf2bib uses the json library to parse the raw data obtained from dx.doi.org. The raw data probably contains some "invalid" character. I can definitely add an additional try/except block in pdf2bib to make sure that the script keeps going.

Can you send me this specific pdf file, so that I can run more tests?
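
The last line of the first traceback reads `Unexpected UTF-8 BOM (decode using utf-8-sig)`, so the "invalid" character is most likely a byte order mark at the start of the response text. A minimal sketch of how such a response could be parsed without raising, assuming the raw data is available as a `str`. The helper name `parse_json_tolerating_bom` is hypothetical and not part of pdf2bib:

```python
import json
from typing import Any, Optional


def parse_json_tolerating_bom(text: str) -> Optional[Any]:
    """Parse JSON text, ignoring a leading UTF-8 byte order mark.

    Hypothetical helper for illustration only; not part of pdf2bib.
    Returns None when the text is not valid JSON.
    """
    # A UTF-8 BOM that was decoded as plain UTF-8 appears as the
    # single character U+FEFF at the start of the string.
    cleaned = text.lstrip("\ufeff")
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Signal "could not parse" to the caller instead of crashing.
        return None
```

Returning `None` rather than raising lets the caller decide whether to skip the file; whether that fits the actual pdf2bib code path is an assumption.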

SHEN-Cheng commented 1 year ago

I realized that the file is a slide deck saved as a PDF, so maybe this is why pdf2bib fails on it. I have another question: how can I skip such files by checking whether a PDF is a journal paper?

MicheleCotrufo commented 1 year ago

Unfortunately, there is no simple way to automatically check whether a pdf file is a journal paper. A pdf file has no unambiguous setting or property that tells you whether it is a paper or not. The library pdf2bib uses the library pdf2doi to try to associate a DOI with a given pdf file. If this is successful, pdf2bib assumes that the pdf file is indeed a journal paper corresponding to the DOI that was found. If not, it skips the file.

In this case, pdf2bib determined that the file is a valid paper because a DOI was associated with it. But then an error occurred when it tried to parse the raw text, probably because the raw text contains some unexpected character. I will implement some additional code to make sure that the file is skipped when a parsing error occurs.
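
A minimal sketch of this "skip on error" behaviour, under stated assumptions: `process_all` and the `process_file` callback are hypothetical names used only for illustration, and this is not the actual pdfrenamer code. The `UnboundLocalError` in the log suggests that `result` was only assigned inside the failing call, so the sketch creates the per-file record before the `try` block:

```python
import logging
from pathlib import Path
from typing import Callable, Dict, List, Optional

logger = logging.getLogger("rename-sketch")


def process_all(
    folder: Path,
    process_file: Callable[[Path], Optional[dict]],
) -> List[Dict]:
    """Apply process_file to every PDF under folder, skipping files that fail.

    Hypothetical helper for illustration; process_file stands in for
    whatever per-file metadata lookup is used.
    """
    results = []
    # rglob descends into all subfolders; sorting gives a stable order.
    for pdf_path in sorted(folder.rglob("*.pdf")):
        # Create the record before the try block, so it exists even
        # when process_file raises.
        record = {"path_original": str(pdf_path), "metadata": None, "error": None}
        try:
            record["metadata"] = process_file(pdf_path)
        except Exception as exc:  # broad on purpose: any per-file failure is skipped
            logger.warning("Skipping %s: %s", pdf_path, exc)
            record["error"] = str(exc)
        results.append(record)
    return results
```

Catching `Exception` broadly is a deliberate choice here, since the goal is that one bad file never stops the batch; the `error` field keeps the failure visible afterwards.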

MicheleCotrufo commented 1 year ago

I implemented several changes to pdfrenamer, pdf2bib and pdf2doi. I fixed several issues, such as making sure that the processing of pdf files does not stop if one "invalid" pdf file is encountered.

Do you mind installing the latest version of pdf-renamer via `pip install pdf-renamer==1.0rc9`? This should also automatically install `pdf2doi==1.5rc8` and `pdf2bib==1.1rc4`, but please double-check via `pip list`.

Can you then try to run pdfrenamer on the same folder that you tried earlier? (that is, the one containing pdf files with invalid characters?)

Thanks!

MicheleCotrufo commented 5 months ago

I haven't heard anything back, so I will assume it works and close this issue.