MicheleCotrufo / pdf2doi

A python library/command-line tool to extract the DOI or other identifiers of a scientific paper from a pdf file.

TypeError: 'NoneType' object is not iterable #15

Closed Don-Yin closed 2 years ago

Don-Yin commented 2 years ago

There appears to be a type error in "finders.py" that only emerges on certain PDF files. This one, for example: paper12.2009_unknown_040916_440842.pdf

A minimal code snippet for reproducing this error:

from pathlib import Path
import pdf2doi

pdf2doi.config.set("verbose", False)
PDF_name = "paper12.2009_unknown_040916_440842.pdf"
results = pdf2doi.pdf2doi(str(Path("examples", PDF_name)))

Where the PDF is placed in the "examples" folder.

Here is the error message:

Traceback (most recent call last):
  File "/Users/donyin/Desktop/pdf2doi-master/main.py", line 15, in <module>
    results = pdf2doi.pdf2doi(str(Path("examples", i)))
  File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/main.py", line 90, in pdf2doi
    result = pdf2doi_singlefile(filename)
  File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/main.py", line 134, in pdf2doi_singlefile
    result = finders.find_identifier(file,method="document_infos",keysToCheckFirst=['/doi','/identfier'])
  File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/finders.py", line 548, in find_identifier
    identifier, desc, info = finder_methods[method](file,func_validate,**kwargs)
  File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/finders.py", line 586, in find_identifier_in_pdf_info
    identifier,desc,info = find_identifier_in_text(pdfinfo[key],func_validate)
  File "/Users/donyin/Desktop/pdf2doi-master/pdf2doi/finders.py", line 286, in find_identifier_in_text
    for identifier in identifiers:
TypeError: 'NoneType' object is not iterable

I thought I fixed this error by adding:

if identifiers is None:
    identifiers = []

at line 286 of your "finders.py", so that it becomes:

        #First we look for DOI
        for v in range(len(doi_regexp)):
            identifiers = extract_doi_from_text(text,version=v)
            if identifiers is None: # <- here
                identifiers = [] # <- here
            for identifier in identifiers:
                validation = func_validate(identifier,'doi')
                if validation: 
                    return identifier, 'DOI', validation

But this was a bit hacky and not the proper solution. You'd undoubtedly know more about what's going on, so I thought I'd let you know about this.

And by the way, there are some deprecated calls that you might want to address:

UserWarning: Page.extractText is deprecated and will be removed in PyPDF2 2.0.0. Use Page.extract_text instead. [_page.py:1003]
UserWarning: addMetadata is deprecated and will be removed in PyPDF2 2.0.0. Use add_metadata instead. [_writer.py:793]
UserWarning: Page.extractText is deprecated and will be removed in PyPDF2 2.0.0. Use Page.extract_text instead. [_page.py:1003]
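For code that has to run against both older and newer PyPDF2 releases, one option is a small wrapper that tries the new method name first and falls back to the deprecated one. This is only a sketch: the helper name `extract_page_text` is hypothetical, and the only names taken from the warnings above are `extract_text` and `extractText`.

```python
def extract_page_text(page):
    """Call the page's text-extraction method under whichever name it exposes."""
    # Prefer the new snake_case name reported in the warning.
    method = getattr(page, "extract_text", None)
    if method is None:
        # Fall back to the deprecated camelCase name on older releases.
        method = getattr(page, "extractText")
    return method()
```

If only recent PyPDF2 versions need to be supported, simply renaming the calls as the warnings suggest is enough and no wrapper is needed.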

cheers, Don

MicheleCotrufo commented 2 years ago

Thanks a lot for reporting this! Indeed, it is due to the bug that you pointed out. I am quite puzzled as to why this bug has "manifested" only now; I would have expected this situation to occur more often. Maybe I accidentally introduced it in the last version.

I fixed the bug by making sure that the functions extract_doi_from_text and extract_arxivID_from_text return an empty list instead of None when nothing is found. I also fixed the deprecated syntax that you pointed out. Thanks again!
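The pattern described here, returning an empty list rather than None when nothing matches, can be sketched roughly as follows. This is a minimal illustration and not the actual pdf2doi code: the function name `extract_doi_candidates` and the regular expression are assumptions made for the example.

```python
import re

# Hypothetical, simplified DOI pattern for illustration only;
# pdf2doi uses its own set of regular expressions.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+", re.IGNORECASE)


def extract_doi_candidates(text):
    """Return a list of DOI-like strings found in text (possibly empty, never None)."""
    if not text:
        # Covers None and empty strings, e.g. a missing metadata field.
        return []
    # re.findall already returns [] when there is no match.
    return DOI_PATTERN.findall(text)


# Callers can iterate unconditionally, with no None check.
for candidate in extract_doi_candidates("no identifier here"):
    print(candidate)
```

With this contract, a loop such as "for identifier in identifiers" simply runs zero times when nothing is found, so the caller does not need a guard.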

I released version 1.2rc1 on PyPI; would you mind testing it on this PDF file again?

pip install pdf2doi==1.2rc1

Don-Yin commented 2 years ago

Thank you for responding! It is now operational! Well done! Thank you for this fantastic project.

alexmaehon commented 2 years ago

same problem

MicheleCotrufo commented 2 years ago

> same problem

alexmaehon, do you still get this error after upgrading to version 1.2rc1? If yes, can you send me the PDF file and the full traceback?

MicheleCotrufo commented 2 years ago

> Thank you for responding! It is now operational! Well done! Thank you for this fantastic project.

Excellent, thanks for your help! I will release the 1.2 version later today, then.

MicheleCotrufo commented 2 years ago

This bug was fixed in version 1.2.