MicheleCotrufo / pdf2doi

A python library/command-line tool to extract the DOI or other identifiers of a scientific paper from a pdf file.

Pdf reading from file object rather than from path #11

Closed cmartinotti closed 2 years ago

cmartinotti commented 2 years ago

Hello,

Amazing tool, I love it! Is there a way to feed pdf2doi a file object rather than an absolute path? I'm asking because I am trying to modify an app deployed on Google Cloud services to incorporate pdf2doi, but I can't find a way that doesn't involve downloading the files to the local machine, which would be mildly inconvenient. The PDF files are stored on Google Cloud, and it would be more elegant to open them as file objects and manipulate them directly than to download them, run pdf2doi, and re-upload the results.

Thank you very much for your work!

MicheleCotrufo commented 2 years ago

Hi, thanks for your feedback! I can take a look into this and see if it can be implemented without messing up the current code too much. Can you post a self-contained piece of code here to illustrate how you'd access a file object?

cmartinotti commented 2 years ago

Hey thank you for your quick reply! Well in my mind something like this would be ideal:

file = googlelib.readfile(filename)        # equivalent to file = open(filename)
pdf_file = PyPDF2.PdfFileReader(file)      # OPTIONAL: I can do this step myself if needed
pdf2doi.pdf2doi(pdf_file)                  # or pdf2doi.pdf2doi(file) if the second step is redundant

MicheleCotrufo commented 2 years ago

I am not familiar with the library googlelib. Does googlelib.readfile(filename) return exactly the same kind of object returned by open(filename) ? Is there a way I can test the output of googlelib.readfile(filename) locally, on my computer?

cmartinotti commented 2 years ago

Same kind of file object. googlelib is just a personal library that I use to interact with Google Cloud; sorry for the confusion. You can assume it returns exactly the same kind of object that you get with open(filename).
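
For local testing without any cloud library, an in-memory `io.BytesIO` buffer is a reasonable stand-in for a file opened in binary mode, since it exposes the same `read`, `seek`, and `tell` methods. The sketch below assumes the cloud client can return the raw bytes of the PDF; `fake_pdf_bytes` is a hypothetical placeholder, not a valid PDF, and whether a given pdf2doi version accepts such a buffer is not confirmed in this thread.

```python
import io

# Hypothetical placeholder bytes standing in for a PDF fetched from cloud storage.
fake_pdf_bytes = b"%PDF-1.4\n% placeholder content, not a valid PDF\n"

# BytesIO wraps bytes in an object with the same binary-file interface
# as the object returned by open(path, "rb").
file_obj = io.BytesIO(fake_pdf_bytes)

header = file_obj.read(5)  # read the first five bytes
file_obj.seek(0)           # rewind so the next consumer starts at the beginning
```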

cmartinotti commented 2 years ago

I might have actually solved it through the use of the tempfile library! Still have to test it, will do tomorrow, but if you do:

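import tempfile
import pdf2doi

# Copy the file object's bytes into a named temporary file, then pass its path to pdf2doi.
with open(file_path, "rb") as f_in, tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
    tmp.write(f_in.read())
    tmp.flush()  # make sure the bytes are on disk before pdf2doi opens the path
    pdf2doi.pdf2doi(tmp.name)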

It should actually work fine!
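
The temporary-file workaround can be wrapped in a small reusable helper. This is only a sketch: `as_temp_pdf_path` is a hypothetical name, and it writes into a `tempfile.TemporaryDirectory` instead of a `NamedTemporaryFile` because reopening a still-open named temporary file by its path can fail on Windows with the default settings.

```python
import contextlib
import io
import os
import shutil
import tempfile

@contextlib.contextmanager
def as_temp_pdf_path(file_obj):
    """Copy a binary file object to a temporary .pdf file and yield its path."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        tmp_path = os.path.join(tmp_dir, "input.pdf")
        with open(tmp_path, "wb") as f_out:
            shutil.copyfileobj(file_obj, f_out)  # copies in chunks
        yield tmp_path  # the directory and the file are deleted on exit

# Usage with an in-memory stand-in for a cloud file object.
source = io.BytesIO(b"%PDF-1.4\nplaceholder\n")
with as_temp_pdf_path(source) as pdf_path:
    with open(pdf_path, "rb") as check:
        copied = check.read()
    # pdf2doi.pdf2doi(pdf_path) would be called here.
```

The path is only valid inside the `with` block, so any call that needs it has to happen before the block exits.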

MicheleCotrufo commented 2 years ago

Great that you figured that out! I might still try to implement this functionality because it might speed up the library: right now I have several functions that can find the DOI, and each of them receives the file path and opens it separately. So the same file gets opened more than once, which probably slows things down. I'll let you know if/when I implement this.
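
The idea of opening the file once and handing the same object to several finder functions can be sketched as follows. The two finders, the `DOI_PATTERN` regular expression, and `first_result` are hypothetical illustrations and not pdf2doi's actual methods; the point is only that each finder rewinds the shared object with `seek(0)` instead of reopening the path.

```python
import io
import re

# Simplified DOI-like pattern, for illustration only.
DOI_PATTERN = re.compile(rb'10\.\d{4,9}/[^\s"<>]+')

def find_in_first_kb(file_obj):
    """Hypothetical finder: search only the first 1024 bytes."""
    file_obj.seek(0)
    match = DOI_PATTERN.search(file_obj.read(1024))
    return match.group(0).decode() if match else None

def find_anywhere(file_obj):
    """Hypothetical finder: search the whole content."""
    file_obj.seek(0)
    match = DOI_PATTERN.search(file_obj.read())
    return match.group(0).decode() if match else None

def first_result(file_obj, finders):
    """Try each finder on the same open file object until one succeeds."""
    for finder in finders:
        result = finder(file_obj)
        if result is not None:
            return result
    return None

# The file object is created once and shared by both finders.
shared = io.BytesIO(b"x" * 2000 + b" doi:10.1234/example.5678 ")
found = first_result(shared, [find_in_first_kb, find_anywhere])
```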

MicheleCotrufo commented 2 years ago

Actually, it turned out that it wasn't too difficult to add this functionality; I only had to change a few lines of code. Do you mind testing it? You can download the modified version via pip install pdf2doi==1.1rc1

In order to avoid messing too much with the code, I did not implement the change at the level of the pdf2doi.pdf2doi function, but a bit deeper, in the pdf2doi.pdf2doi_singlefile function. This is an internal function that gets called by the function pdf2doi.pdf2doi . In the previous version, pdf2doi.pdf2doi_singlefile accepted only a string (the file path) as an input argument. In the new version, the input argument can be either a string or an opened file object.

import pdf2doi

with open(path, 'rb') as file:
    doi = pdf2doi.pdf2doi_singlefile(file)
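
One common way to let a function accept either a path string or an already-open file object is to normalize the input in a small context manager. This is a generic sketch, not the code used inside pdf2doi; `opened_binary` and `read_header` are hypothetical names.

```python
import contextlib
import io
import os
import tempfile

@contextlib.contextmanager
def opened_binary(source):
    """Yield a binary file object for a path or an already-open file object."""
    if isinstance(source, (str, os.PathLike)):
        # We opened it, so we close it when the caller is done.
        with open(source, "rb") as handle:
            yield handle
    else:
        # The caller opened it, so the caller stays responsible for closing it.
        yield source

def read_header(source, size=5):
    """Hypothetical consumer that works the same way for both input kinds."""
    with opened_binary(source) as handle:
        return handle.read(size)

# The same call works with a path and with an in-memory file object.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.pdf")
    with open(path, "wb") as f_out:
        f_out.write(b"%PDF-1.7\n")
    header_from_path = read_header(path)

buffer = io.BytesIO(b"%PDF-1.7\n")
header_from_object = read_header(buffer)
```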

Keep in mind that, unlike pdf2doi.pdf2doi, the function pdf2doi.pdf2doi_singlefile does not perform any sanity checks on its input, such as checking whether the input is a valid path to a valid PDF file, or whether it is a file rather than a folder.
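
Since the internal function skips these checks, a caller could run a few of its own beforehand. The sketch below is an assumption about what such checks might look like, not a copy of what pdf2doi.pdf2doi does; `check_pdf_path` is a hypothetical helper, and testing for the `%PDF-` header is only a cheap heuristic.

```python
import os
import tempfile

def check_pdf_path(path):
    """Return None if path looks like a PDF file, else a short error message."""
    if not os.path.exists(path):
        return "path does not exist"
    if os.path.isdir(path):
        return "path is a folder, not a file"
    with open(path, "rb") as handle:
        if handle.read(5) != b"%PDF-":
            return "file does not start with the PDF header"
    return None

# Exercise the three outcomes with throwaway files.
with tempfile.TemporaryDirectory() as tmp_dir:
    good = os.path.join(tmp_dir, "good.pdf")
    with open(good, "wb") as f_out:
        f_out.write(b"%PDF-1.4\n")
    result_good = check_pdf_path(good)
    result_folder = check_pdf_path(tmp_dir)
    result_missing = check_pdf_path(os.path.join(tmp_dir, "missing.pdf"))
```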

Can you maybe test it a bit and let me know if you get any errors? I haven't had much time to test it yet

cmartinotti commented 2 years ago

Hello, sorry for the delay, but I got Covid and was out of the game for a while. I tried the version that you provided and it seems to work! Thank you so much! I have found another problem with pdf2doi, but I'm going to open a different issue :)

MicheleCotrufo commented 2 years ago

Hope that you recovered well! Glad that it works; please let me know if you find any bugs in the latest rc version.