[BUG] Cannot import PDF files

brant-ruan commented 7 months ago

Describe the bug

Occasionally, some papers (in PDF format) cannot be imported into Paperlib even though they open fine in PDF readers. Sometimes Paperlib reports an "invalid PDF" message in the log at the bottom left, but not always.

For example, the paper below cannot be imported, either by downloading it and importing it manually or by importing it with the Chrome plugin:

https://mengrj.github.io/files/CCS23.pdf

To Reproduce

1. Download the PDF at https://mengrj.github.io/files/CCS23.pdf
2. Drag the file and drop it into Paperlib.
3. The paper cannot be found after Paperlib finishes parsing.
4. See error log (if any).

System

OS: MacOS
Paperlib Version 2.2.6

GeoffreyChen777 commented 7 months ago

Hi, I cannot reproduce this issue. Can you provide the error notification?

I can import this paper into my library, but the metadata is wrong. The reason is that the paper contains an invalid placeholder DOI: https://doi.org/10.1145/nnnnnnn.nnnnnnn

brant-ruan commented 7 months ago

Thanks for pointing out the issue.

Using advanced search, I found another paper in my library that contains https://doi.org/10.1145/nnnnnnn.nnnnnnn. This does not seem to be common, but whenever a published paper still carries this placeholder DOI, Paperlib treats it as the same paper, namely "TestRank: Bringing Order into Unlabeled Test Instances for Deep Learning Tasks" (NeurIPS).

In this special case, I cannot fix the problem by editing the DOI and running the scraper again, because Paperlib still reads the DOI from the PDF and fetches the metadata with it.
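The collision described above can be illustrated with a small sketch. It assumes, purely for illustration, that a scraper takes the first DOI-like string found in the PDF text and uses it as the lookup key. The regex, the `extract_doi` and `lookup_title` helpers, and the in-memory `records` dictionary are all hypothetical and are not Paperlib's actual code.

```python
import re
from typing import Optional

# Loose DOI pattern: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")


def extract_doi(text: str) -> Optional[str]:
    """Return the first DOI-like string in the text, or None if there is none."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Trim punctuation that commonly trails a DOI at the end of a sentence.
    return match.group(0).rstrip(".,;)")


def lookup_title(text: str, records: dict) -> Optional[str]:
    """Look up a title using only the DOI found in the text (hypothetical store)."""
    doi = extract_doi(text)
    return records.get(doi) if doi is not None else None


# Hypothetical metadata store keyed by DOI, holding one record for the placeholder.
records = {
    "10.1145/nnnnnnn.nnnnnnn": "TestRank: Bringing Order into Unlabeled "
    "Test Instances for Deep Learning Tasks",
}

# Two unrelated PDFs that both still carry the template placeholder DOI.
first_pdf_text = "Some other paper. https://doi.org/10.1145/nnnnnnn.nnnnnnn"
second_pdf_text = "Yet another paper, DOI: 10.1145/nnnnnnn.nnnnnnn."

# Both lookups resolve to the same record, which is the mix-up reported here.
print(lookup_title(first_pdf_text, records) == lookup_title(second_pdf_text, records))
```

Because the DOI is read from the file itself in this sketch, editing the stored DOI by hand would not change the result of the next scrape, which matches the behavior reported above.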

GeoffreyChen777 commented 7 months ago

For now, please edit the metadata of papers with such DOIs manually.

Tomorrow I have a conference deadline. After that, I will investigate and fix this issue ASAP.

brant-ruan commented 7 months ago

Thanks. Wishing you all the best for your paper's acceptance at the conference :-)

GeoffreyChen777 commented 7 months ago

@brant-ruan Hi, this issue has been fixed now.

I implemented an invalid-DOI check in the metadata server. However, we can currently only get the title and author list of this paper. It is a very recent publication, and no database has a record of it yet.
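A minimal sketch of what such a check could look like, assuming it only needs to reject malformed DOIs and template placeholders. The function name `is_usable_doi` and both patterns are hypothetical; the thread does not describe the metadata server's actual rules. Treating an all-`X` suffix as a placeholder is an additional assumption beyond the `nnnnnnn.nnnnnnn` case reported here.

```python
import re

# Syntactic shape of a DOI: "10.<registrant code>/<suffix>".
DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")

# Suffixes made up only of filler characters, e.g. "nnnnnnn.nnnnnnn".
# Accepting "X" as filler too is an assumption, not something from this thread.
PLACEHOLDER_SUFFIX = re.compile(r"^[nNxX]+(?:[.\-/][nNxX]+)*$")


def is_usable_doi(doi: str) -> bool:
    """Return True if the DOI is well-formed and not a template placeholder."""
    doi = doi.strip()
    # Drop a resolver prefix such as "https://doi.org/" if present.
    doi = re.sub(r"^https?://(?:dx\.)?doi\.org/", "", doi, flags=re.IGNORECASE)
    if DOI_SHAPE.match(doi) is None:
        return False
    suffix = doi.split("/", 1)[1]
    return PLACEHOLDER_SUFFIX.match(suffix) is None


print(is_usable_doi("https://doi.org/10.1145/nnnnnnn.nnnnnnn"))  # placeholder
print(is_usable_doi("10.1234/example.5678"))  # made-up but well-formed DOI
```

When the check fails, a scraper could fall back to matching by title and authors instead of by DOI, which fits the note above that only the title and author list are available for this paper.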

For conference papers, we commonly need to wait six months to a year before those databases index them. I usually collect recently accepted papers in my own research field and insert them into the metadata server database manually, but I cannot do that for every research field.

I think it might be a good idea to create a GitHub repo that stores lists of publications and to have the metadata server read from it. Letting users submit a list of papers through a pull request should be workable.

Best wishes.

brant-ruan commented 7 months ago

Agree.

A common situation (at least for me) is that a search engine returns two download sources for a paper: 1) the publication database and 2) the author's academic home page or the institution's page. The second source sometimes hosts a pre-publication version that is never updated and has no valid DOI. Since the content from both sources is usually identical, and the second source often becomes available before the database entry does, I download from it.

The GitHub repo idea is great. I would be glad to contribute entries for the information security field.

GeoffreyChen777 commented 5 months ago

Hi @brant-ruan, I've created a GitHub repository for the community to contribute to the metadata database.

https://github.com/Future-Scholars/paperlib-community-metadata-collection

Just create a JSON file containing the metadata you want to add and open a PR against this repo.

Once the PR is merged, the data in the JSON file will be inserted into our metadata database.

After that, you can scrape the corresponding metadata in Paperlib.
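As a rough illustration, the sketch below builds such a JSON file with Python. The field names (`title`, `authors`, `venue`, `year`) and the choice of required fields are assumptions made for this example only. The actual schema is whatever the repository above documents, so check it before opening a PR. The author names are placeholders, not the real author list.

```python
import json

# Hypothetical minimum set of fields; the real schema is defined by the repository.
REQUIRED_FIELDS = ("title", "authors")


def build_entry(title, authors, venue=None, year=None):
    """Build one metadata record, dropping optional fields that were not supplied."""
    entry = {"title": title, "authors": list(authors), "venue": venue, "year": year}
    entry = {key: value for key, value in entry.items() if value is not None}
    missing = [field for field in REQUIRED_FIELDS if not entry.get(field)]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return entry


entries = [
    build_entry(
        title="TestRank: Bringing Order into Unlabeled Test Instances "
        "for Deep Learning Tasks",
        authors=["First Author", "Second Author"],  # placeholders only
        venue="NeurIPS",
    )
]

# Serialize the list as it might appear in the contributed file.
payload = json.dumps(entries, indent=2, ensure_ascii=False)
print(payload)
```

Validating required fields before serializing keeps obviously incomplete records out of a submission, which should make review of the PR easier.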

brant-ruan commented 5 months ago

Great! Thanks @GeoffreyChen777, I will check it out after the submission deadline :-)

Future-Scholars / paperlib

[BUG] Cannot import PDF files #310