RESQUE-Framework / website

The Research Quality Evaluation Scheme
https://resque-framework.github.io/website/
MIT License

Extract information from PDF #14

Open alpkaanaksu opened 1 year ago

alpkaanaksu commented 1 year ago

Extract DOI (which can then be used to fetch relevant information)

Extract other relevant keywords / links (e.g. for detecting preregistration automatically)
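As a rough illustration of "fetch relevant information" from a DOI: the sketch below assumes the Crossref REST API as the metadata source, which this thread does not specify. `crossrefWorkUrl`, `parseCrossrefWork`, and the `PublicationInfo` shape are hypothetical names, and only a few fields of the response are read.

```typescript
// Sketch only: assumes Crossref (https://api.crossref.org) as the metadata source.
// All names below are hypothetical and not taken from the RESQUE code base.

interface CrossrefMessage {
  title?: string[];
  author?: { given?: string; family?: string }[];
  "container-title"?: string[];
  issued?: { "date-parts"?: number[][] };
}

interface PublicationInfo {
  title: string | null;
  authors: string[];
  journal: string | null;
  year: number | null;
}

// Build the lookup URL for a single work; the DOI is URL-encoded because it contains "/".
function crossrefWorkUrl(doi: string): string {
  return "https://api.crossref.org/works/" + encodeURIComponent(doi);
}

// Reduce a Crossref "work" response to the few fields a form might prefill.
function parseCrossrefWork(response: { message: CrossrefMessage }): PublicationInfo {
  const msg = response.message;
  const authors = (msg.author ?? []).map((a) =>
    [a.given, a.family].filter(Boolean).join(" ")
  );
  return {
    title: msg.title?.[0] ?? null,
    authors,
    journal: msg["container-title"]?.[0] ?? null,
    year: msg.issued?.["date-parts"]?.[0]?.[0] ?? null,
  };
}
```

The JSON obtained from `crossrefWorkUrl(doi)` (for example via `fetch`) could be passed directly to `parseCrossrefWork`. Missing fields come back as `null` or an empty list, so the form can leave them blank.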

alpkaanaksu commented 1 year ago

I implemented a first version (ee58a54572d078e90938870ff4e97910325712f7).

We open the PDF file, extract its text content, find all DOIs and assume that the first one is the one we are looking for.
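The "find all DOIs, take the first one" step could look roughly like the sketch below. It is not the code from the commit above: it assumes the PDF text has already been extracted into a string, and the regular expression and the helper names `findDois` and `extractFirstDoi` are illustrative assumptions.

```typescript
// Sketch only: hypothetical helpers, not the implementation from the commit.
// The pattern follows the common "10.<registrant>/<suffix>" shape of modern DOIs.
const DOI_PATTERN = /\b10\.\d{4,9}\/[-._;()\/:A-Za-z0-9]+/g;

// Return every DOI-like string found in the text, in order of appearance.
function findDois(text: string): string[] {
  const matches = text.match(DOI_PATTERN) ?? [];
  // Heuristic: trailing punctuation usually belongs to the sentence, not the DOI.
  return matches.map((m) => m.replace(/[.,;:)]+$/, ""));
}

// Assume the first DOI in the document is the publication's own DOI.
function extractFirstDoi(text: string): string | null {
  const dois = findDois(text);
  return dois.length > 0 ? dois[0] : null;
}
```

The "first DOI wins" assumption can fail when a reference or a publisher banner appears before the article's own DOI, and the trailing-punctuation cleanup would truncate a DOI that really ends in `)`. The extracted value is therefore best treated as a suggestion the user can correct.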

It should be fairly easy to implement some kind of keyword/link scan. Reading PDFs and extracting text is much easier than I expected.
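A keyword/link scan along these lines might be as small as the following sketch. The indicator list is an assumption chosen for illustration (a preregistration keyword plus links to OSF and AsPredicted) and is not an agreed-upon set; `scanText` and `ScanHit` are hypothetical names.

```typescript
// Sketch only: the indicators below are illustrative assumptions.
interface ScanHit {
  label: string; // which indicator fired
  match: string; // the text that triggered it
}

const INDICATORS: { label: string; pattern: RegExp }[] = [
  { label: "preregistration keyword", pattern: /\bpre-?regist(?:ered|ration)\b/gi },
  { label: "OSF link", pattern: /\bosf\.io\/[a-z0-9]+/gi },
  { label: "AsPredicted link", pattern: /\baspredicted\.org\/\S+/gi },
];

// Collect every indicator match in the extracted PDF text.
function scanText(text: string): ScanHit[] {
  const hits: ScanHit[] = [];
  for (const { label, pattern } of INDICATORS) {
    // With the "g" flag, String.prototype.match returns all matches (or null).
    for (const match of text.match(pattern) ?? []) {
      hits.push({ label, match });
    }
  }
  return hits;
}
```

A hit only shows that a term or link occurs somewhere in the text. A paper that merely cites a preregistered study would also match, so the result is better used to prefill or highlight a field than to decide it.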

https://github.com/nicebread/RESQUE/assets/68744864/cb826b3d-78cc-47b1-9e1a-092d72fbd695

nicebread commented 1 year ago

Wow, this is great!

A minor feature request: When you import a PDF, the focus should already be on the new publication (and not on the "author" tab).

A general question is how much "intelligence" we put into the webform and how much into the subsequent R analysis script. For example, one could attempt to fill in some fields using automatic tools such as:

https://github.com/quest-bih/oddpub

https://github.com/serghiou/rtransparent

(Both are R packages, but maybe there are alternatives for JS?)

alpkaanaksu commented 1 year ago

I doubt that there are similar packages for JS. But since both are open source, we could implement a minimal subset of their features in JS (or maybe even the whole thing, and publish it as a package).

alpkaanaksu commented 1 year ago

> A general question is how much "intelligence" we put into the webform and how much into the subsequent R analysis script.

Having just enough 'intelligence' in the webform to fill in as many fields as possible automatically should be enough.

A basic principle could be: If we ask for information which can be found in the text, we should try to extract it from the PDF.
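One way to read this principle is as a small registry that maps form fields to extractor functions, as sketched below. The field names, the `Extractor` type, and the `autofill` function are hypothetical. The sketch also assumes that values already entered by the user should never be overwritten.

```typescript
// Sketch only: hypothetical types and names, not part of the RESQUE webform.
type FormFields = Record<string, string>;
type Extractor = (text: string) => string | null;

// Fill every empty field for which an extractor finds a value in the PDF text.
function autofill(
  form: FormFields,
  text: string,
  extractors: Record<string, Extractor>
): FormFields {
  const filled: FormFields = { ...form };
  for (const field of Object.keys(extractors)) {
    if (filled[field]) continue; // keep anything the user already entered
    const value = extractors[field](text);
    if (value !== null) filled[field] = value;
  }
  return filled;
}
```

Adding support for a new question then means adding one extractor, and fields without a match simply stay empty for manual entry.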

alpkaanaksu commented 1 year ago

> When you import a PDF, the focus should already be on the new publication (and not on the "author" tab).

Done: d8cfa5b697ccd0cb497605027bf3aea9a1fa6e5f

nicebread commented 1 year ago

Maybe check https://github.com/CeON/CERMINE