Closed github-throwaway closed 4 months ago
Thank you for the feedback! pdf-renamer is mainly meant to be used as a command-line tool (or from the right-click menu) to rename files, so it does not extract data in a form that is useful to the user (i.e. you cannot easily store the data that it extracts).
I have another tool, pdf2bib, which extracts BibTeX info (in fact, pdf-renamer relies on the pdf2bib library, so if you installed pdf-renamer you already have pdf2bib installed). It can be used either as a command-line tool or inside a Python script.
You can see how to extract BibTeX data here
The parsed fields returned by pdf2bib are found in the dictionary keys result['metadata'] and result['bibtex']. These fields do not contain the abstract (I might implement this in a later version). However, you can take a look at the field result['validation_info']. This contains the raw BibTeX data returned by different archives. Depending on how the entry was stored in the online archive, this will sometimes contain the abstract, but often it doesn't.
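A minimal sketch of reading those three keys from a script. It assumes pdf2bib is installed and that pdf2bib.pdf2bib(path) returns the result dictionary described above; load_result and describe_result are hypothetical helper names, not part of the library.

```python
from typing import Any, Dict


def load_result(pdf_path: str) -> Dict[str, Any]:
    """Run pdf2bib on one file (assumes the pdf2bib package is installed)."""
    import pdf2bib  # imported lazily so describe_result works without it

    return pdf2bib.pdf2bib(pdf_path)


def describe_result(result: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the fields discussed above, tolerating missing keys."""
    raw = result.get("validation_info") or ""
    return {
        "metadata": result.get("metadata"),
        "bibtex": result.get("bibtex"),
        # Crude check only: the raw data may or may not include an abstract.
        "mentions_abstract": "abstract" in str(raw).lower(),
    }
```

Calling describe_result(load_result("paper.pdf")) on a real file would show quickly whether the raw archive data is worth parsing for an abstract.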
If you have the DOI of your publication (which can be extracted with pdf2doi, another library used by pdf-renamer), you can try querying certain archives to retrieve the abstract, but so far it looks like most of the stored entries do not have an abstract associated with them.
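As one possible way to try this, the sketch below queries the public Crossref REST API for a DOI and returns the abstract when the record has one. Crossref is an assumption here (the thread does not name a specific archive), and the function names are hypothetical. Many records have no abstract, in which case the result is None.

```python
import json
import re
import urllib.parse
import urllib.request
from typing import Optional


def crossref_url(doi: str) -> str:
    """Build the Crossref 'works' URL for a DOI."""
    return "https://api.crossref.org/works/" + urllib.parse.quote(doi)


def abstract_from_payload(payload: dict) -> Optional[str]:
    """Return the abstract from a Crossref JSON response, or None if absent."""
    raw = payload.get("message", {}).get("abstract")
    if not raw:
        return None
    # Crossref abstracts are usually wrapped in JATS XML tags; drop the tags
    # and collapse the remaining whitespace.
    text = re.sub(r"<[^>]+>", " ", raw)
    return " ".join(text.split())


def fetch_abstract(doi: str, timeout: float = 10.0) -> Optional[str]:
    """Query Crossref over the network and return the abstract, if any."""
    with urllib.request.urlopen(crossref_url(doi), timeout=timeout) as response:
        payload = json.load(response)
    return abstract_from_payload(payload)
```

Only fetch_abstract touches the network; the other two helpers can be tried offline with a saved response.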
This is a bit tricky. In principle, BibTeX has fields to specify the type of publication. However, pdf-renamer (and its sister libraries pdf2bib and pdf2doi) needs to find an identifier (e.g. a DOI) of a PDF file before it can do anything useful with it. A chapter will not have a DOI. A book might, but not always (books use different identifiers).
Not really, partly because the citation count often depends on which database you rely on. If you trust Google Scholar, that might be easy to implement on your own; see for example here
So result['validation_info'] might also hold the keywords? Interesting.
You can still use pdf-renamer in a script fairly easily. I currently use it like so:
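from pdfrenamer import rename

folder_path = "path/to/pdfs"  # placeholder: folder containing the PDFs to rename
result = rename(folder_path, format="{YYYY}-{J}-{A3etal}-{T}")
for entry in result:
    # do stuff with the info collected for each file
    print(entry)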
Yes, you can definitely use pdf-renamer in a script. What I meant is that it's not necessarily user-friendly and I did not document it. But I'm glad that you figured it out :)
Indeed, the dictionary result contains several useful keys, essentially everything that was found out about the paper.
result['validation_info'] contains the raw BibTeX data (which pdf2bib uses to extract BibTeX data and place it in a more elegant and readable format), so you can try parsing it to extract more data.
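A rough sketch of such parsing: the helper below pulls a single field (for example abstract or keywords) out of a BibTeX-style string. It assumes result['validation_info'] is a BibTeX-formatted string, which may not hold for every archive. extract_bibtex_field is a hypothetical name, not part of pdf2bib.

```python
import re
from typing import Optional


def extract_bibtex_field(raw: str, field: str) -> Optional[str]:
    """Return the value of `field` from a raw BibTeX string, or None."""
    match = re.search(r"\b" + re.escape(field) + r"\s*=\s*", raw, flags=re.IGNORECASE)
    if match is None:
        return None
    pos = match.end()
    if pos >= len(raw):
        return None
    opener = raw[pos]
    if opener == "{":
        # Brace-delimited value: walk forward until the braces balance.
        depth = 0
        for i in range(pos, len(raw)):
            if raw[i] == "{":
                depth += 1
            elif raw[i] == "}":
                depth -= 1
                if depth == 0:
                    return raw[pos + 1:i].strip()
        return None  # unbalanced braces
    if opener == '"':
        # Quote-delimited value: read up to the next double quote.
        end = raw.find('"', pos + 1)
        return raw[pos + 1:end].strip() if end != -1 else None
    # Bare value (e.g. a year): read up to the next comma, newline or brace.
    stop = re.search(r"[,\n}]", raw[pos:])
    value = raw[pos:pos + stop.start()] if stop else raw[pos:]
    return value.strip() or None
```

For example, extract_bibtex_field(result['validation_info'], 'keywords') returns None whenever the archive did not store keywords, so the caller can fall back to another source.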
Hey Michele,
love the tool! Super useful for my thesis! I've got a couple of questions, though.