MicheleCotrufo / pdf-renamer

A python tool to automatically rename the pdf files of scientific publications by looking up the publication metadata on the web.

Additional information available? #1

Closed github-throwaway closed 4 months ago

github-throwaway commented 2 years ago

Hey Michele,

love the tool! Super useful for my thesis! I've got a couple of questions, though.

  1. Do you also extract the keywords? I know the abstract is available, for example.
  2. Can I extract the type of publication? chapter, journal, book etc? Maybe through the bibtex type somehow?
  3. Can you also access the citation count?

MicheleCotrufo commented 2 years ago

Thank you for the feedback! pdf-renamer is mainly meant to be used as a command-line tool (or from the right-click menu) to rename files, so it does not expose the data it extracts in a way that is useful for the user (i.e. you cannot easily store that data).

I have another tool, pdf2bib, which extracts BibTeX info (in fact, pdf-renamer relies on the pdf2bib library, so if you installed pdf-renamer you already have pdf2bib installed). It can be used either as a command-line tool or inside a Python script.

You can see how to extract BibTeX data here

  1. The parsed fields returned by pdf2bib are found in the dictionary keys result['metadata'] and result['bibtex']. These fields do not contain the abstract (I might implement this in a later version). However, you can have a look at the field result['validation_info']. This contains raw BibTeX data returned by different archives. Depending on how the entry was stored in the online archive, sometimes this will contain the abstract, but often it doesn't. If you have the DOI of your publication (which can be extracted with pdf2doi, another library used by pdf-renamer), you can try querying certain archives to retrieve the abstract, but so far it looks like most of the stored entries do not have an abstract associated with them.

  2. This is a bit tricky. In principle, BibTeX has fields to specify the type of publication. However, pdf-renamer (and its sister libraries pdf2bib and pdf2doi) need to find an identifier (e.g. a DOI) of a PDF file before they can do anything useful with it. A chapter will not have a DOI. A book might, but not always (books have different identifiers).

  3. Not really, also because the citation count will often depend on which database you rely on. If you trust Google Scholar, that might be easy to implement on your own, see for example here
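
As a rough illustration of the parsing mentioned in point 1, here is a minimal sketch that looks for a single field (such as abstract or keywords) in a raw BibTeX string. It assumes that result['validation_info'] is a plain BibTeX string, as described above; the helper name extract_bibtex_field and the sample entry are made up for this example and are not part of pdf2bib. It is a naive scan, not a full BibTeX parser, so a dedicated parsing library may be a better choice for messy input.

```python
import re


def extract_bibtex_field(raw_bibtex, field):
    """Return the value of `field` in a raw BibTeX string, or None if absent.

    Hypothetical helper (not part of pdf2bib). Handles {braced}, "quoted"
    and bare values; nested braces inside a braced value are kept as-is.
    """
    match = re.search(r"\b" + re.escape(field) + r"\s*=\s*", raw_bibtex, re.IGNORECASE)
    if match is None:
        return None
    pos = match.end()
    if pos >= len(raw_bibtex):
        return None
    opener = raw_bibtex[pos]
    if opener == "{":
        # Walk forward until the brace opened at `pos` is closed again.
        depth = 0
        for i in range(pos, len(raw_bibtex)):
            if raw_bibtex[i] == "{":
                depth += 1
            elif raw_bibtex[i] == "}":
                depth -= 1
                if depth == 0:
                    return raw_bibtex[pos + 1:i].strip()
        return None  # unbalanced braces
    if opener == '"':
        end = raw_bibtex.find('"', pos + 1)
        return raw_bibtex[pos + 1:end].strip() if end != -1 else None
    # Bare value, e.g. `year = 2020`
    bare = re.match(r"[^,}\s]+", raw_bibtex[pos:])
    return bare.group(0) if bare else None


# Made-up entry standing in for result['validation_info']
sample = """@article{doe2020example,
  title = {An {Example} Title},
  year = 2020,
  keywords = {optics, photonics},
}"""

print(extract_bibtex_field(sample, "keywords"))
print(extract_bibtex_field(sample, "abstract"))  # field is missing in this entry
```

Since many archive entries carry neither abstract nor keywords, the caller should expect None in most cases.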

github-throwaway commented 2 years ago

So result['validation_info'] might also hold the keywords? Interesting.

You can still use pdf-renamer in a script fairly easily. I currently use it like so:

from pdfrenamer import rename

# folder_path: path to the folder containing the PDF files
result = rename(folder_path, format="{YYYY}-{J}-{A3etal}-{T}")

for entry in result:
    pass  # do stuff with each entry

MicheleCotrufo commented 2 years ago

Yes, you can definitely use pdf-renamer in a script. What I meant is that it's not necessarily user-friendly, and I did not document it. But glad that you figured it out :)

Indeed, the dictionary result contains several useful keys, essentially everything that was found out about the paper. result['validation_info'] contains raw bibtex data (which is used by pdf2bib to extract bibtex data and place it in a more elegant and readable format), so you can try parsing it to extract more data.
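
One piece of extra data that may be recoverable this way is the publication type, since a raw BibTeX entry starts with it (@article, @book, @inproceedings, ...). The sketch below assumes that each entry is a dictionary with a 'validation_info' key holding a raw BibTeX string (or nothing at all when validation failed); the function names and the sample entries are invented for illustration and are not part of pdf-renamer or pdf2bib.

```python
import re
from collections import Counter


def bibtex_entry_type(raw_bibtex):
    """Return the lowercased entry type of a raw BibTeX string, or None.

    Hypothetical helper: reads the leading '@type{' of the entry.
    """
    if not isinstance(raw_bibtex, str):
        return None
    match = re.match(r"\s*@(\w+)\s*\{", raw_bibtex)
    return match.group(1).lower() if match else None


def count_entry_types(entries):
    """Count entry types over an iterable of result dictionaries."""
    return Counter(bibtex_entry_type(e.get("validation_info")) for e in entries)


# Made-up data imitating the assumed shape of the returned entries
entries = [
    {"validation_info": "@article{a, title={x}}"},
    {"validation_info": "@InProceedings{b, title={y}}"},
    {"validation_info": None},
]

for entry_type, count in count_entry_types(entries).items():
    print(entry_type, count)
```

Entries without usable BibTeX data are counted under None, so they can be inspected separately.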