Add support for arXiv entries and non-PDF URLs

gcushen commented 3 years ago

The title says "Add support for URLs...", but the logic for URLs is already implemented in the software. Your PR changes the behaviour of the url entry logic so that the link gets a label named "URL" rather than "PDF". How do you propose to handle this change for existing users of the software who are using it for PDFs?

Can you link to where archiveprefix appears in the Bibtex spec?

Also, the PR is failing the checks and would need fixing before it can be merged.

k4rtik commented 3 years ago

Hi @gcushen, thanks for taking a look. I looked at your comment only after noticing the build failure and pushing the fix for that.

About the "URL" vs "PDF" label: I think the current choice to apply the PDF label to any URL, without validating that it points to a PDF, is incorrect. See, for example, a large bibliography database that I maintain at http://ks.cs.uchicago.edu/qpl-bib/. It is generated with the tool bibtex2html, which makes a better decision by choosing the label ".pdf", "http" or "https" depending on the kind of URL it encounters. I am slowly moving that bibliography over to a Hugo-based system running on wowchemy (https://quantumpl.github.io) and noticed these inconsistencies. Would you like that kind of smarter distinction to go along with this PR? (I believe that would also take care of your concern about backward compatibility.)

About the archiveprefix field: it is a non-standard bibtex field (just like url), most commonly used for arXiv e-prints, as you can see when exporting bibtex for any paper from their web interface. It is worth supporting in the tool; many large research communities, such as math, physics, and CS, depend on arXiv to provide archival (and open access) versions of their research. See for example https://arxiv.org/abs/1402.4467

gcushen commented 3 years ago

Yes, we should check if the URL ends in .pdf (case-insensitive) to maintain backward compatibility for users. Let's keep the labels user friendly though and not label links as tech protocols like HTTP and HTTPS.
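
For illustration, the check described above could look roughly like the following. This is a minimal sketch, not the code in the PR: the function name `link_label` is hypothetical, and testing the URL path (so that a trailing query string is ignored) is an assumption that goes slightly beyond "ends in .pdf".

```python
from urllib.parse import urlparse


def link_label(url: str) -> str:
    """Return "PDF" when the URL path ends in .pdf (any case), else "URL".

    Hypothetical helper for illustration only; the label strings follow
    the "PDF" vs "URL" distinction discussed in this thread.
    """
    # Assumption: compare against the path component so that a query
    # string such as "?download=1" does not hide the .pdf extension.
    path = urlparse(url).path
    return "PDF" if path.lower().endswith(".pdf") else "URL"
```

With this sketch, existing entries whose url points at a `.pdf` or `.PDF` file keep the "PDF" label, while a landing page such as https://arxiv.org/abs/1402.4467 gets "URL".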

If arxiv.org is generating bibtex with the non-standard archiveprefix field, then let's support it. The challenge for contributors and maintainers is that we are effectively creating a new (undocumented) Bibtex standard from all these non-standard fields rather than adhering to a clearly defined existing spec...

k4rtik commented 3 years ago

Alright, I don't see the option to convert this into a draft PR, but I will try and make the changes and let you know when it's ready for a potential merge.

I am not sure which clearly defined existing spec you are referring to. bibtex is really old; even the url field is non-standard. The future is certainly with biblatex, which I hope every major publisher starts supporting, but until then we need to stick with the norms of the major communities.

arXiv is large enough that biblatex provides aliases for new fields that it has introduced for arXiv compatibility, see sec. 3.14.7 Electronic Publishing Information at https://ctan.mirrors.hoobly.com/macros/latex/contrib/biblatex/doc/biblatex.pdf :

There are two aliases which ease the integration of arXiv entries. archiveprefix is treated as an alias for eprinttype; primaryclass is an alias for eprintclass. If hyperlinks are enabled, the eprint identifier will be transformed into a link to arxiv.org.
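
The alias behaviour quoted above can be sketched in a few lines. This is an illustrative assumption about how a converter might handle it, not the code in this PR: the function name `arxiv_url` and the plain-dict entry format are hypothetical, and the `https://arxiv.org/abs/<id>` pattern is taken from the example link earlier in this thread.

```python
from typing import Optional


def arxiv_url(entry: dict) -> Optional[str]:
    """Build an arxiv.org link from a parsed bibtex entry, if it has one.

    Hypothetical helper: `entry` is assumed to be a plain dict mapping
    field names to string values.
    """
    # Field names are matched case-insensitively, since arXiv exports
    # use mixed case such as "archivePrefix".
    fields = {key.lower(): value for key, value in entry.items()}

    # Treat archiveprefix as an alias for eprinttype, as biblatex does.
    eprint_type = fields.get("eprinttype") or fields.get("archiveprefix")
    eprint = fields.get("eprint")

    if eprint and eprint_type and eprint_type.strip().lower() == "arxiv":
        return f"https://arxiv.org/abs/{eprint.strip()}"
    return None
```

For an entry with `archivePrefix = {arXiv}` and `eprint = {1402.4467}`, this sketch yields https://arxiv.org/abs/1402.4467; entries without an arXiv eprint type return `None` and can fall back to the ordinary url handling.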

k4rtik commented 3 years ago

Hi @gcushen, I have made the change as you suggested. This PR is now ready for merge.

GetRD / academic-file-converter

Add support for arXiv entries and non-PDF URLs #97