IllDepence / unarXive

A data set based on all arXiv publications, pre-processed for NLP, including structured full-text and citation network
MIT License

FORMULAS #2

Closed by LuCeHe 4 years ago

LuCeHe commented 4 years ago

Hi, congrats on the very nice work!

I saw that the formulas are not included in the data set, and I think many people would be interested in that part of the articles. If you don't have access to the original LaTeX, you could consider using this tool

https://mathpix.com/

to extract the formulas from any format, for example from the PDF of the article.

IllDepence commented 4 years ago

Hi, thank you for the input. The data set is derived from the LaTeX source files, which are parsed using Tralics. Given that a large portion of arXiv consists of physics and mathematics documents (and therefore contains a lot of formulas), we did consider retaining formulas in some form. However, as the primary focus during development was on the textual contexts of citation markers, we opted to replace formulas with a simple placeholder token for the moment. Keeping formulas in some markup format (e.g. MathML) or as the original LaTeX, or placing links to these contents in the generated plain text files, is an option we're considering for the future, though.
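
To illustrate the placeholder idea, here is a minimal sketch in Python. It is not the actual pipeline, which works on Tralics output rather than on raw LaTeX; the token name `FORMULA`, the function `replace_formulas`, and the regular expression are all assumptions made for this example. The pattern only covers the common delimiters `$...$`, `$$...$$`, `\(...\)` and `\[...\]`, not environments such as `equation` or `align`.

```python
import re

# Hypothetical placeholder token; the token used in the real data set may differ.
PLACEHOLDER = "FORMULA"

# Matches common LaTeX math delimiters. Display forms are tried before
# inline "$...$" so that "$$...$$" is not split into two inline spans.
_MATH_PATTERN = re.compile(
    r"\$\$.+?\$\$"                # display math: $$ ... $$
    r"|\\\[.+?\\\]"               # display math: \[ ... \]
    r"|\\\(.+?\\\)"               # inline math:  \( ... \)
    r"|(?<!\\)\$.+?(?<!\\)\$",    # inline math:  $ ... $ (skips escaped \$)
    re.DOTALL,
)


def replace_formulas(text: str, placeholder: str = PLACEHOLDER) -> str:
    """Return `text` with each matched math span replaced by `placeholder`."""
    return _MATH_PATTERN.sub(placeholder, text)


if __name__ == "__main__":
    sample = r"As shown in [12], the energy is $E = mc^2$ for a body at rest."
    print(replace_formulas(sample))
```

Swapping the placeholder for the matched LaTeX itself, or for an identifier that links to a separately stored formula, would correspond to the options mentioned above.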