IllDepence / unarXive

A data set based on all arXiv publications, pre-processed for NLP, including structured full-text and citation network
MIT License

cleanup converter comments #10

Closed dginev closed 1 year ago

dginev commented 1 year ago

Hi @IllDepence !

Very interesting work, thank you for contributing it openly. I spotted two outdated comments about a converter I co-develop, so I thought I'd send a quick PR phasing those out.

As an aside: I am quite interested in comparing the quality of XML/HTML produced by different converters over arXiv, especially now that multiple datasets are getting published. Maybe we are a step closer to getting such a comparison started with unarXive presenting the Tralics output.

IllDepence commented 1 year ago

Oh wow, nice catch. That comment has been there since Oct 2018. Thanks for the PR.

During the development of unarXive we tested both LaTeXML and Tralics and found both to be suitable for our purposes in terms of output quality. While LaTeXML seemed more mature and actively maintained, Tralics was considerably faster, which is why we ended up using it (see Section 6 and Table 3 in our initial journal article).

dginev commented 1 year ago

"found both to be suitable for our purposes in terms of output quality."

Right - this is the part I had in mind when I said a large-scale "quality comparison" could be beneficial. I expect any research team can only do so much in vetting arXiv conversions, given that there are 2 million documents to inspect.

Usually people who compare the converters don't have the systematic means to draw conclusions over the entirety of arXiv. It is still difficult for us to gauge how well LaTeXML does on its own, and we've been trying for some time.

I co-authored one such comparison study (by now completely outdated) back in 2009, see here. And even back then we were left wanting a more thorough evaluation. Anyhow, I just wanted to air the thought - this is not an actual request targeted at anyone.