Open lab156 opened 5 years ago
Hi @lab156 , accepting only ZIP archives has been the behavior since we added the unpack functionality. At the time this seemed like the reliable standard choice in Perl's ecosystem for me to use - if we find a module that can work with arbitrary archives I would be happy to port over to that (or accept a PR for that).
I've worked with libarchive in the past in Ruby, but haven't checked whether there is a good CPAN module for it, plus it tends to bring along some installation challenges...
In any case, you are correct that only ZIP is supported by the unpack_source utility.
Ah, if you are working with arXiv sources, you simply have to have your own preprocessing, because they are not ready for immediate ingestion by latexml. Some of the sources are plain .gz files containing a single file, for example.
I can offer my CorTeX code which unpacks and repackages as ZIP the arXiv sources as an example of using libarchive to both unpack and repackage: https://github.com/dginev/CorTeX/blob/master/src/importer.rs#L121-L249
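For readers who prefer not to dig through the Rust code, here is a minimal Python sketch of the same idea: take a gzipped arXiv-style source and repackage it as a ZIP that unpack_source can accept. It is not the CorTeX implementation. The function name `repackage_as_zip` and the fallback entry name `main.tex` are hypothetical choices for illustration, and it assumes the input is either a .tar.gz or a single gzipped file.

```python
import gzip
import io
import tarfile
import zipfile


def repackage_as_zip(data: bytes, fallback_name: str = "main.tex") -> bytes:
    """Return ZIP bytes holding the contents of a .tar.gz or a single-file .gz.

    `fallback_name` is a hypothetical name used for a bare gzipped file,
    since a plain .gz payload is treated here as one unnamed source file.
    """
    payload = gzip.decompress(data)  # both cases are gzip on the outside
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        try:
            # "r:" means: read as an uncompressed tar, no auto-detection.
            with tarfile.open(fileobj=io.BytesIO(payload), mode="r:") as tar:
                for member in tar:
                    if member.isfile():
                        zf.writestr(member.name, tar.extractfile(member).read())
        except tarfile.ReadError:
            # Not a tarball: store the payload as one bare source file.
            zf.writestr(fallback_name, payload)
    return out.getvalue()
```

Member names are copied into the ZIP unchecked, so a real pipeline would probably want to reject absolute paths and `..` components before writing.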
Also, while I'm at it: I do not want to discourage you from building your own arXiv conversion service with latexml, but I don't recall whether I have asked before why you'd like to create one. If you are looking for an HTML5 dataset for arXiv converted by latexml, I would like to advertise the one we've bundled ourselves, which you can find here: https://sigmathling.kwarc.info/resources/arxmliv-dataset-082018/
But I indeed don't mean to discourage other experiments running latexml over arxiv, just trying to see if there are easy wins here.
Thanks, I already have access to the SigMathLing data and we have considered using it in the Formal Abstracts project. We have also found the GloVe embedding files very useful.
The main reason for learning how to convert arXiv articles is to understand how it works. Also, not everybody in our team has signed the SigMathLing NDA. We are also interested in just the subset of math articles, and the .xml output seems enough for our purposes, so we probably do not need any further processing.
Anyway, this is very much a work in progress and hence I really appreciate all your observations.
Great, thanks for the explanation! I'll leave the issue open for now as an enhancement. It could be an easy entry-level task for a new contributor to simply mirror the ZIP functionality with a module such as Archive::Tar to also support .gz and .tar.gz inputs.
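Supporting several input formats first requires telling them apart. As a rough illustration only (in Python rather than Perl, and not a proposal for the actual Pack.pm code), the dispatch could sniff well-known magic bytes: `PK\x03\x04` for a ZIP local file header, `\x1f\x8b` for gzip, and `ustar` at offset 257 of a tar header. The function name `sniff_archive_type` is hypothetical.

```python
import gzip
import io


def _looks_like_tar(head: bytes) -> bool:
    # POSIX and GNU tar headers both carry "ustar" at byte offset 257.
    # Very old (v7) tarballs have no magic and would be missed here.
    return head[257:262] == b"ustar"


def sniff_archive_type(data: bytes) -> str:
    """Classify raw bytes as 'zip', 'tar.gz', 'gz', 'tar' or 'unknown'."""
    if data[:4] == b"PK\x03\x04":
        return "zip"
    if data[:2] == b"\x1f\x8b":
        # Decompress only the first tar-sized block to look inside.
        with gzip.GzipFile(fileobj=io.BytesIO(data)) as gz:
            head = gz.read(512)
        return "tar.gz" if _looks_like_tar(head) else "gz"
    if _looks_like_tar(data[:512]):
        return "tar"
    return "unknown"
```

Checking content rather than the file extension matters for arXiv downloads, where the file name does not reliably say whether the payload is a tarball or a single gzipped file.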
Yeah, I'm looking into it, but my Perl is not great.
I was thinking of also adding a function that works for uncompressed directories. It selects the main file and optionally compresses the directory.
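To make the idea concrete, here is a small Python sketch of such a helper, done as preprocessing outside of LaTeXML. The heuristic (the first .tex file, in sorted order, with an uncommented `\documentclass` line) is my own assumption and not necessarily how LaTeXML picks the main file; `find_main_tex` and `zip_directory` are hypothetical names.

```python
import re
import zipfile
from pathlib import Path
from typing import Optional

# An uncommented \documentclass at the start of a line marks a candidate.
DOCUMENTCLASS = re.compile(rb"^\s*\\documentclass", re.MULTILINE)


def find_main_tex(directory: Path) -> Optional[Path]:
    """Return the first .tex file (sorted order) declaring a document class."""
    for path in sorted(directory.rglob("*.tex")):
        if DOCUMENTCLASS.search(path.read_bytes()):
            return path
    return None


def zip_directory(directory: Path, archive: Path) -> Path:
    """Write every file under `directory` into a ZIP at `archive`."""
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(directory.rglob("*")):
            # Skip the archive itself in case it lives inside the directory.
            if path.is_file() and path.resolve() != archive.resolve():
                zf.write(path, path.relative_to(directory).as_posix())
    return archive
```

Calling `find_main_tex(Path("paper"))` followed by `zip_directory(Path("paper"), Path("paper.zip"))` would give both the candidate main file and a ZIP to hand to the existing ZIP workflow.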
How can I call unpack_source with the "literal:" protocol? I have tried everything I can think of and could not make it work. It seems to be possible according to this: https://github.com/brucemiller/LaTeXML/blob/22db863d7358d56e197a3845375775714577cc82/lib/LaTeXML/Util/Pack.pm#L30-L35
From what I remember I have an example of using this capacity in the showcase web service. Its upload path in particular transfers over a zip's binary data this way, here is a line link:
https://github.com/dginev/LaTeXML-Plugin-ltxmojo/blob/master/lib/LaTeXML/Plugin/LtxMojo.pm#L97
Live demo is at: https://latexml.mathweb.org/upload
but you can also clone that repository and play around.
@lab156 did you end up doing .tar.gz conversions with latexml some time back? While the general feature sounds nice-to-have, it would be great if it also had a driver. I'm still managing with just the .zip capability, but it's also a force of habit at this point.
Same here, I figured out how to work with .zip archives and moved on. I still think this would be useful, and I might pick it up eventually.
Thanks for letting me know, that gets it downgraded to the "Future (if)" milestone for now :>
I definitely agree it's a "nice to have", but without a compelling need we'll return to it when we get some "nice to have" spare time. Good to hear the ZIP flow is usable on your end!
Hi, I am trying to use the unpack_source subroutine from LaTeXML::Util::Test on a gzipped tar file, but it produces an I/O error:
Some of the error messages are:
and
When I point $dir to a .zip archive it works, but I am trying to work with tar files downloaded from the arXiv website, so I cannot control the compression method being used. This means I am stuck using gzipped archives. Is there a way to use the unpack_source function on tar.gz files, or is this not implemented yet?
I am using the latest version (LaTeXML version 0.8.3; revision 22db863d)