brucemiller / LaTeXML

LaTeXML: a TeX and LaTeX to XML/HTML/ePub/MathML translator.
http://dlmf.nist.gov/LaTeXML/

Archive conversions of .tar.gz sources #1091

Open lab156 opened 5 years ago

lab156 commented 5 years ago

Hi, I am trying to use the unpack_source subroutine from LaTeXML::Util::Pack on a gzipped tar file, but it produces an I/O error:

#!/usr/bin/perl
use strict;
use warnings;

use arXiv;
use arXiv::FileGuess qw(guess_file_type);
use LaTeXML::Util::Pack qw(unpack_source);

my $dir    = '/path/to/archive';
my $extdir = '/path/to/extdir';

# Report the detected file type, then try to unpack the archive.
my @msg = guess_file_type($dir);
print "$msg[0]\n";
print unpack_source($dir, $extdir);

Some of the error messages are:

format error: can't find EOCD signature 

and

Fatal:I/O:Archive Can't read in source archive: 

When I point $dir to a .zip archive it works, but I am trying to work with tar files downloaded from the arXiv website, so I cannot control the compression method being used. This means I am stuck using gzipped archives. Is there a way to use the unpack_source function on tar.gz files, or is this not implemented yet?

I am using the latest version (LaTeXML version 0.8.3; revision 22db863d)

dginev commented 5 years ago

Hi @lab156, accepting only ZIP archives has been the behavior since we added the unpack functionality. At the time this seemed like the reliable standard choice in Perl's ecosystem for me to use. If we find a module that can work with arbitrary archives, I would be happy to port over to that (or accept a PR for that).

I've worked with libarchive in the past in Ruby, but haven't checked whether there is a good CPAN module for it, plus it tends to bring along some installation challenges...

In any case, you are correct that only ZIP is supported by the unpack_source utility.

dginev commented 5 years ago

Ah, if you are working with arXiv sources, you simply have to have your own preprocessing, because they are not ready for immediate ingestion by latexml - some of the files are simple .gzs with a single file for example.
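
A minimal sketch of what the first step of such preprocessing could look like: classifying a downloaded source by its leading bytes, so that a single-file .gz is not mistaken for a .tar.gz (or a ZIP). The helper names sniff_source_type and has_ustar_magic are hypothetical and not part of LaTeXML; the sketch also assumes tar archives carry the POSIX "ustar" magic, which older tar variants may lack.

```perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);

# True when a tar header block carries the POSIX "ustar" magic at offset 257.
# Assumption: pre-POSIX (v7) tar files lack this magic and are not detected.
sub has_ustar_magic {
  my ($block) = @_;
  return length($block) >= 262 && substr($block, 257, 5) eq 'ustar';
}

# Hypothetical helper (not part of LaTeXML): classify a source file by content.
# Returns one of 'zip', 'tar', 'tar.gz', 'gz' or 'unknown'.
sub sniff_source_type {
  my ($path) = @_;
  open(my $fh, '<:raw', $path) or die "Can't open $path: $!\n";
  my $head = '';
  defined(read($fh, $head, 512)) or die "Can't read $path: $!\n";
  close($fh);
  # ZIP: local file header signature, or end-of-central-directory for an empty archive.
  return 'zip' if $head =~ /\APK(?:\x03\x04|\x05\x06)/;
  return 'tar' if has_ustar_magic($head);
  # Anything else that does not start with the gzip magic bytes is left alone.
  return 'unknown' unless $head =~ /\A\x1f\x8b/;
  # Gzipped: peek at the first decompressed block to tell tar.gz from a lone file.
  my $gz = IO::Uncompress::Gunzip->new($path)
    or die "Can't gunzip $path: $GunzipError\n";
  my $inner  = '';
  my $status = $gz->read($inner, 512);
  $gz->close();
  die "Can't gunzip $path: $GunzipError\n" if $status < 0;
  return has_ustar_magic($inner) ? 'tar.gz' : 'gz';
}
```

Calling sniff_source_type on each downloaded file would then decide whether to gunzip a lone file, untar an archive, or hand a ZIP over unchanged.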

I can offer my CorTeX code which unpacks and repackages as ZIP the arXiv sources as an example of using libarchive to both unpack and repackage: https://github.com/dginev/CorTeX/blob/master/src/importer.rs#L121-L249
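
For readers who would rather stay in Perl, the same unpack-and-repackage idea can be sketched with Archive::Tar and Archive::Zip instead of libarchive. This is not the CorTeX implementation; repackage_tar_as_zip is a hypothetical helper, and the sketch assumes the archive fits in memory and that Archive::Zip is installed.

```perl
use strict;
use warnings;
use Archive::Tar;
use Archive::Zip qw(:ERROR_CODES);

# Hypothetical helper (not part of LaTeXML or CorTeX): copy every regular file
# of a .tar or .tar.gz archive into a new ZIP, keeping the stored paths.
# Returns the number of files written.
sub repackage_tar_as_zip {
  my ($tar_path, $zip_path) = @_;
  # Archive::Tar reads gzipped archives transparently when IO::Zlib is available.
  my $tar = Archive::Tar->new($tar_path)
    or die "Can't read $tar_path: " . (Archive::Tar->error || 'unknown error') . "\n";
  my $zip   = Archive::Zip->new();
  my $count = 0;
  for my $entry ($tar->get_files) {
    next unless $entry->is_file;    # skip directories, links and special files
    my $content = $entry->get_content;
    $zip->addString(defined $content ? $content : '', $entry->full_path);
    $count++;
  }
  $zip->writeToFileNamed($zip_path) == AZ_OK
    or die "Can't write $zip_path\n";
  return $count;
}
```

The resulting ZIP can then be passed to unpack_source, which, as noted above, only supports ZIP input.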

dginev commented 5 years ago

Also, while I'm at it, and while I do not want to discourage you from building your own arXiv conversion service with latexml, I don't recall if I have asked before why you'd like to create one. If you are looking for an HTML5 dataset for arXiv converted by latexml, I would like to advertise the one we've bundled ourselves, which you can find here: https://sigmathling.kwarc.info/resources/arxmliv-dataset-082018/

But I indeed don't mean to discourage other experiments running latexml over arxiv, just trying to see if there are easy wins here.

lab156 commented 5 years ago

Thanks, I already have access to the SigMathLing data and we have considered using it in the Formal Abstracts project. We have also found the GloVe embedding files very useful.

The main reason for learning how to convert arXiv articles is to understand how it works. Also, not everybody in our team has signed the SigMathLing NDA. We are also interested in just the subset of math articles, and the .xml output seems enough for our purposes, so we probably do not need any further processing.

Anyway, this is very much a work in progress and hence I really appreciate all your observations.

dginev commented 5 years ago

Great, thanks for the explanation! I'll leave the issue open for now as an enhancement. It could be an easy entry-level task for a new contributor to simply mirror the ZIP functionality with a module such as Archive::Tar to also support .gz and .tar.gz inputs.
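
A minimal sketch of what such a mirror could start from, using only Archive::Tar and other core modules. The name unpack_tar_source is hypothetical, and the sketch only extracts files into a directory; it does not reproduce whatever main-file selection unpack_source performs for ZIP input.

```perl
use strict;
use warnings;
use Archive::Tar;
use File::Path qw(make_path);
use File::Spec;

# Hypothetical helper (not part of LaTeXML): extract a .tar or .tar.gz archive
# into $destination and return the paths of the extracted entries.
sub unpack_tar_source {
  my ($source, $destination) = @_;
  make_path($destination);
  # Archive::Tar reads gzipped archives transparently when IO::Zlib is available.
  my $tar = Archive::Tar->new($source)
    or die "Can't read in source archive: " . (Archive::Tar->error || $source) . "\n";
  my @extracted;
  for my $name ($tar->list_files) {
    # Refuse entries that would land outside the destination directory.
    if (File::Spec->file_name_is_absolute($name) || grep { $_ eq '..' } split(m{/}, $name)) {
      warn "Skipping unsafe entry: $name\n";
      next;
    }
    my $target = File::Spec->catfile($destination, split(m{/}, $name));
    $tar->extract_file($name, $target)
      or die "Can't extract $name: " . $tar->error . "\n";
    push @extracted, $target;
  }
  return @extracted;
}
```

A single-file .gz input would need a separate branch, since it is not a tar archive at all.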

lab156 commented 5 years ago

Yeah, I'm looking into it, but my Perl is not great.

I was also thinking of adding a function that works for uncompressed directories. It selects the main file and optionally compresses it.

lab156 commented 5 years ago

How can I call unpack_source with the "literal:" protocol? I have tried everything I can think of and could not make it work. It seems to be possible according to this: https://github.com/brucemiller/LaTeXML/blob/22db863d7358d56e197a3845375775714577cc82/lib/LaTeXML/Util/Pack.pm#L30-L35

dginev commented 5 years ago

From what I remember I have an example of using this capacity in the showcase web service. Its upload path in particular transfers over a zip's binary data this way, here is a line link:

https://github.com/dginev/LaTeXML-Plugin-ltxmojo/blob/master/lib/LaTeXML/Plugin/LtxMojo.pm#L97

Live demo is at: https://latexml.mathweb.org/upload

but you can also clone that repository and play around.

dginev commented 3 years ago

@lab156 did you end up doing .tar.gz conversions with latexml some time back? While the general feature sounds nice-to-have, it would be great if it also had a driver. I'm still managing with just the .zip capability, but it's also a force of habit at this point.

lab156 commented 3 years ago

Same here, I figured out how to work with .zip archives and moved on. I still think this would be useful, and I might pick it up eventually.

dginev commented 3 years ago

Thanks for letting me know, that gets it downgraded to the "Future (if)" milestone for now :>

I definitely agree it's a "nice to have", but without a compelling need we'll return to it when we get some "nice to have" spare time. Good to hear the ZIP flow is usable on your end!