brucemiller / LaTeXML

LaTeXML: a TeX and LaTeX to XML/HTML/ePub/MathML translator.
http://dlmf.nist.gov/LaTeXML/

Are there any samples or a corpus for a test kit? #835

Closed ppKrauss closed 6 years ago

ppKrauss commented 7 years ago

Real-life samples (not just "samples for demo") are important for testing.

For example, I need to "see", with samples, the potential of using LaTeXML for LaTeX-to-JATS conversion...

About JATS and JATS-samples

Ideally we would select some samples (e.g. 2, 10, or 100 documents): the LaTeX manuscripts or articles that were the sources of articles in PubMed Central... A good sample set depends on community use: a small sample set works well as "standard examples", while a larger one can be used as a text corpus.

dginev commented 7 years ago

Um, hello. What exactly are you requesting in this issue? LaTeXML has an (integration) test suite already that covers most of the core latex conversion features.

The suite is lacking in post-processing tests, and we welcome external contributions - you can grab a showcase article of choice and run it through JATS, comparing to an ideal test case. Ideally we would have had this already, but with limited time we tend to prioritize core features over the periphery.

As to a large corpus LaTeXML has been run over, there is an active effort (although also low on manpower in recent months) to convert arXiv.org to HTML; a report on it can be found here: https://lists.kwarc.info/pipermail/project-latexml/2016-October/002196.html

And the latest stats here: http://cortex.mathweb.org/corpus/arXMLiv/tex_to_html

ppKrauss commented 7 years ago

Hi @dginev, thanks! My perception matches the "lacking in post-processing tests" point, as you say. This issue is a suggestion to enhance this repo (or a second Git repository) with such real-life samples and tests.

Perhaps, for a "large LaTeXML corpus", you could use preprint manuscripts that were later published in PubMed Central, SciELO, and others, by contacting journals and authors. I can help by contacting 1 or 2 journals.

brucemiller commented 7 years ago

If you're requesting what I think you're requesting, I'd like to turn the request around! :>

Basically, none of us currently on the project are very familiar with JATS, but we have a (hopefully) good proof of concept. As I recall, it passed validation as JATS documents, but that of course doesn't tell you whether it faithfully captures the semantics of the original document.

So, what we'd really appreciate is for someone who is familiar with JATS and how it's used to apply LaTeXML to a sample of real-life documents to determine how well it's working, what the faults are --- I'm sure there are some. If faults arise, we can try to fix them. Once it's working convincingly, we could easily derive some small unit tests for regression purposes.
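The small unit tests mentioned above could follow a simple golden-file pattern, sketched below. `convert()` is a hypothetical stand-in for an actual latexml invocation (e.g. via subprocess); here it just wraps its input so the harness itself is runnable.

```python
# Sketch of a golden-file regression test: convert a tiny sample and
# compare against a previously blessed ("golden") output.
def convert(tex: str) -> str:
    # hypothetical stand-in for a real latexml call
    return f"<document>{tex}</document>"

def matches_golden(tex: str, golden: str) -> bool:
    # in a real suite, normalize whitespace and generated ids first
    return convert(tex) == golden

golden = "<document>\\section{Intro}</document>"
assert matches_golden("\\section{Intro}", golden)
print("sample matches golden output")
```

Once a few real-life documents convert convincingly, their blessed outputs can be checked in and compared on every change.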

You up for that? :>

brucemiller commented 7 years ago

I guess not...

blahah commented 7 years ago

Please reopen this - I will be exploring LaTeXML for arxiv -> JATS conversion, with the goal to produce a ScienceFair datasource.

I will document experiments and progress here, and contribute back any bugs (+fixes) or improvements needed.

If a test corpus is useful, I can contribute that back too.

ppKrauss commented 7 years ago

Hi all, sorry for abandoning this... My manuscripts will be converted only to "Simple JATS", not real JATS... When I obtain complete JATS I will come back with files.

Hi @blahah, let's see what you offer there (!). I can help with JATS analysis (some quality control).

dginev commented 7 years ago

Oh wow, @blahah thanks for the interest! Let's keep the issue open then and see if we can reap some mutual benefits here. I've been involved in arXiv conversion with latexml, so I can answer some of the tricky aspects you may encounter on the way. I would offer some of my arXiv LaTeX-to-HTML build system for reuse, but it's not generalized enough and hasn't been maintained recently. Still, it may at least be worth a glance at the part where the actual latexml workers are called, which is here:

https://github.com/dginev/latexml-plugin-cortex

While the code may not be terribly useful to you, the comments may be an early warning of what could go wrong: https://github.com/dginev/LaTeXML-Plugin-Cortex/blob/master/bin/latexml_worker#L12

blahah commented 7 years ago

thanks @dginev, I've been browsing that code and it was very useful. I may well be back with questions :)

@ppKrauss thanks - we will validate against the JATS DTD and (the ultimate test) check whether it works in the Lens viewer. If you have any other tips for validation I'd welcome them :)
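Before full DTD validation, even a stdlib XML parse catches gross structural problems in converted output. A minimal sketch (the sample document and element paths are illustrative, following JATS Archiving tag set naming; real validation needs the JATS DTD plus a validating parser such as lxml or xmllint):

```python
# First-pass structural sanity check for a converted JATS document:
# well-formedness plus a couple of expected elements. Illustrative
# only -- not a substitute for DTD validation.
import xml.etree.ElementTree as ET

sample = """<article>
  <front><article-meta>
    <title-group><article-title>A toy article</article-title></title-group>
  </article-meta></front>
  <body><p>Hello JATS.</p></body>
</article>"""

root = ET.fromstring(sample)  # raises ParseError if not well-formed
assert root.tag == "article"
assert root.find("front/article-meta/title-group/article-title") is not None
print("basic JATS structure looks sane")
```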

blahah commented 7 years ago

@dginev do you have any stats on the time and (compute) resources your HTML conversion took? Curious to see what I'm letting myself in for :)

ppKrauss commented 7 years ago

Hi @blahah, I suggest also some samples at http://jats4r.org/validator/

PS: I feel a little rusty, but I can do "human analysis" as JATS expert.

blahah commented 7 years ago

Managed to get it working pretty well :)

(two screenshots attached, 2017-07-07 04:02)

A few things to fix but very close!

dginev commented 7 years ago

@dginev do you have any stats on the time and (compute) resources your HTML conversion took? Curious to see what I'm letting myself in for :)

Short answer: It's tricky and slow, too slow without at least 20+ CPUs around.

I do, somewhere in the email archives... here is the last time I publicly shared some data on the LaTeXML list (10.2016): https://lists.kwarc.info/pipermail/project-latexml/2016-October/002196.html

Apparently our arXMLiv-specific email archives are private? Here is an email snippet with runtime stats from January 2016, when I did the last detailed email report w.r.t runtime:

Dear all, The first "dataset" run is now complete.

  • It took almost exactly 101 hours (4 days and 5 hours), or just about 2.82 jobs/second.
    • That means the average arXiv job took 2.5 minutes to convert.
    • CorTeX is officially "stable"!
    • the dispatcher, database and workers processed the entirety of arXiv without a single unforeseen failure. There was zero admin intervention during the run.
  • We had the full array of workers operational from beginning to end, and all workers and HULK machines remain online and healthy after the run.

The final success rate is:

  No Problems   8.13%   83492
  Warning      45.81%  470344
  Error        36.27%  372436
  Fatal         9.78%  100442

http://cortex.mathweb.org/corpus/arXMLiv/tex_to_html (there may be small fluctuations when you load the live site, as ~100 jobs were yet to return and I marked them as timeouts to wrap up)

I am attaching the numbers from the previous run at [1]. We can record a small deterioration percentage-wise, but given that the current run is a lot more honest about post-processing errors and fatals, and imposes a hard 2 GB memory limit, this is understandable.

While I have stated CorTeX is now stable, it is not yet error-free. I believe a significant portion of the "file not found" errors could be due to the workers cleaning up files too aggressively, but that is yet to be established. Some of the reporting menus in the frontend are currently broken due to URI escaping issues in nickel.js (the web framework), so I'll try to fix them over the weekend, so that we get a better overview of the best venues for improvement before the next rerun. We can also think of redesigning the categorization of certain messages, so that they don't create an enormous variety of "what" classes (e.g. missing figure filenames).

I intend to package the results of the current run as an "arXMLiv-12-2015" dataset, and follow up with new dataset releases on a quarterly or 6-month basis, depending on HULK's availability and our progress with improving the conversion rates.

Greetings, Deyan

[1] 4th stability run, December 2015

  No Problems   8.16%   81005
  Warning      47.01%  466706
  Error        35.5%   352371
  Fatal         9.33%   92625

On 01/18/2016 04:43 PM, Deyan Ginev wrote:

Dear all,

After we observed a few niggles w.r.t error-reporting and worker stability in the last run, Bruce and I added some upgrades that hopefully give us a stable-enough setup to run a first "dataset" run for arXMLiv. Thus I have just started a full rerun from scratch.

All workers are using LaTeXML's latest HEAD (git version 29a47e5d1289edd377592cbe6af8bb74e90f03e8).

We're rerunning 1,025,914 arXiv sources with:

  • 420 CPUs (HULK, beryl, local laptops)
  • 20 minute job timeouts
  • 2 GB RAM memory limit

You can monitor the run at: http://cortex.mathweb.org/corpus/arXMLiv/tex_to_html

Keep in mind the sub-report pages are cached, and you can see the timestamp in the footer to check their freshness. They should be at best a few minutes, and at worst a few hours, behind the main report page, which isn't cached.

Greetings, Deyan
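The throughput figures in the reports above are internally consistent, as a quick back-of-the-envelope check shows (a sketch; the per-class counts, the 101-hour wall clock, and the 420-CPU figure are all taken from the emails above):

```python
# Back-of-the-envelope check of the arXMLiv run throughput, using the
# numbers reported in the email above.
counts = {"no_problems": 83492, "warning": 470344,
          "error": 372436, "fatal": 100442}
total_jobs = sum(counts.values())          # ~1.03M arXiv sources
wall_clock_s = 101 * 3600                  # "almost exactly 101 hours"
jobs_per_sec = total_jobs / wall_clock_s   # reported as ~2.82 jobs/second
avg_job_min = 420 / jobs_per_sec / 60      # 420 CPUs -> ~2.5 min per job
print(f"{jobs_per_sec:.2f} jobs/s, {avg_job_min:.1f} min/job")
```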

dginev commented 7 years ago

I think this is also a good time to quickly remark that there has been a lot going on "behind the curtains" of the project, targeting exactly the performance deficiencies of LaTeXML, and there may be (exciting?) developments in that vein later in 2017. But I can't share more for the moment, save for this "light hint".

brucemiller commented 7 years ago

Wow! This is a nice development!! Looking forward to some bug reports :> Thanks!

blahah commented 7 years ago

Thanks @dginev - I do have access to 64+ core machines I can use so that sounds totally achievable.

And excited to see what the secret developments are 😄

dginev commented 7 years ago

Marking this as a documentation enhancement for 0.8.4 (2 releases from now), feel free to send us updates as things progress @blahah ! :+1:

dginev commented 6 years ago

Hey @blahah , could you share if your effort to get arxiv->JATS had some progress / results, and if there are any blockers we can help with?

My research group upped its hardware capacity a week ago, and we seem to have found a viable compromise for research-only redistribution of the HTML5 of arXiv, so I'll be sharing some @kwarc news here soon. I will also try to address the requests @ppKrauss had about documenting corpus-level conventions and best practices (at least the ones I arrived at).

The folks at arxiv-vanity (cc @bfirsh ) are now also doing the latexml dance over arxiv, so that makes 3 separate parties working on converting that corpus, and it would be excellent to share notes and upgrades as we go along. Pretty exciting actually. I may be a lot more active on this front in 2018, so this feels like a good time to drop a note here.

blahah commented 6 years ago

@dginev I will put all my code and results online. The basic story is that I got it working pretty nicely, but with lots of edge cases. I'd be very interested to sync up.

When you say:

we seem to have found a viable compromise for research-only redistribution of the HTML5 of arXiv

do you mean a licensing compromise or a technical one?

The license issue seems to me the biggest one.

dginev commented 6 years ago

@blahah sounds great, and sounds about right about the edge cases - great to mutually solve those.

The compromise solves the licensing problem, but it is a "legally technical" one: it mitigates risk by having a dedicated organization do the redistribution for an extremely limited purpose (non-commercial + research). We can't really wish away the default arXiv license, I am afraid... The only "ultimate" solution remains having Cornell itself host the alternative formats, but that still seems to be a long-term perspective only. I am just happy we found some way to make the data available to the wider scientific community; we should be moving from "unavailable" to "slightly inconvenient direct download" soon.

jmnicholson commented 6 years ago

@dginev I'd like to know when "slightly inconvenient" download is ready! Would love to use this for R-factor stuff :)

dginev commented 6 years ago

We have just posted live our arXiv.org 08.2017 HTML5 dataset, together with a token model and word embeddings, intended for redistribution for research and tool development. Advertising them here as requested, and we welcome any and all community feedback:

https://sigmathling.kwarc.info/news/2018/01/24/dataset/

ppKrauss commented 6 years ago

hi @dginev , congratulations on your work!

I am trying to download arXMLiv_08_2017_no_problem.zip, but "Authorize gl.kwarc.info" fails.

PS: you could use Git LFS in a public repo to offer your big ~5 GB file — no need to hide it, and there is no cost.

dginev commented 6 years ago

Hi @ppKrauss, this is where the "slightly inconvenient" part of the download comes in. You need to sign an NDA with the SIGMathLing organization to be given access to the downloads, which is the legal workaround for mitigating any weird licensing troubles with arXiv (a long topic I won't go into here). Detailed instructions here.

For now we are testing that redistribution route so both the dataset and embeddings follow these guidelines. Hopefully in a bright mid-term future we'll have an official path to distribution that won't need NDAs and hassle, sorry for the inconvenience.

PS: The large files are indeed hosted via git LFS in gitlab, but they are hidden for licensing reasons.

dginev commented 6 years ago

I will close this issue for now, feel free to drop a comment or open a new one that is more specific to JATS, given that we covered a lot of ground here.

For now, I added a pointer to the arXMLiv corpus I mentioned in the latexml wiki pages:

https://github.com/brucemiller/LaTeXML/wiki/Interesting-Applications

And will use the main #896 issue for discussing improving the markup documentation. I'm not doing any active JATS work at the moment, so we may want to find a different driver for that.

jmnicholson commented 6 years ago

The link to arXMLiv seems to be dead: https://kwarc.info/systems/arXMLiv/

I’d like to download the corpus if possible for fair use purposes, can you link here?


dginev commented 6 years ago

Thanks for spotting that dead link, fixed. A bit too much is happening on that wiki page, if you specifically care about the dataset you can find it here: https://sigmathling.kwarc.info/resources/arxmliv-dataset-082017/

and my download explanations are in the comment here https://github.com/brucemiller/LaTeXML/issues/835#issuecomment-360254936