CDRH / abbot

A tool for facilitating interoperability among XML-encoded text collections.

Abbot skips files when processing large batches #8

Open bzillig1 opened 12 years ago

bzillig1 commented 12 years ago

I've noticed that Abbot (abbot-0.4.1-standalone.jar) seems to skip certain input files when processing batches larger than 100 files. For example, I pointed it at a directory with ~1,800 ECCO files and Abbot processed only 613 of them. I tried re-processing progressively larger batches, and the problem appears somewhere between 100 and 900 files.

sramsay commented 12 years ago

Yikes! I'll check this out.

sramsay commented 12 years ago

I can't reproduce this. I took 1077 unadorned texts from ECCO, and it returned 1077.

Are you trying to do this on your laptop (as opposed to on abbot)? And if so, can you tell me what version of Java you're running?

Also, could you tell me which directory you're pointing at? Something in /opt/corpora?

bzillig1 commented 12 years ago

The problem occurs on my laptop and on the server, but only when files are present that are not well-formed. So I think it's not related to the batch size after all, but to whether Saxon encounters an error from which it can't recover (such as "SXXP0003: Error reported by XML parser: Premature end of file."). This sort of error seems to kill the JVM. Is there a way that Abbot can isolate transformations so that one bad file won't stop other valid files from being processed? The corpus I'm using is on the abbot server here: /opt/corpora/tcp/ecco_with_headers/
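
This is not Abbot's actual batch loop, but a minimal sketch of the kind of per-file isolation being asked for here, written against Saxon's s9api; the stylesheet name and the input/output directories are hypothetical placeholders:

```java
import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.SaxonApiException;
import net.sf.saxon.s9api.XsltExecutable;
import net.sf.saxon.s9api.XsltTransformer;

import javax.xml.transform.stream.StreamSource;
import java.io.File;

public class IsolatedBatch {
    public static void main(String[] args) throws SaxonApiException {
        Processor proc = new Processor(false);
        // "convert.xsl", "input", and "output" are placeholder names.
        XsltExecutable exec = proc.newXsltCompiler()
                .compile(new StreamSource(new File("convert.xsl")));

        File[] inputs = new File("input").listFiles((dir, name) -> name.endsWith(".xml"));
        if (inputs == null) return;

        for (File in : inputs) {
            try {
                XsltTransformer t = exec.load();
                t.setSource(new StreamSource(in));
                t.setDestination(proc.newSerializer(new File("output", in.getName())));
                t.transform();
            } catch (SaxonApiException e) {
                // A parse error like SXXP0003 (premature end of file) only
                // skips this one file; the rest of the batch keeps running.
                System.err.println("Quarantined " + in.getName() + ": " + e.getMessage());
            }
        }
    }
}
```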

sramsay commented 12 years ago

Yes. And I'm glad to hear it's this, and not some more frightening error in the parallelization scheme. This is, of course, related to the issue of quarantining, though it sounds as if I need to trap for well-formedness regardless.

I will put this at the top of the queue.
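
One possible shape for the well-formedness trap mentioned above is a quick SAX parse before a file ever reaches the transform, so that malformed inputs can be quarantined instead of passed along. A minimal sketch, with the quarantine handling itself left out:

```java
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParserFactory;
import java.io.File;
import java.io.FileInputStream;

public class WellFormednessCheck {
    // Returns true only if the file parses cleanly as XML.
    static boolean isWellFormed(File f) {
        try (FileInputStream in = new FileInputStream(f)) {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(in), new DefaultHandler());
            return true;
        } catch (Exception e) {
            // SAXException (e.g. premature end of file) or IOException:
            // either way, the file should be quarantined, not transformed.
            return false;
        }
    }
}
```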