dkpro / dkpro-tc

UIMA-based text classification framework built on top of DKPro Core and DKPro Lab.
https://dkpro.github.io/dkpro-tc/

Writing CASes to a zip archive #135

Open daxenberger opened 9 years ago

daxenberger commented 9 years ago

Originally reported on Google Code with ID 135

DKPro Core 1.6.1 will support writing to ZIP archives, e.g. using BinaryCasWriter. We
should make use of this feature:

[PreprocessingTask]

AnalysisEngineDescription writer = createEngineDescription(BinaryCasWriter.class,
        BinaryCasWriter.PARAM_TARGET_LOCATION, "jar:file:" + root + "/archive.zip",
        BinaryCasWriter.PARAM_TYPE_SYSTEM_LOCATION, root + "/typesystem.bin",
        BinaryCasWriter.PARAM_FORMAT, "6");

and likewise for the Meta- and FeatureExtractionTasks.

One problem remains: I am not sure whether this makes sense for the BatchTaskCrossValidation,
where we (currently) need to split the overall set of files into several folds (file
sets), each of which must be retrieved individually.

Reported by daxenberger.j on 2014-05-28 12:41:02

daxenberger commented 9 years ago
"root" points to the path on the file system. Unless you have a strong reason to store
the type system outside the ZIP, I suggest you remove the "root" from PARAM_TYPE_SYSTEM_LOCATION
and just set it to "typesystem.bin" (no slash). Relative type system locations are
placed inside the ZIP - absolute locations are placed directly on the file system.

Reported by richard.eckart on 2014-05-28 12:42:45

daxenberger commented 9 years ago
Thanks for the hint. I don't see a reason to store the typesystem outside the ZIP, so
the location should be relative.

Reported by daxenberger.j on 2014-05-28 12:47:58

daxenberger commented 9 years ago

Reported by daxenberger.j on 2014-06-04 16:09:40

daxenberger commented 9 years ago
I wonder, didn't we plan to do this in 0.6.0? 

Reported by richard.eckart on 2014-06-25 15:04:57

daxenberger commented 9 years ago
Because of the problem mentioned in the first post: I'm not sure how to integrate this
with the current Crossvalidation BatchTask.

Reported by daxenberger.j on 2014-06-25 15:09:46

daxenberger commented 9 years ago
Ah, I see. It shouldn't be a big problem but it is probably too much for the 0.6.0 release.

The basic principle should remain the same. We'd just need some extra code to extract
the file names for the folds from the ZIP instead of scanning them from the file system.
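
The idea of taking the fold file names from the ZIP could look roughly like the sketch below. It uses only java.util.zip, not DKPro TC or DKPro Lab, and the class name ZipFoldSplitter, the round-robin assignment, the ".bin" suffix filter, and the demo archive are all assumptions made for illustration. It also assumes the type system sits inside the archive as "typesystem.bin" and therefore has to be skipped.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

final class ZipFoldSplitter {

    // Assumed name of the type system entry stored next to the CASes.
    static final String TYPE_SYSTEM_ENTRY = "typesystem.bin";

    /** Returns the sorted names of all *.bin entries except the type system. */
    static List<String> listCasEntries(Path zip) {
        try (ZipFile zipFile = new ZipFile(zip.toFile())) {
            return zipFile.stream()
                    .filter(entry -> !entry.isDirectory())
                    .map(ZipEntry::getName)
                    .filter(name -> name.endsWith(".bin"))
                    .filter(name -> !name.equals(TYPE_SYSTEM_ENTRY))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Distributes the entry names over k folds in round-robin order. */
    static List<List<String>> split(List<String> names, int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be at least 1");
        }
        List<List<String>> folds = new ArrayList<>();
        for (int i = 0; i < k; i++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < names.size(); i++) {
            folds.get(i % k).add(names.get(i));
        }
        return folds;
    }

    /** Builds a small throwaway archive so the sketch can be run on its own. */
    static Path createDemoArchive() {
        String[] names = {"doc3.bin", "doc1.bin", TYPE_SYSTEM_ENTRY, "doc2.bin", "doc4.bin"};
        try {
            Path zip = Files.createTempFile("cas-archive", ".zip");
            try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip))) {
                for (String name : names) {
                    out.putNextEntry(new ZipEntry(name));
                    out.write(name.getBytes(StandardCharsets.UTF_8)); // placeholder content
                    out.closeEntry();
                }
            }
            return zip;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<String> names = listCasEntries(createDemoArchive());
        System.out.println(split(names, 2));
    }
}
```

Each fold is then just a list of entry names, which a reader could be pointed at instead of a set of files on disk. How that list would be handed to the actual cross-validation task is left open here.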

Reported by richard.eckart on 2014-06-25 15:11:57

daxenberger commented 9 years ago

Reported by daxenberger.j on 2015-01-06 11:40:17

Horsmann commented 8 years ago

@daxenberger this one can be closed as won't fix now, right?

daxenberger commented 8 years ago

This is independent of the latest changes to CV mode. The idea here was to write all CASes into a zip archive rather than individual files.

Or why did you think it is obsolete?

Horsmann commented 8 years ago

Oh ok, I misunderstood it then. Sorry.

Horsmann commented 6 years ago

@reckart Is this feature available now? What exactly is the benefit of writing a single .zip instead of N binary CAS files? Neither is human-readable, but naming the binary CAS files after the documents allows some visual confirmation that the reader read what it was supposed to read. It helps to understand, at least a little, what TC is doing. Unless this makes processing a lot faster, I would rather not have ZIPs.

reckart commented 6 years ago

Should be available.

reckart commented 6 years ago

I don't remember the rationale. Might be to avoid using subfolders in an execution context... or to reduce the number of files which can at times become very large... maybe @daxenberger remembers more.

daxenberger commented 6 years ago

This was certainly to reduce the number of files produced by TC, which can become quite large for larger datasets. The "visual confirmation" issue could be avoided by writing some sort of log(?) file, which records the names of the files written to the archive.
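
A minimal sketch of the log-file idea, using only the JDK: every entry written to the archive also gets one line in a plain-text log next to it. The class name LoggedZipWriter, the helper methods, and the one-name-per-line format are assumptions for illustration. This is not how BinaryCasWriter works internally.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class LoggedZipWriter {

    /** Writes each document as a ZIP entry and records its name in the log. */
    static void write(Path zip, Path log, Map<String, byte[]> documents) {
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip));
                BufferedWriter logWriter = Files.newBufferedWriter(log, StandardCharsets.UTF_8)) {
            for (Map.Entry<String, byte[]> document : documents.entrySet()) {
                out.putNextEntry(new ZipEntry(document.getKey()));
                out.write(document.getValue());
                out.closeEntry();
                // One line per entry, in the order the entries were written.
                logWriter.write(document.getKey());
                logWriter.newLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the log back as a list of entry names. */
    static List<String> readLog(Path log) {
        try {
            return Files.readAllLines(log, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a temporary file with the given suffix. */
    static Path tempFile(String suffix) {
        try {
            return Files.createTempFile("tc-demo", suffix);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, byte[]> documents = new LinkedHashMap<>();
        documents.put("doc1.bin", new byte[] {1, 2, 3});
        documents.put("doc2.bin", new byte[] {4, 5, 6});
        Path log = tempFile(".log");
        write(tempFile(".zip"), log, documents);
        readLog(log).forEach(System.out::println);
    }
}
```

Opening the log in a text editor then gives the same kind of visual check that the per-document file names provide today.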

Horsmann commented 6 years ago

@reckart Do you have a code example that writes to .zip?

reckart commented 6 years ago

There are examples in these unit tests: https://github.com/dkpro/dkpro-core/blob/57dc82892d1bb419158eff37119dfaaca0763d8b/dkpro-core-api-io-asl/src/test/java/de/tudarmstadt/ukp/dkpro/core/api/io/JCasFileWriter_ImplBaseTest.java

reckart commented 6 years ago

Actually, it's even in the documentation: https://dkpro.github.io/dkpro-core/releases/1.9.0/docs/user-guide.html#_working_with_zip_archives

Horsmann commented 6 years ago

Hm, when adapting this for the BinaryCasWriter and BinaryCasReader I get a "Not in GZIP format" exception:

writing:
        AnalysisEngineDescription xmiWriter = createEngineDescription(BinaryCasWriter.class,
                BinaryCasWriter.PARAM_TARGET_LOCATION,
                "jar:file:" + aContext.getFolder(output, AccessMode.READWRITE).getPath() + "/data.gz",
                BinaryCasWriter.PARAM_FORMAT, "6+"
                );

reading:
        createReaderDescription(BinaryCasReader.class,
                BinaryCasReader.PARAM_SOURCE_LOCATION,
                root.getAbsolutePath() + "/data.gz!*.bin");

reckart commented 6 years ago

Looks like during reading, you are missing the jar:file: prefix.

reckart commented 6 years ago

... and mind that these are "zip" files, not "gz" files.
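
To illustrate the zip/gz distinction with plain JDK classes: ZIP and GZIP are different container formats, and the JDK's GZIPInputStream rejects ZIP data with the same "Not in GZIP format" message quoted above. The sketch below only shows that format mismatch. Whether DKPro Core picks a GZIP stream because of the ".gz" extension is an assumption here, and the class name ZipVersusGzip is made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipVersusGzip {

    /** Builds a tiny in-memory ZIP archive with a single entry. */
    static byte[] zipBytes() {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            out.putNextEntry(new ZipEntry("doc1.bin"));
            out.write(new byte[] {1, 2, 3});
            out.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Tries to open the data as GZIP; returns "ok" or the error message. */
    static String tryOpenAsGzip(byte[] data) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return "ok";
        } catch (IOException e) {
            // The GZIP header check fails because ZIP data starts with "PK".
            return e.getMessage();
        }
    }

    /** Opens the data as ZIP and returns the name of the first entry. */
    static String firstEntryName(byte[] data) {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry = in.getNextEntry();
            return entry == null ? null : entry.getName();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = zipBytes();
        System.out.println("as gzip: " + tryOpenAsGzip(data));
        System.out.println("as zip:  " + firstEntryName(data));
    }
}
```

Naming the archive "data.zip" and using the jar:file: prefix on both the writing and the reading side, as suggested in the two replies above, keeps the file name consistent with the actual container format.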