Open daxenberger opened 9 years ago
"root" points to the path on the file system. Unless you have a strong reason to store
the type system outside the ZIP, I suggest you remove the "root" from PARAM_TYPE_SYSTEM_LOCATION
and just set it to "typesystem.bin" (no slash). Relative type system locations are
placed inside the ZIP - absolute locations are placed directly on the file system.
Reported by richard.eckart
on 2014-05-28 12:42:45
Thanks for the hint. I don't see a reason to store the typesystem outside the ZIP, so
the location should be relative.
Reported by daxenberger.j
on 2014-05-28 12:47:58
Reported by daxenberger.j
on 2014-06-04 16:09:40
I wonder, didn't we plan to do this in 0.6.0?
Reported by richard.eckart
on 2014-06-25 15:04:57
Because of the problem mentioned in the first post: I'm not sure how to integrate this
with the current Crossvalidation BatchTask.
Reported by daxenberger.j
on 2014-06-25 15:09:46
Ah, I see. It shouldn't be a big problem but it is probably too much for the 0.6.0 release.
The basic principle should remain the same. We'd just need some extra code to extract
the file names for the folds from the ZIP instead of scanning them from the file system.
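The extra code mentioned above could look roughly like the following sketch, which collects entry names from a ZIP archive instead of scanning a directory. It uses only `java.util.zip`; the class name `ZipEntryLister`, the method `listEntries`, and the `.bin` suffix filter are illustrative assumptions, not part of DKPro TC.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of DKPro TC.
public class ZipEntryLister {

    /** Returns the sorted names of all non-directory entries that end with the given suffix. */
    public static List<String> listEntries(Path zip, String suffix) throws IOException {
        try (ZipFile zipFile = new ZipFile(zip.toFile())) {
            return zipFile.stream()
                    .filter(entry -> !entry.isDirectory())
                    .map(ZipEntry::getName)
                    .filter(name -> name.endsWith(suffix))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a small archive so the example is self-contained.
        Path zip = Files.createTempFile("cases", ".zip");
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip))) {
            for (String name : new String[] { "doc2.bin", "doc1.bin", "notes.txt" }) {
                out.putNextEntry(new ZipEntry(name));
                out.write(new byte[] { 0 });
                out.closeEntry();
            }
        }
        // The returned names could then be split into folds like a directory listing.
        System.out.println(listEntries(zip, ".bin"));
        Files.delete(zip);
    }
}
```

The returned list plays the role of the directory listing, so the fold-splitting logic itself would not need to change.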
Reported by richard.eckart
on 2014-06-25 15:11:57
Reported by daxenberger.j
on 2015-01-06 11:40:17
@daxenberger this one can be closed as won't fix now, right?
This is independent of the latest changes to CV mode. The idea here was to write all CASes into a zip archive rather than individual files.
Or why did you think it is obsolete?
Oh ok, I misunderstood it then. Sry.
@reckart Is this feature available now? What exactly is the benefit of writing a single .zip instead of N bin-cas files? Neither is human-readable, but naming each bin-cas after its document allows some visual confirmation that the reader read what it was supposed to read. It helps to understand at least a little bit what TC is doing. Unless this makes processing a lot faster, I would rather not have zips.
Should be available.
I don't remember the rationale. Might be to avoid using subfolders in an execution context... or to reduce the number of files which can at times become very large... maybe @daxenberger remembers more.
This was certainly to reduce the number of files produced by TC, which can become quite large for bigger datasets. The "visual confirmation" issue could be avoided by writing some sort of log(?) file that records the names of the files written to the archive.
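A minimal sketch of that log-file idea, assuming plain `java.util.zip` rather than the actual DKPro writers: every entry written to the archive is also recorded, one name per line, in a separate text file. The class `LoggedZipWriter`, the method `writeWithLog`, and the file names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of DKPro TC or DKPro Core.
public class LoggedZipWriter {

    /** Writes each document as one ZIP entry and records the entry names in a log file. */
    public static List<String> writeWithLog(Path zip, Path log, Map<String, byte[]> docs)
            throws IOException {
        List<String> written = new ArrayList<>();
        try (ZipOutputStream out = new ZipOutputStream(Files.newOutputStream(zip))) {
            for (Map.Entry<String, byte[]> doc : docs.entrySet()) {
                out.putNextEntry(new ZipEntry(doc.getKey()));
                out.write(doc.getValue());
                out.closeEntry();
                written.add(doc.getKey());
            }
        }
        // One entry name per line, so the archive contents can be checked at a glance.
        Files.write(log, written, StandardCharsets.UTF_8);
        return written;
    }

    public static void main(String[] args) throws IOException {
        Map<String, byte[]> docs = new LinkedHashMap<>();
        docs.put("doc1.bin", new byte[] { 1 });
        docs.put("doc2.bin", new byte[] { 2 });
        Path zip = Files.createTempFile("cases", ".zip");
        Path log = Files.createTempFile("cases", ".log");
        writeWithLog(zip, log, docs);
        Files.readAllLines(log, StandardCharsets.UTF_8).forEach(System.out::println);
    }
}
```

This keeps the file count at two per run (archive plus log) while still showing which documents were read.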
@reckart Do you have a code example that writes to .zip?
Actually, it's even in the documentation: https://dkpro.github.io/dkpro-core/releases/1.9.0/docs/user-guide.html#_working_with_zip_archives
Hm, when adapting this for the BinaryCasWriter and BinaryCasReader I get a Not in GZIP format exception
writing:
AnalysisEngineDescription xmiWriter = createEngineDescription(BinaryCasWriter.class,
        BinaryCasWriter.PARAM_TARGET_LOCATION,
        "jar:file:" + aContext.getFolder(output, AccessMode.READWRITE).getPath() + "/data.gz",
        BinaryCasWriter.PARAM_FORMAT, "6+");
reading:
createReaderDescription(BinaryCasReader.class,
        BinaryCasReader.PARAM_SOURCE_LOCATION,
        root.getAbsolutePath() + "/data.gz!*.bin");
Looks like during reading, you are missing the jar:file: prefix.
... and mind that these are "zip" files, not "gz" files.
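To illustrate the zip-versus-gz point with the JDK alone: ZIP and GZIP are different container formats, and `java.util.zip.GZIPInputStream` rejects ZIP data with exactly the message quoted above. The sketch below does not use DKPro. Whether this is the precise code path behind the exception in this thread is an assumption, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, independent of DKPro.
public class ZipVersusGzip {

    /** Builds an in-memory ZIP archive containing a single entry. */
    public static byte[] zipBytes(String entryName, byte[] payload) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            out.putNextEntry(new ZipEntry(entryName));
            out.write(payload);
            out.closeEntry();
        }
        return buffer.toByteArray();
    }

    /** Tries to open the data as GZIP and returns "ok" or the exception message. */
    public static String readAsGzip(byte[] data) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            in.read();
            return "ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    /** Opens the data as ZIP and returns the name of the first entry, or null if empty. */
    public static String firstEntryName(byte[] data) throws IOException {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry = in.getNextEntry();
            return entry == null ? null : entry.getName();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] archive = zipBytes("doc1.bin", new byte[] { 42 });
        System.out.println(readAsGzip(archive));     // GZIP reader rejects the ZIP header
        System.out.println(firstEntryName(archive)); // ZIP reader finds the entry
    }
}
```

A `.gz` extension invites GZIP handling, so naming the archive `data.zip` and addressing it with the `jar:file:` prefix on both the writer and the reader side avoids mixing the two formats.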
Originally reported on Google Code with ID 135
Reported by daxenberger.j
on 2014-05-28 12:41:02