dkpro / dkpro-bigdata

DKPro large scale processing support
https://dkpro.github.io/dkpro-bigdata

BinCasSerializer needs Source Typesystem #2

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
To be able to deal with different Typesystems, BinCasWritable needs to store 
the Typesystem alongside the data.

This can be done either as a header on each document (highly redundant) or as a 
separate file alongside each SequenceFile on HDFS.

Original issue reported on code.google.com by hpz...@gmail.com on 13 Sep 2013 at 4:50

GoogleCodeExporter commented 9 years ago
Fixed by introducing BinCasWithTypesystemWritable, which serializes a 
compressed Typesystem with each CAS.
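
To make the per-record approach concrete, here is a minimal sketch of what "a compressed Typesystem with each CAS" could look like on the wire. It is not the actual BinCasWithTypesystemWritable code: the class name `TypeSystemHeaderSketch`, the length-prefixed layout, and the choice of GZIP are all assumptions made for illustration, and both the Typesystem and the CAS are treated as opaque byte arrays.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch, not the dkpro-bigdata implementation.
class TypeSystemHeaderSketch {

    /** A decoded entry: the Typesystem description plus the CAS payload. */
    static final class Decoded {
        final byte[] typeSystem;
        final byte[] cas;

        Decoded(byte[] typeSystem, byte[] cas) {
            this.typeSystem = typeSystem;
            this.cas = cas;
        }
    }

    /** Assumed layout: [int length][gzip(typeSystem)][int length][cas bytes]. */
    static byte[] encode(byte[] typeSystem, byte[] cas) {
        try {
            ByteArrayOutputStream compressed = new ByteArrayOutputStream();
            try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
                gzip.write(typeSystem);
            }
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeInt(compressed.size()); // header length
            compressed.writeTo(out);         // compressed Typesystem header
            out.writeInt(cas.length);        // payload length
            out.write(cas);                  // serialized CAS, stored as-is
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads back an entry written by encode(). */
    static Decoded decode(byte[] bytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
            byte[] compressed = new byte[in.readInt()];
            in.readFully(compressed);
            byte[] typeSystem;
            try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
                typeSystem = readAll(gzip);
            }
            byte[] cas = new byte[in.readInt()];
            in.readFully(cas);
            return new Decoded(typeSystem, cas);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] typeSystem = "<typeSystemDescription/>".getBytes(StandardCharsets.UTF_8);
        byte[] cas = {1, 2, 3, 4};
        byte[] encoded = encode(typeSystem, cas);
        Decoded decoded = decode(encoded);
        boolean ok = Arrays.equals(typeSystem, decoded.typeSystem) && Arrays.equals(cas, decoded.cas);
        System.out.println("round trip ok: " + ok);
        System.out.println("entry size: " + encoded.length + " bytes for a " + cas.length + "-byte CAS");
    }
}
```

In a layout like this the header is repeated for every document, so the overhead grows with the number of CASes rather than with the number of distinct Typesystems.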

The resulting files are considerably larger than XMI compressed with Snappy.

The user can always choose to use the old format by setting 
job.setOutputValueClass(CASWritable.class)

TODO: store the Typesystem separately; this would be the optimal solution.
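
One way the separate-storage idea might look, sketched under heavy assumptions: the Typesystem is written once to a sidecar file next to the data file, and each record holds only the CAS bytes. The class `SidecarTypeSystemSketch`, the `.typesystem` suffix, and the record layout are hypothetical, and the local filesystem (via `java.nio.file`) stands in for HDFS and a real SequenceFile.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: local files stand in for SequenceFiles on HDFS.
class SidecarTypeSystemSketch {

    // Assumed naming convention for the sidecar file.
    static final String SIDECAR_SUFFIX = ".typesystem";

    static Path sidecarFor(Path dataFile) {
        return dataFile.resolveSibling(dataFile.getFileName() + SIDECAR_SUFFIX);
    }

    /** Creates a fresh temp directory and returns a data file path inside it. */
    static Path newTempDataFile() {
        try {
            return Files.createTempDirectory("sidecar-sketch").resolve("part-00000");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the Typesystem once, then the CAS payloads without any header. */
    static void write(Path dataFile, byte[] typeSystem, List<byte[]> casPayloads) {
        try {
            Files.write(sidecarFor(dataFile), typeSystem);
            try (DataOutputStream out = new DataOutputStream(
                    new BufferedOutputStream(Files.newOutputStream(dataFile)))) {
                out.writeInt(casPayloads.size());
                for (byte[] cas : casPayloads) {
                    out.writeInt(cas.length);
                    out.write(cas);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the shared Typesystem from the sidecar file. */
    static byte[] readTypeSystem(Path dataFile) {
        try {
            return Files.readAllBytes(sidecarFor(dataFile));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads all CAS payloads back from the data file. */
    static List<byte[]> readPayloads(Path dataFile) {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(dataFile)))) {
            int count = in.readInt();
            List<byte[]> payloads = new ArrayList<>(count);
            for (int i = 0; i < count; i++) {
                byte[] cas = new byte[in.readInt()];
                in.readFully(cas);
                payloads.add(cas);
            }
            return payloads;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dataFile = newTempDataFile();
        byte[] typeSystem = "<typeSystemDescription/>".getBytes(StandardCharsets.UTF_8);
        write(dataFile, typeSystem, Arrays.asList(new byte[] {1, 2}, new byte[] {3}));
        System.out.println("payloads read: " + readPayloads(dataFile).size());
        System.out.println("typesystem bytes: " + readTypeSystem(dataFile).length);
    }
}
```

With this arrangement the Typesystem cost is paid once per file instead of once per CAS, at the price of having to keep the two files together.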

Original comment by hpz...@gmail.com on 30 Sep 2013 at 4:27