Closed by carschno 8 years ago
Sounds good :)
Actually, there are plenty of existing readers, e.g. https://github.com/hltfbk/Excitement-Open-Platform/blob/master/common/src/main/java/eu/excitementproject/eop/common/utilities/corpora/reuters/ I hope it is easier to build on one of those than to implement a reader from scratch.
Mind that the reader you pointed to is GPL. If there are plenty, an ASL-based one would be nicer. When you copy code from other places, please clearly state the provenance and try to isolate it in a package.
To simplify things, I've used ExtractReuters to convert the SGML files into text files. It was a bit complicated, and the only way I could make it work was through a shell script provided by Mahout.
I am now going to implement a reader for these text files and will need to figure out a way to include that pre-processing step somehow.
Just found this class in DKPro TC: de.tudarmstadt.ukp.dkpro.tc.examples.io.ReutersCorpusReader
... but somehow this looks like it would be simply reading text files - not SGML.
Sorry for seeing this thread so late (I was knocked out by evil Kindergarten-viruses over the weekend).
The TC reader uses one of the already extracted versions of this corpus that can be found all over the web. Depending on what kind of meta data one wants to use, you might still need the SGML which is more complete.
We should, however, keep in mind that the dataset is known to have some issues (length bias, special characters that guide the classification, etc.).
Thanks for the pointers; however, because I'll need the classes (<TOPICS> tags) sooner or later, I've extended the Lucene ExtractReuters implementation accordingly. It's very ugly though, "parsing" SGML with regular expressions, but a proper parser would require a lot of additional effort due to messed-up DTDs etc.
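For illustration, the regular-expression approach described above could look roughly like the sketch below. This is not the actual extension of ExtractReuters: the class name TopicsExtractorSketch and the method extractTopics are hypothetical, and the sketch assumes that each document lists its topics as <D>...</D> items inside a single <TOPICS> element.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Lucene or DKPro.
public class TopicsExtractorSketch {

    // Matches the first <TOPICS>...</TOPICS> element; DOTALL lets it span line breaks.
    private static final Pattern TOPICS =
            Pattern.compile("<TOPICS>(.*?)</TOPICS>", Pattern.DOTALL);

    // Assumption: each topic is wrapped in its own <D>...</D> element.
    private static final Pattern ITEM = Pattern.compile("<D>(.*?)</D>");

    /** Returns the topic labels found in one document, or an empty list if there are none. */
    public static List<String> extractTopics(String sgml) {
        List<String> topics = new ArrayList<>();
        Matcher block = TOPICS.matcher(sgml);
        if (block.find()) {
            Matcher item = ITEM.matcher(block.group(1));
            while (item.find()) {
                topics.add(item.group(1).trim());
            }
        }
        return topics;
    }

    public static void main(String[] args) {
        String doc = "<REUTERS TOPICS=\"YES\">\n"
                + "<TOPICS><D>grain</D><D>wheat</D></TOPICS>\n"
                + "<BODY>Some text.</BODY>\n"
                + "</REUTERS>";
        System.out.println(extractTopics(doc));
    }
}
```

As noted, this is pattern matching rather than real SGML parsing, so it only holds up as long as the files follow the assumed layout.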
The tests for this new module fail on Jenkins.
Implement a reader for the Reuters-21578 Text Classification corpus available here: http://www.daviddlewis.com/resources/testcollections/reuters21578/ It is (was) used in many papers, amongst others in Latent Dirichlet Allocation (Blei et al., 2003). It's encoded in a specific SGML format providing the text itself and various metadata, such as the text topics (categories), date, named entities, etc. I suggest creating a new module io.reuters-asl which provides a reader that reads the text into a CAS and stores the metadata in MetaDataStringField annotations, or, where applicable, in the DocumentMetadata.