dkpro / dkpro-core

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.
https://dkpro.github.io/dkpro-core

Reader for Reuters-21578 Text Classification Corpus #691

Closed carschno closed 8 years ago

carschno commented 8 years ago

Implement a reader for the Reuters-21578 Text Classification corpus, available here: http://www.daviddlewis.com/resources/testcollections/reuters21578/ It is (was) used in many papers, among others in Latent Dirichlet Allocation (Blei et al., 2003). It's encoded in a specific SGML format that provides the text itself and various metadata, such as the text topics (categories), date, named entities, etc. I suggest creating a new module io.reuters-asl which provides a reader that reads the text into a Cas and stores the metadata in MetaDataStringField annotations or, where applicable, in the DocumentMetadata.

reckart commented 8 years ago

Sounds good :)

carschno commented 8 years ago

Actually, there are plenty of existing readers, e.g. https://github.com/hltfbk/Excitement-Open-Platform/blob/master/common/src/main/java/eu/excitementproject/eop/common/utilities/corpora/reuters/ I hope it is easier to make use of one of those than to implement a reader from scratch.

reckart commented 8 years ago

Mind that the reader you pointed to is GPL. If there are plenty, an ASL-based one would be nicer. When you copy code from other places, please clearly state the provenance and try to isolate it in a package.

carschno commented 8 years ago

To simplify things, I've used ExtractReuters to convert the SGML files into text files. It's been a bit complicated and the only way I could make this work was through a Shell script provided by Mahout.

I am now going to implement a reader for these text files and will need to figure out a way to include that pre-processing step somehow.

reckart commented 8 years ago

Just found this class in DKPro TC: de.tudarmstadt.ukp.dkpro.tc.examples.io.ReutersCorpusReader

... but somehow this looks like it would be simply reading text files - not SGML.

zesch commented 8 years ago

Sorry for seeing this thread so late (I was knocked out by evil Kindergarten-viruses over the weekend).

The TC reader uses one of the already extracted versions of this corpus that can be found all over the web. Depending on which metadata you want to use, you might still need the SGML, which is more complete.

We should, however, keep in mind that the dataset is known to have some issues (length bias, special characters that guide the classification, etc.).

carschno commented 8 years ago

Thanks for the pointers. However, because I'll need the classes (<TOPICS> tags) sooner or later, I've extended the Lucene ExtractReuters implementation accordingly. It's very ugly, though: it "parses" SGML with regular expressions, but a proper parser would require a lot of additional effort due to messed-up DTDs etc.
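
For illustration only, here is a minimal sketch of what regex-based extraction of the <TOPICS> entries and the <BODY> text could look like. It is not the code from the module or from Lucene's ExtractReuters; the class name ReutersSgmlSketch and both method names are hypothetical, and it assumes one <REUTERS> element per input string, with topics wrapped in <D> tags as in the distributed .sgm files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of DKPro Core or Lucene.
public class ReutersSgmlSketch {
    // DOTALL lets '.' match line breaks, since elements may span several lines.
    private static final Pattern TOPICS =
            Pattern.compile("<TOPICS>(.*?)</TOPICS>", Pattern.DOTALL);
    private static final Pattern ITEM = Pattern.compile("<D>(.*?)</D>");
    private static final Pattern BODY =
            Pattern.compile("<BODY>(.*?)</BODY>", Pattern.DOTALL);

    // Returns the topic labels of one <REUTERS> document (possibly empty).
    public static List<String> extractTopics(String doc) {
        List<String> topics = new ArrayList<>();
        Matcher block = TOPICS.matcher(doc);
        if (block.find()) {
            Matcher item = ITEM.matcher(block.group(1));
            while (item.find()) {
                topics.add(item.group(1).trim());
            }
        }
        return topics;
    }

    // Returns the body text without the trailing "&#3;" end-of-text marker,
    // or an empty string if the document has no <BODY> element.
    public static String extractBody(String doc) {
        Matcher m = BODY.matcher(doc);
        return m.find() ? m.group(1).replace("&#3;", "").trim() : "";
    }
}
```

A reader built along these lines would call extractTopics and extractBody once per <REUTERS> element and then copy the results into the CAS. Entity references in the text (e.g. &lt;) are left undecoded in this sketch.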

reckart commented 8 years ago

The tests for this new module fail on Jenkins.