CogComp / cogcomp-nlp

CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
http://nlp.cogcomp.org/

Corpusreader for TAC dataset - need usage instructions #750

Open lashmore opened 3 years ago

lashmore commented 3 years ago

It is hard to work out how the TACReader class is meant to be used. What path should I pass as "corpusRoot"? Below is the file hierarchy of the raw TAC 2014-2015 data; the 2015 folder has a structure similar to 2014's.

From what I can tell, TACReader parses XML documents. The only folder containing XML data is source_documents, where the .txt files contain XML markup. Does TACReader parse ONLY source_documents, or does it also read from other folders in the hierarchy?

[Screenshot: file hierarchy of the raw TAC 2014-2015 data]

Here is how I'm trying to use TACReader, followed by the error message I get. I've tried several different paths for corpusRoot, and they all produce the same error. I'm working blind here, so any help would be much appreciated!

import edu.illinois.cs.cogcomp.nlp.corpusreaders.TACReader;

public class PreprocessTAC {
    public static void main(String[] args) throws Exception {
        String path = "/path/to/tac_kbp_eng_event_arg_comp_train_eval_2014-2015/data/";
        TACReader reader_tac = new TACReader(path, false);
    }
}

Error message:

Exception in thread "main" java.lang.NullPointerException: Cannot read the array length because "<local4>" is null
    at edu.illinois.cs.cogcomp.core.io.IOUtils.lsFilesRecursive(IOUtils.java:145)
    at edu.illinois.cs.cogcomp.nlp.corpusreaders.TACReader.getFileListing(TACReader.java:239)
    at edu.illinois.cs.cogcomp.nlp.corpusreaders.XmlDocumentReader.initializeReader(XmlDocumentReader.java:107)
    at edu.illinois.cs.cogcomp.nlp.corpusreaders.AnnotationReader.<init>(AnnotationReader.java:47)
    at edu.illinois.cs.cogcomp.nlp.corpusreaders.AbstractIncrementalCorpusReader.<init>(AbstractIncrementalCorpusReader.java:61)
    at edu.illinois.cs.cogcomp.nlp.corpusreaders.XmlDocumentReader.<init>(XmlDocumentReader.java:89)
    at edu.illinois.cs.cogcomp.nlp.corpusreaders.TACReader.<init>(TACReader.java:113)
    at PreprocessTAC.main(PreprocessTAC.java:7)
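
One way to narrow this down, sketched under assumptions: the trace ends in a directory listing (IOUtils.lsFilesRecursive), and in plain Java, File.listFiles() returns null when the path does not denote a directory or an I/O error occurs. Iterating over that null array would give a NullPointerException like the one above. Whether lsFilesRecursive actually calls listFiles() on corpusRoot or on some subfolder derived from it is not confirmed here, so the check below only rules out the simplest causes. CorpusRootCheck and diagnose are hypothetical names for illustration and are not part of cogcomp-nlp.

```java
import java.io.File;

// Hypothetical diagnostic helper; not part of cogcomp-nlp.
class CorpusRootCheck {

    // Reports whether `path` can be listed as a directory using plain java.io.
    static String diagnose(String path) {
        File root = new File(path);
        if (!root.exists()) {
            return "missing";
        }
        if (!root.isDirectory()) {
            return "not a directory";
        }
        // listFiles() returns null if the path is not a directory
        // or an I/O error occurs (for example, no read permission).
        File[] children = root.listFiles();
        if (children == null) {
            return "unreadable";
        }
        return "ok: " + children.length + " entries";
    }

    public static void main(String[] args) {
        // Placeholder default taken from the snippet above; pass the real path as an argument.
        String path = args.length > 0
                ? args[0]
                : "/path/to/tac_kbp_eng_event_arg_comp_train_eval_2014-2015/data/";
        System.out.println(path + " -> " + diagnose(path));
    }
}
```

If this reports "ok" for the path passed as corpusRoot, the null listing probably comes from a subdirectory that the reader expects under that root, which would have to be checked against the TACReader source.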