CogComp / cogcomp-nlp

CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
http://nlp.cogcomp.org/

duplicate files in md #580

Closed mssammon closed 6 years ago

mssammon commented 6 years ago

These are apparently duplicates or replications of files in other modules ('external', 'corpusreaders', and 'tokenizer').

If they are duplicates, replace them. If they are modified versions, unify them.

https://github.com/CogComp/cogcomp-nlp/blob/master/md/src/main/java/edu/illinois/cs/cogcomp/pipeline/handlers/StanfordTrueCaseHandler.java

https://github.com/CogComp/cogcomp-nlp/blob/master/md/src/main/java/edu/illinois/cs/cogcomp/nlp/corpusreaders/ACEReader.java

https://github.com/CogComp/cogcomp-nlp/blob/master/md/src/main/java/edu/illinois/cs/cogcomp/nlp/tokenizer/TokenizerStateMachine.java

Slash0BZ commented 6 years ago

Ok I will handle it

Slash0BZ commented 6 years ago

For ACEReader: I think I have to keep this version unless 1) we merge the true-cased ACEReader into corpusreader, or 2) we discard true-casing in both MD and RE. @danyaljj We don't want the true-cased ACEReader in corpusreader, right?

danyaljj commented 6 years ago

"We don't want the true-cased ACEReader in corpusreader"

True. Is it possible to make it a class that extends the ACEReader of corpusreader and overrides the method that uses the true-caser? Also, changing the name would reduce the confusion (for example, ACEReaderWithTrueCaseFixer). An alternative is to write a function that, given a TextAnnotation with messed-up casing, fixes the casing and creates a new TextAnnotation. This way you can reuse the ACEReader of corpusreaders.
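A minimal sketch of the subclassing option, under stated assumptions. The thread does not show the ACEReader API, so `BaseReader`, `normalizeText`, and `readDocuments` below are hypothetical stand-ins rather than the real corpusreaders classes or methods, and the casing rule is a toy placeholder for the actual true-caser. Only the name `ACEReaderWithTrueCaseFixer` comes from the suggestion above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TrueCaseReaderSketch {

    /** Hypothetical stand-in for the corpusreaders ACEReader; NOT the real class. */
    public static class BaseReader {
        private final List<String> rawDocuments;

        public BaseReader(List<String> rawDocuments) {
            this.rawDocuments = new ArrayList<>(rawDocuments);
        }

        /** Hook for subclasses; the base reader returns the text unchanged. */
        protected String normalizeText(String text) {
            return text;
        }

        /** Reads every document, passing each one through the hook. */
        public List<String> readDocuments() {
            List<String> result = new ArrayList<>();
            for (String doc : rawDocuments) {
                result.add(normalizeText(doc));
            }
            return result;
        }
    }

    /** Subclass that overrides only the casing step and inherits everything else. */
    public static class ACEReaderWithTrueCaseFixer extends BaseReader {

        public ACEReaderWithTrueCaseFixer(List<String> rawDocuments) {
            super(rawDocuments);
        }

        @Override
        protected String normalizeText(String text) {
            // Toy placeholder for a real true-caser (an assumption, not the md module's logic):
            // text that is entirely upper case is lower-cased with its first character capitalized.
            boolean hasUpperCase = !text.equals(text.toLowerCase(Locale.ROOT));
            boolean allCaps = hasUpperCase && text.equals(text.toUpperCase(Locale.ROOT));
            if (!allCaps) {
                return text;
            }
            String lower = text.toLowerCase(Locale.ROOT);
            return Character.toUpperCase(lower.charAt(0)) + lower.substring(1);
        }
    }
}
```

With this shape, md would keep only the small subclass, and the duplicated reader logic could be dropped in favor of the corpusreaders version. Whether the real ACEReader exposes a suitable method to override is not confirmed in this thread.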

mssammon commented 6 years ago

I really like the idea of a truecaser function...
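A rough sketch of what such a function could look like, assuming an immutable annotation type. `SimpleAnnotation`, `fixCasing`, and `fixToken` are hypothetical names introduced for illustration; they are not the real TextAnnotation API, and the all-caps heuristic is only a placeholder for an actual true-caser.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

public class TrueCaseFunctionSketch {

    /** Minimal immutable stand-in for TextAnnotation; NOT the real class. */
    public static final class SimpleAnnotation {
        private final String id;
        private final List<String> tokens;

        public SimpleAnnotation(String id, List<String> tokens) {
            this.id = id;
            this.tokens = Collections.unmodifiableList(new ArrayList<>(tokens));
        }

        public String getId() {
            return id;
        }

        public List<String> getTokens() {
            return tokens;
        }
    }

    /** Returns a NEW annotation with repaired casing; the input is left untouched. */
    public static SimpleAnnotation fixCasing(SimpleAnnotation input) {
        List<String> fixed = new ArrayList<>();
        for (String token : input.getTokens()) {
            // The first token is treated as sentence-initial.
            fixed.add(fixToken(token, fixed.isEmpty()));
        }
        return new SimpleAnnotation(input.getId(), fixed);
    }

    /** Placeholder heuristic: lower-case all-caps tokens, capitalizing the first one. */
    static String fixToken(String token, boolean sentenceInitial) {
        boolean allCaps = token.length() > 1
                && token.equals(token.toUpperCase(Locale.ROOT))
                && !token.equals(token.toLowerCase(Locale.ROOT));
        if (!allCaps) {
            return token;
        }
        String lower = token.toLowerCase(Locale.ROOT);
        return sentenceInitial
                ? Character.toUpperCase(lower.charAt(0)) + lower.substring(1)
                : lower;
    }
}
```

Because the function takes an annotation and returns a new one, it can run after any reader, so the unmodified ACEReader from corpusreaders could be reused as is.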

Slash0BZ commented 6 years ago

I will first try to extend the class. I think it should work.