CogComp / cogcomp-nlp

CogComp's Natural Language Processing Libraries and Demos: Modules include lemmatizer, ner, pos, prep-srl, quantifier, question type, relation-extraction, similarity, temporal normalizer, tokenizer, transliteration, verb-sense, and more.
http://nlp.cogcomp.org/
Other
473 stars 142 forks source link

String.join includes an extra copy of separator at the end of the string. #593

Open mayhewsw opened 6 years ago

mayhewsw commented 6 years ago

https://github.com/CogComp/cogcomp-nlp/blob/ce0f3a03264d293cd751d7a416bbadd858066496/core-utilities/src/main/java/edu/illinois/cs/cogcomp/core/utilities/StringUtils.java#L60

There is a .trim() call, but this only trims whitespace. If the separator is non-whitespace (e.g. underscore, dash, period, bar, etc.), then trim doesn't work.

Also, Apache Commons has a StringUtils that works very nicely. Perhaps we want to remove this in favor of that.

mssammon commented 6 years ago

I agree that we should replace cogcomp's stringutils with the apache commons library. Probably a fair number of uses throughout the package, though.

schen149 commented 6 years ago

Currently there are some feature generators in Edison that calls the StringUtils.join() to generate feature, which includes an extra copy of dash in the feature name. Example below. https://github.com/CogComp/cogcomp-nlp/blob/7d9dad3fedc16ac59feb278815e27dc195d1367e/edison/src/main/java/edu/illinois/cs/cogcomp/edison/features/helpers/FeatureNGramUtility.java#L55

Switching to Apache Commons will change the feature names, and so break models that relies on Edison (PrepSRL, I think). What's the best thing to do here? @mssammon

cogcomp-dev commented 6 years ago

probably, requires retraining. Create a branch "requires_retrain" and make the change on this branch. Once implemented, create a new issue for prepsrl retraining and refer to it in the branch notes. presumably, it is sufficient to retrain prepsrl models and deploy, then update the dependencies on the branch; at that point, it can be merged. (You are not obligated to take that new prepsrl issue, but can if you want.)