Open mayhewsw opened 6 years ago
I agree that we should replace cogcomp's stringutils with the apache commons library. Probably a fair number of uses throughout the package, though.
Currently there are some feature generators in Edison that calls the StringUtils.join() to generate feature, which includes an extra copy of dash in the feature name. Example below. https://github.com/CogComp/cogcomp-nlp/blob/7d9dad3fedc16ac59feb278815e27dc195d1367e/edison/src/main/java/edu/illinois/cs/cogcomp/edison/features/helpers/FeatureNGramUtility.java#L55
Switching to Apache Commons will change the feature names, and so break models that relies on Edison (PrepSRL, I think). What's the best thing to do here? @mssammon
probably, requires retraining. Create a branch "requires_retrain" and make the change on this branch. Once implemented, create a new issue for prepsrl retraining and refer to it in the branch notes. presumably, it is sufficient to retrain prepsrl models and deploy, then update the dependencies on the branch; at that point, it can be merged. (You are not obligated to take that new prepsrl issue, but can if you want.)
https://github.com/CogComp/cogcomp-nlp/blob/ce0f3a03264d293cd751d7a416bbadd858066496/core-utilities/src/main/java/edu/illinois/cs/cogcomp/core/utilities/StringUtils.java#L60
There is a .trim() call, but this only trims whitespace. If the separator is non-whitespace (e.g. underscore, dash, period, bar, etc.), then trim doesn't work.
Also, Apache Commons has a StringUtils that works very nicely. Perhaps we want to remove this in favor of that.