DRY Cleaning

bst-mug / n2c2

Support code for participation at the 2018 n2c2 Shared-Task Track 1

https://n2c2.dbmi.hms.harvard.edu

Apache License 2.0

Open michelole opened 5 years ago

michelole commented 5 years ago

We do cleaning in several places in DataUtilities:

getSentences() (called from SentenceIterator and getTokens)
processTextReduced() (called from CharacterTrigram, deprecated)
cleanText() (called from Patient.getCleanedText(), used only by RBC)
LSTMClassifier.initializeTruncateLength (removed in d5246663ae211733506a98396fd2acabe16283f9)

DRY
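
One way to read the DRY goal above is a single helper that owns the cleaning rules, with every caller delegating to it. The following is only a minimal sketch: `TextCleaner`, its method names, and the whitespace-collapsing rule are assumptions for illustration and do not reflect the actual cleaning logic in DataUtilities.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper (not part of the repository): the one place where
// cleaning rules would live, so that call sites stop duplicating them.
final class TextCleaner {
    // Assumed rule, for illustration only: collapse runs of whitespace.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    private TextCleaner() {
        // Static utility class, no instances.
    }

    // Collapses whitespace runs to a single space and trims both ends.
    static String clean(String text) {
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    // Cleaning plus lowercasing, for callers that need both steps.
    static String cleanAndLowercase(String text) {
        return clean(text).toLowerCase(Locale.ROOT);
    }
}
```

With something like this in place, each of the methods listed above would call the shared helper instead of carrying its own copy of the rules.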

michelole commented 5 years ago

We should:

Call clean outside of getSentences (e.g. in getTokens) so that SalienceAnalyzer can analyze dirty sentences (and thus reveal possible overfitting to garbage).
NN: call clean and/or lowercase only in InputRepresentation.
Replace the call to getTokens in TokenIterator with a call to tokenize.
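
The first point could look roughly like the sketch below: sentence splitting leaves the text untouched, and cleaning happens only on the way to tokens. The method names are borrowed from this issue, but the signatures, the newline-based sentence splitting, and the cleaning rule are all assumptions for illustration, not the actual DataUtilities code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch (not the repository's code): cleaning is moved out of
// getSentences and applied only inside getTokens.
final class CleaningPipelineSketch {
    private CleaningPipelineSketch() {
        // Static utility class, no instances.
    }

    // Returns sentences as-is ("dirty"). Assumption: one sentence per line.
    static List<String> getSentences(String rawText) {
        List<String> sentences = new ArrayList<>();
        for (String line : rawText.split("\\R")) {
            if (!line.trim().isEmpty()) {
                sentences.add(line);
            }
        }
        return sentences;
    }

    // Assumed cleaning rule: keep letters, digits and whitespace only,
    // then collapse whitespace runs and trim.
    static String clean(String text) {
        return text.replaceAll("[^\\p{L}\\p{N}\\s]", " ")
                   .replaceAll("\\s+", " ")
                   .trim();
    }

    // Splits already-cleaned text on single spaces.
    static List<String> tokenize(String cleanedText) {
        if (cleanedText.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(cleanedText.split(" ")));
    }

    // Cleaning happens here, after sentence splitting, not inside it.
    static List<String> getTokens(String rawText) {
        List<String> tokens = new ArrayList<>();
        for (String sentence : getSentences(rawText)) {
            tokens.addAll(tokenize(clean(sentence)));
        }
        return tokens;
    }
}
```

In this arrangement a consumer such as SalienceAnalyzer could call getSentences and still see the garbage characters, while token-based consumers get cleaned input through getTokens.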