Open michelole opened 5 years ago
We should `clean` outside of `getSentences` (e.g. in `getTokens`) so that `SalienceAnalyzer` can analyze dirty sentences (and thus reveal potential overfitting to garbage).

- `clean` and/or `lowercase` only in `InputRepresentation`
- `getTokens` in `TokenIterator` to tokenize
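
A minimal sketch of the proposed split, assuming cleaning moves out of `getSentences` and into `getTokens`. The class name `CleaningSketch`, the regex-based sentence splitter, and the cleaning rule are all hypothetical stand-ins for illustration, not the project's actual `DataUtilities` code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not the project's DataUtilities.
final class CleaningSketch {

    // Returns sentences untouched, so an analyzer such as SalienceAnalyzer
    // could inspect the raw ("dirty") text.
    // Assumption: a naive split after '.', '!' or '?' stands in for the real
    // sentence detector.
    static List<String> getSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String s : text.split("(?<=[.!?])\\s+")) {
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    // Assumed cleaning rule: replace everything except letters, digits and
    // whitespace with a space, then collapse runs of whitespace.
    static String clean(String sentence) {
        return sentence
                .replaceAll("[^\\p{L}\\p{N}\\s]", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    // Cleaning and lowercasing happen only here, right before tokenization.
    static List<String> getTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String sentence : getSentences(text)) {
            String cleaned = clean(sentence).toLowerCase(Locale.ROOT);
            if (cleaned.isEmpty()) {
                continue; // the sentence contained only garbage
            }
            for (String token : cleaned.split(" ")) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

With this split, a caller that needs the raw text uses `getSentences`, while model input goes through `getTokens`, so the cleaning logic lives in a single place.
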

We do cleaning in several places in `DataUtilities`:

- `getSentences()` (called from `SentenceIterator` and `getTokens`)
- `processTextReduced()` (called from `CharacterTrigram`, deprecated)
- `cleanText()` (called from `Patient.getCleanedText()`, used only by RBC)
- `LSTMClassifier.initializeTruncateLength` (removed in d5246663ae211733506a98396fd2acabe16283f9)

DRY