The 1st two args to `st()` are `str, list`, and the skipthoughts code itself does some normalization very similar to ours (tokenization + adding spaces). It looks like we send it the right input.
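For reference, a minimal sketch of that tokenize-and-rejoin style of normalization, assuming NLTK's `word_tokenize`; the helper name here is illustrative, not one of our actual functions:

```python
from nltk.tokenize import word_tokenize

def space_tokenize(text):
    # Tokenize and re-join with single spaces, roughly what skipthoughts
    # does to its input internally.
    return " ".join(word_tokenize(text))

doc_text = "The first document, with punctuation."          # str: 1st arg to st()
corpus = ["A background doc.", "Another background doc."]   # list: 2nd arg to st()
st_doc = space_tokenize(doc_text)
st_corpus = [space_tokenize(t) for t in corpus]
```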
The 1st two args to `lda()` are `str, str`, and both should be normalized w/o stop words. It looks like we send it the right input.
However, in preprocess.py we do LDA training in `build_lda()` on the CountVectorizer results for an array of raw body_text entries. I think we want to train LDA on the CountVectorizer results for an array of entries normalized w/o stop words.
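A minimal sketch of what that `build_lda()` change might look like, assuming a scikit-learn `CountVectorizer` + `LatentDirichletAllocation` pipeline; `normalize_no_stopwords()` and the stop word list are placeholders for whatever normalizer we already use:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}  # placeholder stop word list

def normalize_no_stopwords(text):
    # Placeholder for our normalizer: lowercase and drop stop words.
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS)

def build_lda(raw_docs, n_topics=10):
    # The proposed change: normalize (w/o stop words) *before* CountVectorizer,
    # instead of vectorizing the raw body_text entries.
    normalized = [normalize_no_stopwords(d) for d in raw_docs]
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(normalized)
    lda = LatentDirichletAllocation(n_components=n_topics)
    lda.fit(counts)
    return vectorizer, lda

vectorizer, lda = build_lda(["Raw body text of a doc.", "Another raw doc body."])
```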
The 1st two args in `cnn()` are `list, list`, and both are normalized w/o stop words; but since they use the one_hot encoding, they likely should include stop words (their vocab has stop words and punctuation). The training is also mismatched, as it is trained on raw data, so fixes should go into two places. [This may be why it is doing so poorly.]
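A rough sketch of the two-place fix, assuming the point is just to run training and featurization through the same normalizer and keep stop words; the `xml_normalize()` here is a stand-in re-implementation, not our actual function:

```python
import re

def xml_normalize(text):
    # Stand-in for our xml normalization: strip markup and extra whitespace,
    # lowercase, but keep stop words and punctuation.
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", text)).strip().lower()

def preprocess_for_cnn(text):
    # One shared path for training and featurization, stop words kept.
    return xml_normalize(text).split()

training_corpus = ["<p>A raw training document.</p>", "Another raw document."]
train_tokens = [preprocess_for_cnn(d) for d in training_corpus]       # training side
doc_tokens = preprocess_for_cnn("<p>The document being scored.</p>")  # featurization side
# cnn(doc_tokens, background_tokens, ...) then sees the same vocabulary it was trained on
```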
The 1st two args in `wordonehot` are `list, list` and are raw. There is xml normalization before they are sent to one_hot, so it seems good.
The 1st two args in `w2v()` are `str, str`, and we are currently sending a raw document and normalized (w/o stop words and w/o punctuation) background document text. Our `w2v()` approach analyzes the first and last sentence of the document, and we have follow-on variants in `run_w2v()`, `run_w2v_elemwise()`, and `run_w2v_matrix()`. All three of these variants use punkt to find the first and last sentence of the text passed in (the doc one time, the background text the next) and then call `normalize.remove_stop_words()` on the resulting sentences.
Two potential issues:

- `w2v()` should be changed to use text normalized with stop words and with punctuation via `xml_normalize()`.
- `run_w2v()` and `run_w2v_elemwise()` shouldn't be removing stop words. Possibly also true for `run_w2v_matrix()`.
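A minimal sketch of what the fixed input handling could look like, assuming we keep punkt for the sentence split and simply stop calling `normalize.remove_stop_words()`; `xml_normalize()` is again a stand-in re-implementation:

```python
import re
from nltk.tokenize import sent_tokenize  # punkt-based sentence splitter

def xml_normalize(text):
    # Stand-in: strip markup and extra whitespace, lowercase, keep stop words
    # and the sentence punctuation that punkt needs.
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", text)).strip().lower()

def first_last_sentences(text):
    # Same treatment for the doc and the background text; note there is no
    # remove_stop_words() call on the resulting sentences.
    sentences = sent_tokenize(xml_normalize(text))
    return sentences[0], sentences[-1]

doc_first, doc_last = first_last_sentences("<p>First sentence. Middle. Last sentence.</p>")
bg_first, bg_last = first_last_sentences("Background text. With several sentences.")
```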
The 1st three args to `bow()` are `str, str, list`, and all three should have normalized text w/o stop words. The second arg is currently unnormalized.
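A small sketch of the `bow()` fix, reusing the same kind of placeholder normalizer as in the LDA sketch above; the vocab list is illustrative:

```python
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}  # placeholder stop word list

def normalize_no_stopwords(text):
    # Placeholder for our stop-word-free normalizer.
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS)

doc_norm = normalize_no_stopwords("The raw document text.")
bg_norm = normalize_no_stopwords("The raw background text.")  # this arg is currently sent unnormalized
vocab = ["document", "background", "text"]                    # illustrative 3rd arg (list)
# bow(doc_norm, bg_norm, vocab, ...) then gets normalized text for all three args
```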