The 1st two args to `st()` are `str, list`, and the skipthoughts code itself does some normalization very similar to ours (tokenization + adding spaces). It looks like we send it the right input.
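For reference, a minimal sketch of that tokenize-and-rejoin style of normalization, assuming NLTK's `word_tokenize`; the helper name here is illustrative, not one of our actual functions:

```python
from nltk.tokenize import word_tokenize

def space_tokenize(text):
    # Tokenize and re-join with single spaces, roughly what skipthoughts
    # does to its input internally.
    return " ".join(word_tokenize(text))

doc_text = "The first document, with punctuation."          # str: 1st arg to st()
corpus = ["A background doc.", "Another background doc."]   # list: 2nd arg to st()
st_doc = space_tokenize(doc_text)
st_corpus = [space_tokenize(t) for t in corpus]
```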
The 1st two args to `lda()` are `str, str`, and both should be normalized w/o stop words. It looks like we send it the right input.
However, in preprocess.py we do LDA training in `build_lda()` on the CountVectorizer results for an array of raw body_text entries. I think we want to train LDA on the CountVectorizer results for an array of entries normalized w/o stop words.
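A minimal sketch of what that `build_lda()` change might look like, assuming a scikit-learn `CountVectorizer` + `LatentDirichletAllocation` pipeline; `normalize_no_stopwords()` and the stop word list are placeholders for whatever normalizer we already use:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}  # placeholder stop word list

def normalize_no_stopwords(text):
    # Placeholder for our normalizer: lowercase and drop stop words.
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS)

def build_lda(raw_docs, n_topics=10):
    # The proposed change: normalize (w/o stop words) *before* CountVectorizer,
    # instead of vectorizing the raw body_text entries.
    normalized = [normalize_no_stopwords(d) for d in raw_docs]
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(normalized)
    lda = LatentDirichletAllocation(n_components=n_topics)
    lda.fit(counts)
    return vectorizer, lda

vectorizer, lda = build_lda(["Raw body text of a doc.", "Another raw doc body."])
```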
The 1st two args in `cnn()` are `list, list`, and both are normalized w/o stop words; but since they use the one_hot encoding, they likely should include stop words (their vocab has stop words and punctuation). The training is also mismatched, as it is trained on raw data, so fixes should go into two places. [This may be why it is doing so poorly.]
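A rough sketch of the two-place fix, assuming the point is just to run training and featurization through the same normalizer and keep stop words; the `xml_normalize()` here is a stand-in re-implementation, not our actual function:

```python
import re

def xml_normalize(text):
    # Stand-in for our xml normalization: strip markup and extra whitespace,
    # lowercase, but keep stop words and punctuation.
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", text)).strip().lower()

def preprocess_for_cnn(text):
    # One shared path for training and featurization, stop words kept.
    return xml_normalize(text).split()

training_corpus = ["<p>A raw training document.</p>", "Another raw document."]
train_tokens = [preprocess_for_cnn(d) for d in training_corpus]       # training side
doc_tokens = preprocess_for_cnn("<p>The document being scored.</p>")  # featurization side
# cnn(doc_tokens, background_tokens, ...) then sees the same vocabulary it was trained on
```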
The 1st two args in `wordonehot` are `list, list` and are raw. There is xml normalization before they are sent to one_hot, so it seems good.
The 1st two args in `w2v()` are `str, str`, and we are currently sending a raw document and normalized (w/o stop words and w/o punctuation) background document text. Our `w2v()` approach analyzes the first and last sentence of the document, and we have follow-on variants in `run_w2v()`, `run_w2v_elemwise()`, and `run_w2v_matrix()`. All three of these variants use punkt to find the first and last sentence of the text passed in (the doc one time, the background text the next) and then call `normalize.remove_stop_words()` on the resulting sentences.
Two potential issues:

- `w2v()` should be changed to use text normalized with stop words and with punctuation via `xml_normalize()`.
- `run_w2v()` and `run_w2v_elemwise()` shouldn't be removing stop words. Possibly also true for `run_w2v_matrix()`.
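A minimal sketch of what the fixed input handling could look like, assuming we keep punkt for the sentence split and simply stop calling `normalize.remove_stop_words()`; `xml_normalize()` is again a stand-in re-implementation:

```python
import re
from nltk.tokenize import sent_tokenize  # punkt-based sentence splitter

def xml_normalize(text):
    # Stand-in: strip markup and extra whitespace, lowercase, keep stop words
    # and the sentence punctuation that punkt needs.
    return re.sub(r"\s+", " ", re.sub(r"<[^>]+>", " ", text)).strip().lower()

def first_last_sentences(text):
    # Same treatment for the doc and the background text; note there is no
    # remove_stop_words() call on the resulting sentences.
    sentences = sent_tokenize(xml_normalize(text))
    return sentences[0], sentences[-1]

doc_first, doc_last = first_last_sentences("<p>First sentence. Middle. Last sentence.</p>")
bg_first, bg_last = first_last_sentences("Background text. With several sentences.")
```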
The 1st three args to `bow()` are `str, str, list`, and all three should have normalized text w/o stop words. The second arg is currently unnormalized.
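A small sketch of the `bow()` fix, reusing the same kind of placeholder normalizer as in the LDA sketch above; the vocab list is illustrative:

```python
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}  # placeholder stop word list

def normalize_no_stopwords(text):
    # Placeholder for our stop-word-free normalizer.
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS)

doc_norm = normalize_no_stopwords("The raw document text.")
bg_norm = normalize_no_stopwords("The raw background text.")  # this arg is currently sent unnormalized
vocab = ["document", "background", "text"]                    # illustrative 3rd arg (list)
# bow(doc_norm, bg_norm, vocab, ...) then gets normalized text for all three args
```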