Did we want to leave this using Naive Bayes, or change it to use a linear classifier?
Also, I assume I should build it on top of @concretevitamin's tokenizer & n-grams rather than the quick implementations I initially wrote?
@tomerk @etrain Related to the discussion in #65 - I want to understand what the pipelines (other than the language model pipeline) need in terms of n-gram-related functionality. Both of the following, or some other interface(s)?
RDD[String] => RDD[Set[String]] // Evan's snippet; not sure why Set instead of Seq
RDD[String] => RDD[Seq[(String, Double)]] // SimpleNGramTokenizer; not sure why appending 1d here
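For concreteness, a rough sketch of what those two shapes look like as plain Spark transformations (the whitespace tokenization and the constant 1.0 weight are illustrative stand-ins, not the actual implementations):

import org.apache.spark.rdd.RDD

// RDD[String] => RDD[Set[String]]: the distinct tokens of each document.
def toTokenSets(docs: RDD[String]): RDD[Set[String]] =
  docs.map(_.split("\\s+").toSet)

// RDD[String] => RDD[Seq[(String, Double)]]: each token paired with a
// weight, here a constant 1.0 as SimpleNGramTokenizer appears to do.
def toWeightedTokens(docs: RDD[String]): RDD[Seq[(String, Double)]] =
  docs.map(_.split("\\s+").toSeq.map(token => (token, 1.0)))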
Here is what the most recent version of the text classification pipeline is currently using, from a purely text-manipulation perspective:
object StringTransformers {
  // Split on punctuation and whitespace by default.
  def tokenizer(sep: String = "[\\p{Punct}\\s]+"): Transformer[String, Seq[String]] =
    Transformer(_.split(sep))

  def toLowerCase: Transformer[String, String] = Transformer(_.toLowerCase)

  def trim: Transformer[String, String] = Transformer(_.trim)

  // For each requested size, emit all sliding windows of that size.
  def nGrams[T](sizes: Seq[Int]) = Transformer {
    (terms: Seq[T]) => sizes.map(size => terms.sliding(size)).flatMap(identity)
  }

  // Count occurrences per term, then apply `fun` to each raw count.
  def termFreq(fun: DataType => DataType) = Transformer {
    (x: Seq[Any]) => x.groupBy(identity).mapValues(x => fun(x.size)).toSeq
  }
}
nGrams is Seq[T] => Seq[Seq[T]]
termFreq is Seq[Any] => Seq[(Any, DataType)]
termFreq takes a function because the frequency is often wanted with different meanings: e.g. log(# of occurrences), the total # of occurrences, or whether a term appeared at all (1) or not (0).
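For illustration, a minimal sketch of the kinds of weighting functions that could be passed in, assuming DataType is just Double (an assumption for this sketch; the actual type alias lives elsewhere in the codebase):

// Total # of occurrences: pass the raw count through unchanged.
val total: Double => Double = identity
// log(# of occurrences), with +1 so zero counts stay at zero.
val logged: Double => Double = count => math.log(count + 1)
// Binary presence: 1 if the term appeared at all, 0 otherwise.
val binary: Double => Double = count => if (count > 0) 1.0 else 0.0

// e.g. StringTransformers.termFreq(logged)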
@tomerk - Naive Bayes is fine; let's not try to change too much. Thanks for the explanation of the text classification pipeline - does that help, Zongheng? The other place this stuff will get used is an event extraction pipeline, but from an interface standpoint that will look basically like text classification.
Yes - it makes sense to use consistent infrastructure across the text classification pipelines.
@etrain Do we already have the RCV1 data? If not we need to get it from: http://trec.nist.gov/data/reuters/reuters.html
Also, we may want to use TRC2 instead of RCV1?
I'm not sure that TRC2 has labels - RCV1 is pretty standard and has the advantage that it should run on a single machine for demo purposes. It's a good idea to find a bigger labeled dataset as well!
Closed by #70. Thanks, Tomer.
Stop-word removal -> N-Grams -> TFIDF -> Naive Bayes
Should work on 20 Newsgroups and RCV1.
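A minimal plain-Scala sketch of those stages on a single document, under some assumptions: the stop-word list is a toy one, the TF weighting is log-damped, and the IDF part of TFIDF is omitted since it needs corpus-wide document frequencies. None of these choices are confirmed settings of the actual pipeline.

object TextPipelineSketch {
  // Toy stop-word list; a real pipeline would use a fuller one.
  val stopWords = Set("the", "a", "an", "of", "and", "to", "in")

  def featurize(doc: String): Seq[(Seq[String], Double)] = {
    val tokens = doc.toLowerCase.trim.split("[\\p{Punct}\\s]+").toSeq
    val filtered = tokens.filterNot(stopWords.contains)     // stop-word removal
    val grams = Seq(1, 2).flatMap(n => filtered.sliding(n)) // uni- and bigrams
    grams.groupBy(identity)                                 // term frequencies,
      .mapValues(g => math.log(g.size + 1.0))               // log-damped
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    featurize("The quick brown fox jumps over the lazy dog").foreach(println)
  }
}

The resulting (n-gram, weight) pairs would then be indexed or hashed into feature vectors before being handed to the Naive Bayes classifier.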