amplab / keystone

Simplifying robust end-to-end machine learning on Apache Spark.
http://keystone-ml.org/
Apache License 2.0

Text Classification Pipeline #12

Closed etrain closed 9 years ago

etrain commented 9 years ago

Stop-word removal -> N-Grams -> TFIDF -> Naive Bayes

Should work on 20 Newsgroups and RCV1.
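For illustration, the proposed stages could be sketched as plain Scala functions (the names and the direct function composition here are hypothetical, not the actual KeystoneML API; the trained, data-dependent TF-IDF and Naive Bayes stages are omitted):

```scala
object PipelineSketch {
  // A toy stop-word list, for illustration only.
  val stopWords = Set("the", "a", "an", "of", "and")

  // Split on punctuation and whitespace, lower-casing first.
  val tokenize: String => Seq[String] =
    _.toLowerCase.split("[\\p{Punct}\\s]+").toSeq

  val removeStopWords: Seq[String] => Seq[String] =
    _.filterNot(stopWords.contains)

  // Contiguous n-grams of a single size.
  def nGrams(n: Int): Seq[String] => Seq[Seq[String]] =
    terms => terms.sliding(n).toSeq

  // Raw term-frequency counts per n-gram.
  val termFreq: Seq[Seq[String]] => Map[Seq[String], Int] =
    grams => grams.groupBy(identity).map { case (g, occ) => g -> occ.size }

  // Stop-word removal -> N-Grams -> term frequencies; TF-IDF and the
  // classifier would follow as trained stages.
  val featurize: String => Map[Seq[String], Int] =
    tokenize andThen removeStopWords andThen nGrams(2) andThen termFreq
}
```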

tomerk commented 9 years ago

Did we want to leave this using naive bayes, or change it to use a linear classifier?

tomerk commented 9 years ago

Also, I assume I should be building it on top of @concretevitamin's tokenizer & n-grams rather than the quick implementations I initially wrote?

concretevitamin commented 9 years ago

@tomerk @etrain Related to the discussion in #65 - I want to understand what the pipelines (other than the language model pipeline) need in terms of n-gram-related functionality. Both of the following, or some other interface(s)?

RDD[String] => RDD[Set[String]] // Evan's snippet; not sure why Set instead of Seq
RDD[String] => RDD[Seq[(String, Double)]] // SimpleNGramTokenizer; not sure why appending 1d here
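One plausible reading of the difference, sketched per-record with the RDD wrapper dropped so it runs without Spark (names are illustrative, not from the codebase):

```scala
object NGramInterfaces {
  // RDD[String] => RDD[Set[String]] (Evan's snippet): a Set keeps only
  // unique tokens, so duplicate counts are lost -- enough for binary
  // presence/absence features, but not for term frequencies.
  val toTokenSet: String => Set[String] =
    line => line.split("\\s+").toSet

  // RDD[String] => RDD[Seq[(String, Double)]] (SimpleNGramTokenizer):
  // pairing each token with 1d makes it a unit count, ready to be
  // summed by key downstream.
  val toUnitCounts: String => Seq[(String, Double)] =
    line => line.split("\\s+").toSeq.map(tok => (tok, 1.0))
}
```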
tomerk commented 9 years ago

This is what the most recent version of the text classification pipeline is currently using, from a purely text-manipulation perspective:

object StringTransformers {
  def tokenizer(sep: String = "[\\p{Punct}\\s]+"): Transformer[String, Seq[String]] = Transformer(_.split(sep))

  def toLowerCase: Transformer[String, String] = Transformer(_.toLowerCase)

  def trim: Transformer[String, String] = Transformer(_.trim)

  def nGrams[T](sizes: Seq[Int]) = Transformer {
    (terms: Seq[T]) => sizes.map(size => terms.sliding(size)).flatMap(identity)
  }

  def termFreq(fun: DataType => DataType) = Transformer {
    (x: Seq[Any]) => x.groupBy(identity).mapValues(x => fun(x.size)).toSeq
  }
}

nGrams is Seq[T] => Seq[Seq[T]]
termFreq is Seq[Any] => Seq[(Any, DataType)]
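To make those two signatures concrete, here is a self-contained sketch with the Transformer wrapper replaced by plain functions (the wrapper and DataType are from KeystoneML and not defined here, so this is an illustration, not the actual code):

```scala
object NGramDemo {
  // All contiguous n-grams for each requested size, concatenated.
  def nGrams[T](sizes: Seq[Int])(terms: Seq[T]): Seq[Seq[T]] =
    sizes.flatMap(size => terms.sliding(size).toSeq)

  // Count occurrences of each gram, then apply a weighting function
  // (standing in for DataType => DataType) to each raw count.
  def termFreq[T](fun: Int => Double)(grams: Seq[T]): Seq[(T, Double)] =
    grams.groupBy(identity).map { case (g, occ) => (g, fun(occ.size)) }.toSeq
}
```

For example, `nGrams(Seq(1, 2))(Seq("a", "b", "a"))` yields the three unigrams followed by the two bigrams, and feeding that into `termFreq` with an identity-style weighting counts each gram.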

tomerk commented 9 years ago

termFreq takes a function because you often want different weightings: e.g. log of the number of occurrences, the total number of occurrences, or whether a term appeared at all (1) or not (0).
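Those weightings could be sketched as functions to pass in (Int count in, Double weight out; the log variant here adds 1 before taking the log, a common smoothing choice that is my assumption, not necessarily what the pipeline does):

```scala
object TermWeightings {
  // Total number of occurrences, as-is.
  val raw: Int => Double = count => count.toDouble

  // Log-scaled count; the +1 keeps a count of 0 mapping to 0.0.
  val logScaled: Int => Double = count => math.log(1.0 + count)

  // Binary presence: appeared at all (1.0) or not (0.0).
  val binary: Int => Double = count => if (count > 0) 1.0 else 0.0
}
```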

etrain commented 9 years ago

@tomerk - Naive Bayes is fine. Let's not try to change too much. Thanks for the explanation of the text classification pipeline - does that help Zongheng? The other place this stuff will get used is an event extraction pipeline, but from an interfaces standpoint that will look basically like text classification.

Yes - it makes sense to be using consistent infrastructure across the text classification pipelines.

tomerk commented 9 years ago

@etrain Do we already have the RCV1 data? If not we need to get it from: http://trec.nist.gov/data/reuters/reuters.html

tomerk commented 9 years ago

Also, we may want to use TRC2 instead of RCV1?

etrain commented 9 years ago

I'm not sure that TRC2 has labels - RCV1 is pretty standard and has the advantage that it should run on a single machine for demo purposes. It's a good idea to find a bigger labeled dataset as well!


etrain commented 9 years ago

Closed by #70. Thanks, Tomer.