amueller / scipy-2016-sklearn

Scikit-learn tutorial at SciPy2016
Creative Commons Zero v1.0 Universal

Replace spam by imdb text data #76

Open rhiever opened 7 years ago

rhiever commented 7 years ago

Per the TODO file. Maybe @amueller can elaborate on this issue.

amueller commented 7 years ago

Well, the IMDb text data is bigger, and it is what is used in the book. It does take a while to process, though. We could use a subsample, maybe?

rhiever commented 7 years ago

What's "a while"? Minutes, hours, days? :-)

Either way, yes---using a subsample is probably the way to go.

rasbt commented 7 years ago

> Either way, yes---using a subsample is probably the way to go.

I agree. I think one idea was to motivate why we sometimes need to opt for a hashing vectorizer and/or an out-of-core learning algorithm when the data doesn't fit into memory. However, a smaller subsample would be fine (after shuffling).
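
For reference, here is a minimal sketch of the hashing vectorizer plus out-of-core idea. It assumes scikit-learn's `HashingVectorizer` and `SGDClassifier.partial_fit`. The toy `docs` list and the `iter_minibatches` helper are hypothetical stand-ins for a real review stream read from disk, not the tutorial's actual code.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Hypothetical toy data standing in for a stream of labeled reviews.
docs = [
    ("a wonderful, moving film", 1),
    ("terrible plot and awful acting", 0),
    ("great cast and a great story", 1),
    ("boring, awful, a waste of time", 0),
] * 25


def iter_minibatches(pairs, batch_size):
    """Yield (texts, labels) chunks so only one batch is vectorized at a time."""
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        yield [text for text, _ in batch], [label for _, label in batch]


# HashingVectorizer is stateless: it needs no fit() and stores no vocabulary,
# so the feature space has a fixed size regardless of how much text is seen.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
clf = SGDClassifier(random_state=0)
classes = [0, 1]  # partial_fit must be told all classes up front

for texts, labels in iter_minibatches(docs, batch_size=20):
    X = vectorizer.transform(texts)
    clf.partial_fit(X, labels, classes=classes)

pred = clf.predict(vectorizer.transform(["a great, wonderful story"]))
```

The point of the design is that neither the vocabulary nor the full document-term matrix ever has to sit in memory. Each mini-batch is hashed and fed to the classifier, then discarded.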

Coincidentally, I've used the dataset in my book as well :P And yeah, people were complaining that it takes too long (~5-10 minutes), and when they chose a subsample, the performance was really bad -- or in other words, people want the best of both worlds sometimes ... However, for the tutorial, I agree that having a subsample would be really necessary to keep on schedule ;)
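
A rough sketch of drawing a shuffled subsample, using only the standard library. The `shuffled_subsample` helper and the toy `texts`/`labels` lists are hypothetical. The real dataset would be loaded from disk, and the right subsample size would need to be found by experiment.

```python
import random


def shuffled_subsample(texts, labels, n_samples, seed=0):
    """Return n_samples (text, label) pairs drawn at random without replacement."""
    if n_samples > len(texts):
        raise ValueError("n_samples exceeds the dataset size")
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    indices = rng.sample(range(len(texts)), n_samples)
    return [texts[i] for i in indices], [labels[i] for i in indices]


# Hypothetical toy data sorted by label, which is why shuffling matters:
# taking the first 20 rows would yield only negative reviews.
texts = [f"negative review {i}" for i in range(50)] + [
    f"positive review {i}" for i in range(50)
]
labels = [0] * 50 + [1] * 50

sub_texts, sub_labels = shuffled_subsample(texts, labels, n_samples=20)
```

Sampling by index keeps each text paired with its label, and the fixed seed means everyone in the tutorial gets the same subsample.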