amueller / scipy-2016-sklearn

Scikit-learn tutorial at SciPy2016
Creative Commons Zero v1.0 Universal

redo text with imdb instead of text messages #13

amueller closed this issue 8 years ago

amueller commented 8 years ago

14 Application: IMDB Movie Review Sentiment Analysis

rasbt commented 8 years ago

Oh, I just commented on #15 regarding the dataset. I think it may be better to stick with the small dataset in intro notebook 14 and use the big IMDb one for #28 (out-of-core).

I have a parsed CSV of the dataset here: https://github.com/rasbt/python-machine-learning-book/tree/master/code/datasets/movie

It might be better to read it from there via the fetch_data.py script. The original is basically a hierarchical directory structure of 50,000 files, which may take too long to parse.

amueller commented 8 years ago

There is the load_files function, which does it. It takes a couple of seconds, but that's not too bad, I think.
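For reference, a minimal sketch of how load_files from sklearn.datasets reads a one-folder-per-class directory tree. The temporary directory, the pos/neg folder names, the file names, and the review texts are made-up stand-ins for the real IMDb download, chosen only so the snippet runs on its own:

```python
import tempfile
from pathlib import Path

from sklearn.datasets import load_files

# Hypothetical miniature of the IMDb layout: one subdirectory per class,
# one text file per review. Names and texts here are invented.
root = Path(tempfile.mkdtemp())
samples = {
    "neg": ["Dull plot and wooden acting.", "I want those two hours back."],
    "pos": ["A moving, beautifully shot film.", "Great cast and a sharp script."],
}
for label, texts in samples.items():
    (root / label).mkdir()
    for i, text in enumerate(texts):
        (root / label / f"{i}.txt").write_text(text, encoding="utf-8")

# load_files takes the class labels from the subdirectory names.
# encoding= makes it return decoded strings instead of raw bytes;
# shuffle=False keeps the files in sorted order.
bunch = load_files(str(root), encoding="utf-8", shuffle=False)

print(bunch.target_names)  # class names, taken from the folder names
print(len(bunch.data))     # number of documents read
print(bunch.target)        # integer label per document
```

On the real dataset the same call would point at the extracted train or test folder instead of the temporary directory.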

rasbt commented 8 years ago

Okay, nice! I'll use that one then!

rasbt commented 8 years ago

I think we can close this. We are using the SMS spam dataset for the text-classification intro, and the IMDb dataset for out-of-core learning (as per #35).