amueller / scipy-2016-sklearn

Scikit-learn tutorial at SciPy2016
Creative Commons Zero v1.0 Universal

out of core nb 27 #35

Closed rasbt closed 8 years ago

rasbt commented 8 years ago

fixes #15

amueller commented 8 years ago

is the manual garbage collection necessary? the hashing vectorizer should be really small anyhow, right?

amueller commented 8 years ago

otherwise lgtm

rasbt commented 8 years ago

I used this previously in a different context because I had massive memory issues (I later figured out that I had fit the pipeline via .fit(docs_train, docs_train) instead of .fit(docs_train, y_train)); that was eating up memory like nothing else ...

In any case, I thought we could leave this in there just as a general thing for people who are using Jupyter notebooks with many large objects ... however, we could also remove it.

amueller commented 8 years ago

I find gc'ing the hashing vectorizer weird, since the whole point of it is that it doesn't take up a lot of memory. gc'ing the count vectorizer is fine.
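(For context: the hashing vectorizer stays small because it never stores a vocabulary; each token is hashed straight to a column index. A minimal pure-Python sketch of that idea — `hashed_bow` is a hypothetical helper, and scikit-learn's actual `HashingVectorizer` uses signed MurmurHash3 rather than md5:)

```python
import hashlib
from collections import Counter

def hashed_bow(tokens, n_features=2**10):
    """Map tokens to a fixed-size count vector without storing a vocabulary.

    Hypothetical helper illustrating the hashing trick; sklearn's
    HashingVectorizer uses MurmurHash3 with a sign bit, not md5.
    """
    counts = Counter()
    for tok in tokens:
        # Hash the token straight to a bucket index -- no dict of seen
        # words is ever kept, so memory use is independent of corpus size.
        idx = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) % n_features
        counts[idx] += 1
    return counts

vec = hashed_bow("the cat sat on the mat".split())
```

Since the mapping is stateless, there is nothing fitted to garbage-collect — unlike a `CountVectorizer`, whose `vocabulary_` dict grows with the number of distinct tokens.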

rasbt commented 8 years ago

Yeah, I am going to remove it now; for the hashing vectorizer that's really stupid. I just wanted it for the counting one, but the vocab is only 75,000 entries or so.

amueller commented 8 years ago

I mean, you could keep it for the count vectorizer, saying "let's remove this because the vocab is so large", but yeah, it's not really that large ^^

amueller commented 8 years ago

feel free to merge after update

rasbt commented 8 years ago

I see now what you mean; I had it in there twice. I think I'll just insert one before the Out-of-core learning section:

```python
import gc

del count_vec
del h_pipeline

gc.collect()
```

and explain that we can do this to get rid of objects that we are not going to use anymore, although they are not thaaaaat large in this case (it's more to bring the point across, since people may create many of these when experimenting with hyperparameters or so)
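(A self-contained sketch of that pattern, with a throwaway list standing in for the fitted `count_vec` / `h_pipeline` objects from the notebook:)

```python
import gc

# Hypothetical stand-in for a large fitted object such as count_vec or
# h_pipeline: a list holding a million ints takes tens of MB.
big_object = list(range(1_000_000))
holder = {"ref": big_object}  # a second reference, as a pipeline might keep

# Deleting every name drops the reference count to zero, so CPython frees
# the memory immediately; gc.collect() additionally reclaims objects stuck
# in reference cycles and returns how many unreachable objects it found.
del big_object
del holder
unreachable = gc.collect()
```

Note that `del` alone is usually enough in CPython (reference counting frees the object at once); the explicit `gc.collect()` only matters for cyclic garbage, but it makes the intent obvious in a notebook.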