Open rileyhun opened 4 years ago
@rileyhun it looks like there are some missing imports. Can you fill those out?
And is this a minimal example? Do you need the timing stuff, print statements, etc? See http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
It also looks like `X_train` isn't defined.
@TomAugspurger Added more details to the original post
Thanks @rileyhun. It seems like `data` is undefined.
@TomAugspurger, I made one more edit to the original comment -- I am defining `data`.
@rileyhun see http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports for writing bug reports. I don't have that CSV file. Since the issue isn't with reading a CSV you could ideally create the dataset in memory.
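For example, a self-contained report could build the sample rows directly in memory rather than reading a CSV (this is just a sketch using the column names from the snippet later in this thread):

```python
import pandas as pd

# Build a small in-memory frame mirroring the sample data, so no CSV is needed
data = pd.DataFrame({
    "entity_name": ["great tech", "us navy", "ministry of finance"],
    "classification": ["other", "military", "government"],
})
X = data["entity_name"]
y = data["classification"]
```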
As mentioned in the original post, the grid search works without dask as the back-end. I am now getting this error when I run it again using dask:
ValueError: X has 205757 features per sample; expecting 206501
Here is a snippet of the dataset:
entity_name | classification
--- | ---
great tech | other
xfone communication ltd | other
coventrys | other
pt invensys indonesia | other
massillon cable tv inc | other
city of New York | government
police department | government
ministry of finance | government
US Navy | military
US Army | military
AFB | military
Let me know if you can provide a reproducible example.
Okay, I re-ran a third time and got the same error.
ValueError: X has 207586 features per sample; expecting 205996
The search space I am using is just 2 params:
param_grid = {
"model__tol": [0.001, 0.01]
}
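For reference, a minimal self-contained sketch of the kind of search being run (the vectorizer, estimator, and pipeline step names below are assumptions for illustration; the original pipeline is not shown in this thread):

```python
# Hedged sketch: the actual pipeline isn't shown here, so the vectorizer,
# model, and step names are assumptions chosen to match the discussion.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ("vect", TfidfVectorizer(analyzer="char", ngram_range=(2, 10))),
    ("model", LogisticRegression()),
])
param_grid = {"model__tol": [0.001, 0.01]}
search = GridSearchCV(pipe, param_grid, cv=3)

# To run it on the cluster, the fit would be wrapped in the dask backend:
# import joblib
# from dask.distributed import Client
# client = Client("tcp://<scheduler-ip>:8786")
# with joblib.parallel_backend("dask"):
#     search.fit(X_train, y_train)
```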
I am using Python 3.7.3 and Dask 2.14
Is Dask grid search always supposed to outperform the Loky backend? It's also noticeably slower, even though I'm using a cluster with 5 dask workers, each with 12 CPUs.
I won't be able to help until you provide a minimal, reproducible example.
The code under Code Example is copy-pasteable. You just need to change the cluster IP endpoint.
@rileyhun Why `ngram_range=(2, 10)`? That's a ton of n-grams, and it results in a large memory and computation cost. I think `ngram_range=(1, 4)` is typical (or some upper bound smaller than 4). When I set `ngram_range=(2, 4)`, the error disappears.

It looks like the number of features is changing, which is alarming. I'm not sure why.

In a distributed context, a `HashingVectorizer` is often preferred over `CountVectorizer`/`TfidfVectorizer` because it's stateless.
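A small sketch of the statelessness point (parameter values here are illustrative): `HashingVectorizer` maps text to a fixed-width feature space, so its output width never depends on which rows it sees, while `CountVectorizer` learns a vocabulary from whatever data it is fitted on.

```python
from sklearn.feature_extraction.text import HashingVectorizer, CountVectorizer

docs = ["us navy", "us army", "ministry of finance", "city of new york"]

# Stateless: the feature-space width is fixed up front by n_features,
# regardless of which subset of documents is transformed.
hv = HashingVectorizer(analyzer="char", ngram_range=(2, 4), n_features=2**18)
width_a = hv.transform(docs[:2]).shape[1]
width_b = hv.transform(docs[2:]).shape[1]  # same width as width_a

# Stateful: each fit learns a vocabulary from the data it sees, so the
# feature count depends on the split -- which is why counts can vary per fit.
cv = CountVectorizer(analyzer="char", ngram_range=(2, 4))
n1 = cv.fit_transform(docs[:2]).shape[1]
n2 = cv.fit_transform(docs[2:]).shape[1]
```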
Keep in mind that I'm using character n-grams, not word n-grams. As such, I've found that the (2, 10) range is good at picking up deviations in spelling. I could try a smaller range though and re-run and see if that impacts the accuracy.
I am not an expert, but during cross-validation, would the number of features change due to a different assortment of entity names?
I'll also look into HashingVectorizer.
Thanks!
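As an illustrative sketch of why character n-grams help with spelling deviations (the example strings are taken from the sample data; the analyzer choice is an assumption):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Character n-grams let near-duplicate spellings share most of their
# features, whereas word n-grams would treat them as distinct tokens.
vect = CountVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X = vect.fit_transform(["coventry", "coventrys"])

# Count how many n-gram features the two spellings have in common
shared = X[0].multiply(X[1]).nnz
```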
character n-grams, not word n-grams.
Whoops, I missed that. Never mind.
> during cross-validation, would the number of features change due to a different assortment of entity names?

I would expect that, because different words will be given to different CV splits, but I'm not seeing why that's an issue. The code runs fine when `joblib.parallel_backend('dask')` is commented out.
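A quick sketch of that expectation: the vocabulary (and hence feature count) a `CountVectorizer` learns depends on which rows land in each CV split. The documents below are made up from the sample data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import KFold

docs = np.array(["great tech", "us navy", "us army",
                 "ministry of finance", "police department", "coventrys"])

# Fit a fresh vectorizer on each CV training fold and record the learned
# vocabulary size; the counts are split-dependent, matching the varying
# "X has N features per sample" numbers in the error above.
sizes = []
for train_idx, _ in KFold(n_splits=3, shuffle=True, random_state=0).split(docs):
    vect = CountVectorizer(analyzer="char", ngram_range=(2, 4))
    vect.fit(docs[train_idx])
    sizes.append(len(vect.vocabulary_))
```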
I think the next steps will come down to finding a single representative example. http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports has some tips. I'd start by commenting various things out and seeing how far I can go.
I re-ran using a smaller n-gram range and also using `HashingVectorizer` instead, and I haven't run into this error thus far.
Thanks for these tips! Appreciate it!
I ran into a similar bug with `HyperbandSearchCV`. It starts with `client.compute(fit_params)` and ends in the same error (`KeyError: 'data'`). Here's the traceback:
I've done some debugging, and have resolved some issues (making sure valid parameters are passed, etc). I haven't seen this error since; I'll report again if I do.
Ran into this error as well... Have you made progress on getting around this @stsievert or @rileyhun ?
@vinodshanbhag as I mentioned in https://github.com/dask/dask-ml/issues/636#issuecomment-635544754, I got around it by cleaning my workflow "(passing valid parameters, etc)." It'd be great if you have a minimal working example (http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports).
I am getting the following error when running a gridsearch on dask distributed back-end. This error is nonexistent when just running sklearn gridsearch on single core local machine. I don't know where that KeyError is coming from; I don't have anything in my pipeline that references the key 'data'.
Here is the full error traceback I am getting:
Sample Dataset
Code Example
There are no conflicts between scheduler, client and the dask workers.