flennerhag / mlens

ML-Ensemble – high performance ensemble learning
http://ml-ensemble.com
MIT License
841 stars 107 forks

Error when propagating features from sparse matrix #46

Closed jattenberg closed 7 years ago

jattenberg commented 7 years ago

I'm trying to use mlens in a system I'm developing but, based on the documentation and the code, it's not really clear to me what propagate_features values I should use given my data. Could you offer a bit of additional explanation in the tutorial so I know what should go in?

flennerhag commented 7 years ago

Hi,

Sorry about that. I'm traveling at the moment, but to give you a quick explanation: the argument should be a list of integers giving the column indices of the features you want to propagate. Internally, a slice X[..., your_list] of the input matrix is propagated to the output matrix.

Importantly, in the output matrix, the propagated features are the leftmost columns. So if you pass [3, 6, 9] as propagate_features to layer 1, then in the output matrix of layer 1 these columns will have indices [0, 1, 2].
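To make the index shift concrete, here is a minimal numpy sketch (shapes and values are made up for illustration; this is not the mlens internals, just the slicing and stacking behaviour described above):

```python
import numpy as np

# Hypothetical layer input with 10 features; propagate columns 3, 6 and 9.
X = np.arange(50).reshape(5, 10).astype(float)
propagate_features = [3, 6, 9]

# Suppose the layer's base learners produce 2 prediction columns.
preds = np.zeros((5, 2))

# Propagated features occupy the leftmost columns of the output matrix,
# followed by the layer's predictions.
P_out = np.hstack([X[..., propagate_features], preds])

# Column 3 of X is now column 0 of P_out, column 6 is column 1, and so on.
assert np.array_equal(P_out[:, 0], X[:, 3])
assert np.array_equal(P_out[:, 2], X[:, 9])
```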

Hope this helps for now, let me know if you have any other questions!

jattenberg commented 7 years ago

Ok this is how I was interpreting things as well. However, I was encountering the error:

P_out[:, :lyr.n_feature_prop] = P_in[r:, lyr.propagate_features]
ValueError: setting an array element with a sequence.

This seems to occur only when passing high-dimensional inputs through the propagate_features parameter. I don't know what python does internally with this kind of array slicing, but it seems to break here. I'll continue digging.
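For reference, the assignment pattern in the traceback is ordinary numpy fancy indexing, and it works fine when both sides are plain dense arrays (a minimal sketch with made-up shapes, mimicking the line from the traceback):

```python
import numpy as np

# Stand-ins for the arrays in the traceback (shapes are illustrative).
P_in = np.arange(20).reshape(4, 5).astype(float)
P_out = np.zeros((4, 2))
propagate_features = [1, 3]

# A row slice combined with a column index list returns a (4, 2) dense
# array, so assigning it into a matching dense slice succeeds.
P_out[:, :2] = P_in[0:, propagate_features]
assert np.array_equal(P_out[:, 0], P_in[:, 1])
```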

flennerhag commented 7 years ago

Seems like you've encountered a bug. The error message suggests that the left-hand side of the assignment is an element while the right-hand side is an array. I made a quick attempt to replicate the error but couldn't. Could you supply a minimal working example that generates the error?

Also, what do you mean with "high-dimensional inputs"?

jattenberg commented 7 years ago

Working through a minimal reproducing example, but this seems to need a bit more digging than previously thought. I'll keep working.

By "high-dimensional inputs" I mean cases where the number of covariates is very high, e.g. in numpy terms, X.shape[1] is some very large number, as is typically encountered in, for instance, text classification, recommender systems, or networked environments.

flennerhag commented 7 years ago

Ok, thanks. Just wanted to check that you weren't passing some 3-D tensor or such. I didn't get any error when I tried random numpy matrices with 1000 columns, so I'm not sure it's due to the size of the input matrix.

The error specifically says that P_in[r:, lyr.propagate_features] is an array, while P_out[:, :lyr.n_feature_prop] is an element. Since the output matrix shouldn't be an element, my guess is that something funky is going on with P_out. A few things to check:

  1. What is the shape of the output matrix of the ensemble without feature propagation? Does that look like it should?
  2. What is the length of the layer's propagate_features attribute? I.e. len(ens.layer_1.propagate_features). Is that the same number as ens.layer_1.n_feature_prop?

jattenberg commented 7 years ago

Ok! I've been able to isolate the error. See here. The problem seems to be with sparse input arrays (e.g. from transformed text values). I wonder if there's a more numpy-friendly way to do the array slicing?

flennerhag commented 7 years ago

Great work! I can confirm that the error occurs when the input matrix is a scipy.sparse matrix.

The problem is as follows: the P_out matrix is a numpy array, while P_in is a scipy sparse matrix. In the eyes of numpy, P_in is a single element, hence the error. The simplest solution is to convert P_in[r:, lyr.propagate_features] to a numpy array, perhaps with a warning. This requires some sort of isinstance test followed by a toarray call. That approach has no impact on the rest of the codebase, but the drawback is that it would convert the sparse input to a dense matrix, which could blow up both memory usage and fitting time. An alternative is to make P_out sparse, but I'm not sure this works with memmapping when populating P_out with predictions.
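A minimal sketch of the failure mode and the toarray workaround described above (shapes are made up, and the exact ValueError message can vary across numpy versions, so this only illustrates the mechanism):

```python
import numpy as np
import scipy.sparse as sp

# P_in is sparse (as after a text vectoriser); P_out is a dense buffer.
P_in = sp.csr_matrix(np.arange(12).reshape(4, 3).astype(float))
P_out = np.zeros((4, 2))
propagate_features = [0, 2]

# Assigning a sparse slice into a dense numpy array fails: numpy treats
# the sparse matrix as a single opaque object rather than a 2-D array.
try:
    P_out[:, :2] = P_in[:, propagate_features]
except ValueError:
    pass  # e.g. "setting an array element with a sequence."

# Densifying the slice first works, at the cost of materialising it.
P_out[:, :2] = P_in[:, propagate_features].toarray()
```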

jattenberg commented 7 years ago

I wonder if there might be a better approach using sklearn pipelines? In particular, sklearn.pipeline.FeatureUnion? In my internal application, I have been mixing and selecting sparse and dense arrays using pipeline functionality.

flennerhag commented 7 years ago

Thanks for the idea. I took a look at the source code for the FeatureUnion class. Essentially, it casts the output matrix as sparse if any of the intermediate feature matrices is sparse. We can adopt the same logic here by making P_out sparse whenever P_in is sparse.
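A rough sketch of that dispatch logic (build_output is a hypothetical helper, not the mlens or sklearn API; it just mirrors the FeatureUnion-style rule of keeping the output sparse when the input is sparse):

```python
import numpy as np
import scipy.sparse as sp

def build_output(P_in, preds, propagate_features):
    """Concatenate propagated columns with a layer's predictions,
    keeping the result sparse when the input is sparse."""
    propagated = P_in[:, propagate_features]
    if sp.issparse(P_in):
        # Sparse path: stack sparsely instead of densifying the input.
        return sp.hstack([propagated, sp.csr_matrix(preds)]).tocsr()
    # Dense path: ordinary numpy column stacking.
    return np.hstack([propagated, preds])

X_sparse = sp.csr_matrix(np.eye(4))
preds = np.ones((4, 1))
out = build_output(X_sparse, preds, [0, 2])
assert sp.issparse(out) and out.shape == (4, 3)
```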

I've made a first pass at implementing this in the feat_prop branch. Would be great if you could give it a go on your problem and check that it solves the issue.

jattenberg commented 7 years ago

Yes, this seems to work! :1st_place_medal:

flennerhag commented 7 years ago

Great. Issue solved by PR #48.