Docs show using VectorType instead of concatenating features

MichaelSimons commented 5 years ago

@nicolehaugen commented on Thu Jul 11 2019

In numerous places in the docs, we show features being stored as a VectorType. However, this isn't ideal because it doesn't let you easily do feature engineering, where you pick and choose the most influential features to include when training a model. Instead, to easily support feature engineering, it's recommended to concatenate your features as part of the pipeline.

For example, here are a few places where we show using a VectorType:

1. https://docs.microsoft.com/en-us/dotnet/machine-learning/how-to-guides/load-data-ml-net#create-the-data-model
2. https://github.com/dotnet/machinelearning/blob/master/docs/code/MlNetCookBook.md#how-do-i-load-data-from-a-text-file

Instead, we should show feature concatenation and explain why this is a preferred approach - for example:

   // featureCols holds the names of the individual feature columns to combine;
   // changing that list is all it takes to include or exclude a feature.
   IEstimator<ITransformer> dataPipeline = mlContext.Transforms.Concatenate(FeaturesVectorName, featureCols)
       .Append(mlContext.Transforms.Conversion.MapValueToKey(nameof(SearchResultData.Label)))
       .Append(mlContext.Transforms.Conversion.Hash(nameof(SearchResultData.GroupId), nameof(SearchResultData.GroupId), numberOfBits: 20));

Also, why is there a VectorType attribute? Are there ever benefits to using it? If not, we should consider removing it.

@codemzs commented on Fri Jul 12 2019

Hi @nicolehaugen, can you please point out the specific docs you are referring to? VectorType is used to indicate the size of the vector at compile time.

@nicolehaugen commented on Mon Jul 15 2019

Refer to the links that I provided above. The first one shows use of the VectorType attribute, the second one shows loading features from a text file into a vector column. Also, this link specifically says:

When the input file contains many columns of the same type, always intended to be used together, we recommend loading them as a vector column from the very start: this way the schema of the data is cleaner, and we don't incur unnecessary performance costs.

These docs led me to believe that I should be loading my features as a vector column as part of my data structure. However, I later learned that this really isn't the preferred approach, since it doesn't let the user easily apply feature engineering by including or excluding different columns from the features used for training. It would be helpful to have the docs make this point clear by stating when it's beneficial to load features as a single vector column vs. loading the features as separate properties (and then doing a Concatenate) prior to training.
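To make the two options concrete, here is a minimal sketch of both shapes. The class names (`HouseVector`, `HouseColumns`), the property names, and the sample values are hypothetical and do not come from the linked docs; only `LoadColumn`, `VectorType`, `LoadFromEnumerable`, and `Concatenate` are ML.NET APIs.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Option A (hypothetical class): three numeric file columns loaded as one vector.
// Dropping a single feature later means changing the schema or dropping slots.
public class HouseVector
{
    [LoadColumn(0)]
    public float Label { get; set; }

    [LoadColumn(1, 3)]   // file columns 1 through 3
    [VectorType(3)]      // fixed vector size declared on the property
    public float[] Features { get; set; }
}

// Option B (hypothetical class): the same columns as separate properties.
public class HouseColumns
{
    [LoadColumn(0)] public float Label { get; set; }
    [LoadColumn(1)] public float Size { get; set; }
    [LoadColumn(2)] public float Bedrooms { get; set; }
    [LoadColumn(3)] public float Age { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext(seed: 0);

        // Made-up in-memory rows so the sketch needs no data file.
        var rows = new[]
        {
            new HouseColumns { Label = 250f, Size = 1200f, Bedrooms = 3f, Age = 10f },
            new HouseColumns { Label = 310f, Size = 1600f, Bedrooms = 4f, Age = 5f },
        };
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);

        // With Option B, choosing features is a one-line change:
        // here Age is left out simply by not listing it.
        var pipeline = mlContext.Transforms.Concatenate(
            "Features",
            nameof(HouseColumns.Size),
            nameof(HouseColumns.Bedrooms));

        IDataView transformed = pipeline.Fit(data).Transform(data);

        // Inspect the type of the concatenated column.
        Console.WriteLine(transformed.Schema["Features"].Type);
    }
}
```

With Option A the feature set is fixed by the input class itself, which is the inflexibility described above. With Option B the feature list lives in the pipeline and can be varied between experiments.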

@codemzs commented on Mon Jul 15 2019

I think the docs are clear when they say "always intended to be used together". The reason we suggest this is performance: it prevents unnecessary passes over the dataset. You can always use SlotDroppingTransformer to drop slots and join with other columns. I believe this transformer for some reason is not publicly exposed, but that can be made a different issue.

@justinormont commented on Tue Jul 16 2019

I very much prefer the columns separated out to encourage users to explore feature engineering techniques, much as mentioned by @nicolehaugen.

Though as a counterexample, I just joined the 139,352 dense columns in the Dorothea dataset into a single VectorType. I did this for speed: an AveragedPerceptron was taking >45 min with the 140k discrete columns, and ~13 min with a VectorType. As a disclosure, I am testing this dataset only because it has a large number of columns.

We may want to recommend VectorType above some, rather high, number of columns. We would have to investigate where the transition point is in terms of speed, and balance that against the ease of feature engineering gained by keeping the columns separated.

We may want to say, "Keeping the columns separated allows for ease and flexibility of feature engineering, but for a very large number of columns (>N), operating on the individual columns causes a speed impact." I'm uncertain what we should recommend for N. Best to figure out the speed tradeoffs through benchmarks.

@codemzs commented on Tue Jul 16 2019

Sounds good. @justinormont / @nicolehaugen feel free to make the change to the docs and I’ll review them. Thanks!

@nicolehaugen commented on Fri Jul 26 2019

I am going to log a new bug under docs for this issue so that the documentation team can address this.

AB#1572600

nicolehaugen commented 5 years ago

The gist of this is that the docs need to clarify the trade-offs and benefits of using the VectorType attribute for loading features vs. loading features separately and then concatenating them in the pipeline. Today, it's not clear which one is preferable depending on the needs of the user.

luisquintanilla commented 5 years ago

Adding the open issue dotnet/machinelearning#3202 to the discussion to showcase a form of VectorType usage.

In short:

The training process may originally be done using a file or subset of the data where LoadColumn can be used in the input class schema definition. However, scoring may pass in a single observation over HTTP or provide the data in a format that is not stored in a file. In that case, scoring would throw errors because the data types (Single vs Vector) for those columns that are meant to be used together are inconsistent.
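As a rough illustration of keeping the shapes consistent, the sketch below builds a single in-memory observation with the same vector-shaped class that would be used for file loading, then inspects the resulting column type before any scoring takes place. The `Observation` class and its values are hypothetical, and this is not taken from dotnet/machinelearning#3202. It only shows one way to check whether a column is a `Single` vector of the expected size.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical input class shared by training (file) and scoring (in-memory).
public class Observation
{
    [LoadColumn(0, 2)]   // only relevant when loading from a text file
    [VectorType(3)]      // declares a fixed-size vector in the schema
    public float[] Features { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext();

        // A single observation, e.g. one that arrived over HTTP (made-up values).
        var single = new[] { new Observation { Features = new[] { 1f, 2f, 3f } } };
        IDataView data = mlContext.Data.LoadFromEnumerable(single);

        // Check that the column is a vector of Single rather than a scalar Single.
        DataViewType type = data.Schema[nameof(Observation.Features)].Type;
        if (type is VectorDataViewType vector && vector.ItemType.Equals(NumberDataViewType.Single))
        {
            Console.WriteLine($"Vector column of size {vector.Size}");
        }
        else
        {
            Console.WriteLine($"Unexpected column type: {type}");
        }
    }
}
```

If the scoring side instead exposed each value as its own `float` property, the schema would contain scalar columns, and a model trained on a vector column would need a Concatenate step (or a matching input class) in front of it.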

natke commented 5 years ago

@codemzs Has the team looked at dotnet/machinelearning#3202? Any recommendations on this issue?

dotnet / docs

Docs show using VectorType instead of concatenating features #13589