MicrosoftDocs / azure-docs

Open source documentation of Microsoft Azure
https://docs.microsoft.com/azure
Creative Commons Attribution 4.0 International

Column Level visibility when auto-process is enabled #20324

Closed kumaranpravin closed 5 years ago

kumaranpravin commented 5 years ago

If pre-processing is set to True, I know that some pre-processing steps are applied to the columns. In my case, I have a date column, and the result shows that a total of 73 features are derived from it. Is there any way to see the underlying transformations? That is, I want to see all 73 features and how they are derived.



YutongTie-MSFT commented 5 years ago

@kumaranpravin Thanks for the feedback! I have assigned the issue to the content author to investigate further and update the document as appropriate.

@nacharya1 Hi, do we have any documentation on this visibility? Or is there any way to do that? Thanks.

nacharya1 commented 5 years ago

@kumaranpravin, I am looking into this request.

nacharya1 commented 5 years ago

@kumaranpravin When preprocessing is turned on, the following processing occurs. The 73 features you are observing are created by these transformers working together. They are intended to improve the accuracy and performance of the model.

  1. Dropping high cardinality or no variance features
     - Features with no useful information are dropped from the training and validation sets. These include features with all values missing, the same value across all rows, or extremely high cardinality (e.g., hashes, IDs or GUIDs).
  2. Missing value imputation
     - For numerical features, missing values are imputed with the average of the values in the column.
     - For categorical features, missing values are imputed with the most frequent value.
  3. Generating additional features
     - For DateTime features: Year, Month, Day, Day of week, Day of year, Quarter, Week of the year, Hour, Minute, Second.
     - For Text features: Term frequency based on bi-grams and tri-grams, Count vectorizer.
  4. Transformations and encodings
     - Numeric features with very few unique values are transformed into categorical features.
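To make steps 1 to 3 above more concrete, here is a minimal sketch in plain Python (standard library only). It is not the service's implementation and does not reproduce the 73 features mentioned in this thread. The function names, the `order_date` column name, and the `<column>_<part>` naming scheme are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime
from statistics import mean


def has_no_variance(values):
    """True when all observed values are identical (or all are missing)."""
    return len({v for v in values if v is not None}) <= 1


def impute_numeric(values):
    """Replace None with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_categorical(values):
    """Replace None with the most frequent observed value."""
    fill = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def expand_datetime(name, value):
    """Derive the ten calendar parts listed above, keyed by descriptive names."""
    return {
        f"{name}_year": value.year,
        f"{name}_month": value.month,
        f"{name}_day": value.day,
        f"{name}_day_of_week": value.weekday(),          # Monday = 0
        f"{name}_day_of_year": value.timetuple().tm_yday,
        f"{name}_quarter": (value.month - 1) // 3 + 1,
        f"{name}_week_of_year": value.isocalendar()[1],  # ISO week number
        f"{name}_hour": value.hour,
        f"{name}_minute": value.minute,
        f"{name}_second": value.second,
    }


# Hypothetical column name and timestamp, for illustration only.
features = expand_datetime("order_date", datetime(2018, 12, 5, 14, 30, 15))
```

Keeping the source column name in each derived key is one way to preserve a link between an engineered feature and the column it came from.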
kumaranpravin commented 5 years ago

@nacharya1 I understand that some preprocessing techniques are applied when the pre-process flag is set to True, but I would like to find all of the converted features. Say I am using a classification model and want to show the feature importance: I believe it currently displays them as column_1, column_2, etc. It would be better if there were a way to uncover the actual feature names instead of names like column_1...column_73.
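As a sketch of what is being asked for here: if the preprocessing is done outside the service, so that a list of derived feature names is available in column order, importance scores can be paired with those names instead of generic `column_N` labels. The helper name, the feature names, and the scores below are hypothetical and for illustration only; nothing here is an API of the service.

```python
def label_importances(feature_names, importances):
    """Pair each importance score with its feature name, highest score first."""
    if len(feature_names) != len(importances):
        raise ValueError("expected one name per importance score")
    return sorted(
        zip(feature_names, importances),
        key=lambda pair: pair[1],
        reverse=True,
    )


# Hypothetical names and scores, for illustration only.
names = ["order_date_year", "order_date_month", "order_date_day_of_week"]
scores = [0.10, 0.55, 0.35]
ranked = label_importances(names, scores)
```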

nacharya1 commented 5 years ago

@kumaranpravin, we do not expose these details today. We are working on providing more insight into preprocessing and plan to share more details in the new year.

kumaranpravin commented 5 years ago

@nacharya1 Thanks for the info. Do you have an estimated time for this? I need it very urgently.

nacharya1 commented 5 years ago

@kumaranpravin, we are working on this with high priority, but we don't have any further details to share at this time.

YutongTie-MSFT commented 5 years ago

@kumaranpravin We will now proceed to close this thread. If there are further questions regarding this matter, please tag me in your reply. We will gladly continue the discussion and we will reopen the issue.