Closed singh-krishan closed 2 years ago
Hi @singh-krishan ,
Thanks for your question. Make sure you fit estimators only on training data. This means you should call fit(), fit_transform(), or fit_predict() only on training data, never on other data (such as the validation set, the test set, or new data). In your code, you should therefore replace full_pipeline.fit_transform(some_data) with full_pipeline.transform(some_data). However, before you do that, you should first fit the pipeline on the training set.
So the code should look like:
housing_prepared = full_pipeline.fit_transform(housing)
some_data_prepared = full_pipeline.transform(some_data)
In the full training set, there are 5 distinct values in the ocean_proximity column. That's why, after the full_pipeline is fit on the training set, it outputs a one-hot vector of size 5 for each ocean_proximity value. But if some_data is small enough, it is likely to contain fewer categories, so calling fit_transform(some_data) refits the encoder on just those categories and produces shorter one-hot vectors, which is what you observed. If you only call transform(some_data), and not fit_transform(some_data), it will output one-hot vectors of size 5.
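Here is a minimal sketch of that difference, using a standalone OneHotEncoder rather than your full_pipeline (which I am not reproducing here). The toy lists train and subset are made up for illustration and only stand in for the ocean_proximity column:

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-ins for the ocean_proximity column:
# 5 distinct categories in the "training set", only 2 in the small subset.
train = [["<1H OCEAN"], ["INLAND"], ["ISLAND"],
         ["NEAR BAY"], ["NEAR OCEAN"], ["INLAND"]]
subset = [["INLAND"], ["NEAR BAY"], ["INLAND"]]

# Fit once on the training data, then only transform the subset.
encoder = OneHotEncoder()
encoder.fit(train)
transformed = encoder.transform(subset).toarray()

# Refitting on the subset makes the encoder learn only the subset's categories.
refit = OneHotEncoder().fit_transform(subset).toarray()

print(transformed.shape)  # one column per category seen in train
print(refit.shape)        # one column per category seen in subset
```

With this toy data, transformed has 5 columns while refit has only 2, so anything downstream that was trained on the 5-column layout would no longer line up with the refit output.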
Hope this helps.
thanks @ageron , makes sense
Hi @ageron , I have defined the full pipeline as:
where the num_pipeline is defined as:
Now, when I execute this code:
I get this error:
I noticed that when I pass the whole housing dataset to the full_pipeline, the ocean_proximity column is transformed into 5 different columns, resulting in a total of 13 fields. But when I pass only a subset of the dataset (i.e. housing.iloc[:5]), the transformation is not applied to the ocean_proximity column.
Any suggestions on what could be wrong?
Thanks a lot