Open SomaSapien opened 4 months ago
Hi @SomaSapien, thank you for the feedback and all the details provided. There are some previously opened issues linked to this subject - we're aware of this issue. We are currently trying to update the subject and audit to make this exercise (and the overall project) more consistent.
Hi @nprimo , many thanks for such a speedy response! 😀
Describe the bug
For Exercise 5 in the PIPELINE quest, the supplied answers in the audit deviate significantly from those that I and other students are getting (for questions 2, 3, and 4). We have reviewed / revised our code several times, whilst also inspecting the dataset for possible errors. We suspect that either the dataset has been updated since the audit answers were last reviewed, or that there might be some kind of system architecture dependency baked in somewhere (despite `random_state=43` when splitting into training / test sets). We are working with Mac / ARM system architectures.
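For what it's worth, `train_test_split` itself should not depend on the OS or CPU architecture: with a fixed `random_state` it is deterministic, so divergent answers more likely point at the data itself or at encoder category ordering. A toy check (not the exercise data):

```python
from sklearn.model_selection import train_test_split

# With a fixed random_state, the split is reproducible: running it twice
# (or on two different machines) on identical data yields identical results.
data = list(range(10))
split_a = train_test_split(data, test_size=0.2, random_state=43)
split_b = train_test_split(data, test_size=0.2, random_state=43)
print(split_a == split_b)  # True
print(len(split_a[1]))     # 2 items in the test set
```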
Users
Students at grit:lab, Åland
Severity
(❗️minor)
Type
(🗂️ documentation)
To Reproduce
Steps to reproduce the behavior:
Jupyter Lab script:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Exercise 5: Categorical variables
print("\nExercise 5: Categorical variables")

# Preliminary steps
# Define column names based on the attribute information from the
# breast_cancer_readme.txt file
column_names = ['age', 'menopause', 'tumor-size', 'inv-nodes', 'node-caps',
                'deg-malig', 'breast', 'breast-quad', 'irradiat', 'Class']

# Load the CSV file, replacing '?' with NaN for easier handling of missing values
df = pd.read_csv('breast-cancer.csv', header=None, names=column_names, na_values='?')

# Initial look at the data
print("\nINITIAL LOOK AT DATA:\n")
print(df.head())

# Only drop rows with NaN values
df = df.dropna()

# Split the data into features (X) and target (y)
X = df.drop(columns=['Class'])
y = df['Class']  # Assuming it might be needed later for modelling

# Split the features and target into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)

# Question 1: Count the number of unique values per feature in the train set
print("\n\nQuestion 1\n")
unique_counts = X_train.nunique()
print(unique_counts)

# Question 2: One Hot Encoding for nominal features
print("\n\nQuestion 2")

# Updated assumptions for encoding based on attribute information
nominal_features = ['node-caps', 'breast', 'breast-quad', 'irradiat']
ohe = OneHotEncoder(sparse_output=False)
ohe.fit(X_train[nominal_features])

# Transform the test set with the One Hot Encoder
X_test_ohe = ohe.transform(X_test[nominal_features])

# Display part of the transformed test set
print("\n#First 10 rows:\n")
print(X_test_ohe[:10])

# Question 3: Create one Ordinal encoder for all ordinal features
print("\nQuestion 3")

# Specify the order for ordinal features as provided in the .txt file
ordinal_features_and_categories = [
    ('menopause', ['lt40', 'premeno', 'ge40']),
    ('age', ['10-19', '20-29', '30-39', '40-49', '50-59',
             '60-69', '70-79', '80-89', '90-99']),
    ('tumor-size', ['0-4', '5-9', '10-14', '15-19', '20-24', '25-29',
                    '30-34', '35-39', '40-44', '45-49', '50-54', '55-59']),
    ('inv-nodes', ['0-2', '3-5', '6-8', '9-11', '12-14', '15-17', '18-20',
                   '21-23', '24-26', '27-29', '30-32', '33-35', '36-39']),
    ('deg-malig', [1, 2, 3]),
]

# Separate the feature names and their categories for the OrdinalEncoder
ordinal_features, categories = zip(*ordinal_features_and_categories)
ordinal_features = list(ordinal_features)  # Ensure this is a list

# Initialise the OrdinalEncoder with the specified categories
oe = OrdinalEncoder(categories=[category for _, category in ordinal_features_and_categories])

# Fit the encoder on the relevant columns of the training set
oe.fit(X_train[ordinal_features])

# Transform the same columns in the test set
X_test_transformed = oe.transform(X_test[ordinal_features])
print("\nTransformed ordinal features in the test set:\n")
print(X_test_transformed[:10])

# Question 4: Combine both encoders with make_column_transformer
print("\nQuestion 4")

# Define the column transformer
column_transformer = make_column_transformer(
    (OneHotEncoder(), nominal_features),
    (OrdinalEncoder(categories=[category for _, category in ordinal_features_and_categories]),
     ordinal_features),
)

# Fit on the train set and transform the test set
column_transformer.fit(X_train)
X_test_transformed = column_transformer.transform(X_test)

# Convert the transformed test set to a dense array if it's sparse,
# so the first two rows can be viewed easily
if hasattr(X_test_transformed, "toarray"):
    X_test_transformed = X_test_transformed.toarray()

print("\nFirst 2 rows of the column transformer output, fitted on X_train:\n")
print(X_test_transformed[:2])
```
Workarounds
As this is for a Piscine audit, we are having to explain our reasoning / calculations, whilst highlighting that the supplied audit answers are possibly outdated.
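As part of explaining our reasoning, a simple way to check whether two parties even hold the same dataset is to fingerprint the file before comparing answers. A sketch, demonstrated here on an inline sample row (in practice, read the bytes of `breast-cancer.csv` instead):

```python
import hashlib
import io

import pandas as pd

# Inline stand-in for the real file; replace with
# open('breast-cancer.csv', 'rb').read() in practice.
csv_bytes = b"40-49,premeno,15-19,0-2,yes,3,right,left_up,no,recurrence-events\n"

# Two machines holding byte-identical data will print the same digest.
print("sha256:", hashlib.sha256(csv_bytes).hexdigest())

# A quick structural check: same number of rows and columns.
df = pd.read_csv(io.BytesIO(csv_bytes), header=None)
print("shape:", df.shape)  # (1, 10)
```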
Expected behavior
The answers / results achieved with the script above:
Question 2
First 10 rows:
```
[[1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0.]
 [1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0.]
 [0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 1.]
 [0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 1.]
 [1. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0.]
 [1. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0.]
 [1. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0.]
 [1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0.]
 [1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1.]
 [1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0.]]
```
Question 3
Transformed ordinal features in the test set:
```
[[2. 5. 2. 0. 1.]
 [2. 5. 2. 0. 0.]
 [2. 5. 4. 5. 2.]
 [1. 4. 5. 1. 1.]
 [2. 5. 5. 0. 2.]
 [1. 2. 1. 0. 1.]
 [1. 2. 8. 0. 1.]
 [2. 5. 2. 0. 0.]
 [2. 5. 5. 0. 2.]
 [1. 2. 3. 0. 0.]]
```
Question 4
First 2 rows of column transformer transformed that is fitted on the X_train:
```
[[1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 2. 5. 2. 0. 1.]
 [1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 2. 5. 2. 0. 0.]]
```
Attachments
N/A
Desktop (please complete the following information):
Mac / ARM (as noted in the description above)
Smartphone (please complete the following information):
N/A
Additional context
N/A