Open SomaSapien opened 4 months ago
Hi @SomaSapien, thank you for the feedback and all the details provided. There are some previously opened issues linked to this subject - we're aware of this issue. We are currently trying to update the subject and audit to make this exercise (and the overall project) more consistent.
Hi @nprimo , many thanks for such a speedy response! 😀
Describe the bug
For Exercise 5 in the PIPELINE quest, the supplied answers in the audit deviate significantly from those that I and other students are getting (for questions 2, 3, and 4). We have reviewed / revised our code several times, whilst also inspecting the dataset for possible errors. We suspect that either the dataset has been updated since the audit answers were last reviewed, or that there might be some kind of system architecture dependency baked in somewhere (despite `random_state=43` when splitting into training / test sets). We are working with Mac / ARM system architectures.
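For what it's worth, `train_test_split` itself should not depend on the OS or CPU architecture: with a fixed `random_state` it is deterministic, so divergent answers more likely point at the data itself or at encoder category ordering. A toy check (not the exercise data):

```python
from sklearn.model_selection import train_test_split

# With a fixed random_state, the split is reproducible: running it twice
# (or on two different machines) on identical data yields identical results.
data = list(range(10))
split_a = train_test_split(data, test_size=0.2, random_state=43)
split_b = train_test_split(data, test_size=0.2, random_state=43)
print(split_a == split_b)  # True
print(len(split_a[1]))     # 2 items in the test set
```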
Users
Students at grit:lab, Åland
Severity
(❗️minor)
Type
(🗂️ documentation)
To Reproduce
Steps to reproduce the behavior:
Jupyter Lab script:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Exercise 5: Categorical variables
print("\nExercise 5: Categorical variables")

# Preliminary steps
# Define column names based on the attribute information from the
# breast_cancer_readme.txt file
column_names = ['age', 'menopause', 'tumor-size', 'inv-nodes', 'node-caps',
                'deg-malig', 'breast', 'breast-quad', 'irradiat', 'Class']

# Load the CSV file, replacing '?' with NaN for easier handling of missing values
df = pd.read_csv('breast-cancer.csv', header=None, names=column_names, na_values='?')

# Initial look at the data
print("\nINITIAL LOOK AT DATA:\n")
print(df.head())

# Only drop rows with NaN values
df = df.dropna()

# Split the data into features (X) and target (y)
X = df.drop(columns=['Class'])
y = df['Class']  # Assuming it might be needed later for modelling

# Split the features and target into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=43)

# Question 1: Count the number of unique values per feature in the train set
print("\n\nQuestion 1\n")
unique_counts = X_train.nunique()
print(unique_counts)

# Question 2: One Hot Encoding for nominal features
print("\n\nQuestion 2")

# Updated assumptions for encoding based on attribute information
nominal_features = ['node-caps', 'breast', 'breast-quad', 'irradiat']
ohe = OneHotEncoder(sparse_output=False)
ohe.fit(X_train[nominal_features])

# Transform the test set with the One Hot Encoder
X_test_ohe = ohe.transform(X_test[nominal_features])

# Display part of the transformed test set
print("\n#First 10 rows:\n")
print(X_test_ohe[:10])

# Question 3: Create one Ordinal encoder for all ordinal features
print("\nQuestion 3")

# Specify the order for ordinal features as provided in the .txt file
ordinal_features_and_categories = [
    ('menopause', ['lt40', 'premeno', 'ge40']),
    ('age', ['10-19', '20-29', '30-39', '40-49', '50-59',
             '60-69', '70-79', '80-89', '90-99']),
    ('tumor-size', ['0-4', '5-9', '10-14', '15-19', '20-24', '25-29',
                    '30-34', '35-39', '40-44', '45-49', '50-54', '55-59']),
    ('inv-nodes', ['0-2', '3-5', '6-8', '9-11', '12-14', '15-17', '18-20',
                   '21-23', '24-26', '27-29', '30-32', '33-35', '36-39']),
    ('deg-malig', [1, 2, 3]),
]

# Separate the feature names and their categories for the OrdinalEncoder
ordinal_features, categories = zip(*ordinal_features_and_categories)
ordinal_features = list(ordinal_features)  # Ensure this is a list

# Initialise the OrdinalEncoder with the specified categories
oe = OrdinalEncoder(categories=[category for _, category in ordinal_features_and_categories])

# Fit the encoder on the relevant columns of the training set
oe.fit(X_train[ordinal_features])

# Transform the same columns in the test set
X_test_transformed = oe.transform(X_test[ordinal_features])
print("\nTransformed ordinal features in the test set:\n")
print(X_test_transformed[:10])

# Question 4: Combine both encoders with make_column_transformer
print("\nQuestion 4")

# Define the column transformer
column_transformer = make_column_transformer(
    (OneHotEncoder(), nominal_features),
    (OrdinalEncoder(categories=[category for _, category in ordinal_features_and_categories]),
     ordinal_features),
)

# Fit on the train set and transform the test set
column_transformer.fit(X_train)
X_test_transformed = column_transformer.transform(X_test)

# Convert the transformed test set to a dense array if it's sparse,
# so the first two rows can be viewed easily
if hasattr(X_test_transformed, "toarray"):
    X_test_transformed = X_test_transformed.toarray()

print("\nFirst 2 rows of the column transformer output, fitted on X_train:\n")
print(X_test_transformed[:2])
```
Workarounds
As this is for a Piscine audit, we are having to explain our reasoning / calculations, whilst highlighting that the supplied audit answers are possibly outdated.
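As part of explaining our reasoning, a simple way to check whether two parties even hold the same dataset is to fingerprint the file before comparing answers. A sketch, demonstrated here on an inline sample row (in practice, read the bytes of `breast-cancer.csv` instead):

```python
import hashlib
import io

import pandas as pd

# Inline stand-in for the real file; replace with
# open('breast-cancer.csv', 'rb').read() in practice.
csv_bytes = b"40-49,premeno,15-19,0-2,yes,3,right,left_up,no,recurrence-events\n"

# Two machines holding byte-identical data will print the same digest.
print("sha256:", hashlib.sha256(csv_bytes).hexdigest())

# A quick structural check: same number of rows and columns.
df = pd.read_csv(io.BytesIO(csv_bytes), header=None)
print("shape:", df.shape)  # (1, 10)
```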
Expected behavior
The answers / results achieved with the script above:
Question 2
First 10 rows:
```
[[1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0.]
 [1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0.]
 [0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 1.]
 [0. 1. 1. 0. 0. 1. 0. 0. 0. 0. 1.]
 [1. 0. 1. 0. 0. 0. 1. 0. 0. 1. 0.]
 [1. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0.]
 [1. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0.]
 [1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0.]
 [1. 0. 1. 0. 0. 0. 0. 1. 0. 0. 1.]
 [1. 0. 0. 1. 0. 1. 0. 0. 0. 1. 0.]]
```
Question 3
Transformed ordinal features in the test set:
```
[[2. 5. 2. 0. 1.]
 [2. 5. 2. 0. 0.]
 [2. 5. 4. 5. 2.]
 [1. 4. 5. 1. 1.]
 [2. 5. 5. 0. 2.]
 [1. 2. 1. 0. 1.]
 [1. 2. 8. 0. 1.]
 [2. 5. 2. 0. 0.]
 [2. 5. 5. 0. 2.]
 [1. 2. 3. 0. 0.]]
```
Question 4
First 2 rows of column transformer transformed that is fitted on the X_train:
```
[[1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 2. 5. 2. 0. 1.]
 [1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 0. 2. 5. 2. 0. 0.]]
```
Attachments
N/A
Desktop (please complete the following information):
Mac / ARM (as noted in the description above)
Smartphone (please complete the following information):
N/A
Additional context
N/A