GenoML / genoml2

GenoML (genoml2) is an open source Python package. It is an automated machine learning (autoML) platform for genomics data.
Apache License 2.0

Interesting people put non-variant columns into ML experiments??? #5

Closed mikeDTI closed 4 years ago

mikeDTI commented 4 years ago


System information

Describe the current behavior: Munging crashes when Z-scoring a feature that has no variance.

Describe the expected behavior: Interesting people put non-variant columns into ML experiments??? Mary is going to put an extra condition at line 210 of munge that skips columns with a standard deviation of 0 and leaves a snarky message ;-)

Code to reproduce the issue: AMP PD transcriptomics

Other info / logs: Ask MM
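
To illustrate the failure mode described above, here is a minimal sketch (not the GenoML code) of what happens when a constant column is Z-scored with plain pandas arithmetic: the standard deviation is 0, so the division yields NaN for every row, which downstream steps may not tolerate. The column names and values are invented for illustration.

```python
import pandas as pd

# Toy data: `gene_b` is constant, so its standard deviation is 0.
# Column names and values are hypothetical.
df = pd.DataFrame({
    "gene_a": [1.0, 2.0, 3.0],
    "gene_b": [5.0, 5.0, 5.0],
})

# Plain Z-scoring: subtract the column mean, divide by the column standard deviation
z_scored = (df - df.mean()) / df.std()

# The constant column becomes all NaN because 0 / 0 is undefined
print(z_scored)
```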

m-makarious commented 4 years ago

The Code:

            # Remove any columns with a standard deviation of zero
            # (excerpt from munging: `np` is numpy; `addit_df` and `cols` are defined earlier)
            print("Removing any columns that have a standard deviation of 0 prior to Z-scaling...")

            # Compute the per-column standard deviation once.
            # numeric_only=True skips non-numeric columns (assumed to include 'ID').
            addit_std = addit_df.std(numeric_only=True)
            zero_std_cols = addit_std[addit_std == 0.0].index.values

            if len(zero_std_cols) > 0:
                print("")
                print("Looks like there's at least one column with a standard deviation of 0. Let's remove that for you...")
                addit_keep = addit_df.drop(zero_std_cols, axis=1)
                addit_keep_list = list(addit_keep.columns.values)

                addit_df = addit_df[addit_keep_list]

                # 'ID' is not a feature, so exclude it before comparing against `cols`
                addit_keep_list.remove('ID')
                removed_list = np.setdiff1d(cols, addit_keep_list)
                for removed_column in removed_list:
                    print("")
                    print(f"The column {removed_column} was removed")
                    print("")

                cols = addit_keep_list

Description: Munging has been updated to check whether any columns have a standard deviation of 0.

Because such columns carry no useful information and cause issues downstream, any column with a standard deviation of 0 is now removed, and the user is informed via printed messages which columns were removed.
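
The same check can be sketched as a standalone helper. This is an illustrative version, not the code in munging: the function name `drop_zero_std_columns` and the sample data are hypothetical, and it assumes the ID column is non-numeric so that `numeric_only=True` skips it.

```python
import pandas as pd


def drop_zero_std_columns(df):
    """Return (filtered_df, removed_columns) with zero-variance numeric columns dropped."""
    # numeric_only=True ignores non-numeric columns such as a string ID
    stds = df.std(numeric_only=True)
    removed = list(stds[stds == 0.0].index)
    return df.drop(columns=removed), removed


# Hypothetical sample data: `gene_b` is constant across all samples
sample = pd.DataFrame({
    "ID": ["s1", "s2", "s3"],
    "gene_a": [1.0, 2.0, 3.0],
    "gene_b": [5.0, 5.0, 5.0],
})

filtered, removed = drop_zero_std_columns(sample)
for column in removed:
    print(f"The column {column} was removed")
```

Returning the list of removed names alongside the filtered frame keeps the reporting separate from the filtering, so the caller decides how to inform the user.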

m-makarious commented 4 years ago

Moved issue to new repo for completeness and consistency