GenoML / genoml2

GenoML (genoml2) is an open-source Python package. It is an automated machine learning (autoML) platform for genomics data.
Apache License 2.0

GenoML Munging Failure: TypeError when Processing Sample IDs during Imputation #46

Open ensiferum877 opened 2 weeks ago


Bug Description

During the discrete supervised munging step using GenoML's example data, the pipeline fails with a TypeError. The error occurs when trying to apply median imputation to what appears to be the sample ID column, which should be handled as identifiers rather than numeric data.

Environment Setup

Using a custom Dockerfile based on jupyter/datascience-notebook:

Build and run the Docker container

docker build -t my-datascience-notebook .
docker run -it my-datascience-notebook

Inside the container, clone GenoML repository

git clone https://github.com/GenoML/genoml2.git
cd genoml2

Modified requirements

Had to remove the version restriction for xgboost in requirements.txt due to Python version compatibility issues.
Original requirement: xgboost==2.0.3
Modified to: xgboost

Install requirements

pip install -r requirements.txt

Using GenoML's example data

genoml discrete supervised munge \
    --prefix outputs \
    --geno examples/discrete/training \
    --pheno examples/discrete/training_pheno.csv \
    --addit examples/discrete/training_addit.csv

Error message

Pipeline successfully completes PLINK dependency check
Successfully exports genotype data
Completes SNP pruning (12 of 500 variants removed)
Fails during the final data munging step with TypeError

TypeError: Cannot convert [['sample81' 'sample158' 'sample216' ...]] to numeric

The error occurs in the following sequence:

File ".../genoml/cli/munging.py", line 75
    df = munger.plink_inputs()
File ".../genoml/preprocessing/munging.py", line 213
    raw_df = raw_df.fillna(raw_df.median())
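The failing call can probably be reproduced outside GenoML with a small pandas frame. The sketch below is an assumption about the mechanism, not GenoML's actual data flow: the frame, its column names (`ID`, `snp1`) and the helper `median_raises_type_error` are hypothetical. On recent pandas releases, `DataFrame.median()` no longer silently skips non-numeric columns, so a string ID column is expected to trigger a TypeError; older releases may drop the column instead, which is why the result is reported rather than asserted.

```python
import pandas as pd

# Hypothetical toy frame: a string ID column plus one numeric column with a gap.
toy_df = pd.DataFrame({
    "ID": pd.Series(["sample81", "sample158", "sample216"], dtype=object),
    "snp1": [0.0, None, 2.0],
})


def median_raises_type_error(df):
    """Return True if df.median() raises TypeError, False otherwise."""
    try:
        df.median()
    except TypeError:
        return True
    return False


# Whether this is True depends on the installed pandas version.
raised = median_raises_type_error(toy_df)
print("pandas", pd.__version__, "- median() raised TypeError:", raised)
```

If this prints `True` in the container, the installed pandas version is a likely contributor to the failure at line 213.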

The pipeline should recognize sample ID columns as identifiers and exclude them from numeric operations like median imputation. This appears to be an issue with column handling during the munging process rather than with the input data format.
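One way to get this behavior is to compute medians over numeric columns only and let `fillna` leave everything else alone. The following is a minimal sketch under stated assumptions, not a patch against GenoML: the function name `impute_numeric_medians` and the column names are hypothetical, and the real `raw_df` may need extra handling (for example, ID columns that happen to be stored as integers).

```python
import pandas as pd


def impute_numeric_medians(df):
    """Fill gaps in numeric columns with that column's median.

    Non-numeric columns (such as string sample IDs) are left untouched,
    because they never appear in the medians Series passed to fillna.
    """
    medians = df.median(numeric_only=True)
    return df.fillna(medians)


# Hypothetical example data; the column names are illustrative only.
example_df = pd.DataFrame({
    "ID": pd.Series(["sample81", "sample158", "sample216"], dtype=object),
    "snp1": [0.0, None, 2.0],
    "snp2": [1.0, 1.0, None],
})

imputed_df = impute_numeric_medians(example_df)
print(imputed_df)
```

Passing `numeric_only=True` makes the exclusion explicit instead of relying on version-specific defaults of `DataFrame.median()`.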