Bug Description
During the discrete supervised munging step using GenoML's example data, the pipeline fails with a TypeError. The error occurs when trying to apply median imputation to what appears to be the sample ID column, which should be handled as identifiers rather than numeric data.
Environment Setup
Using custom Dockerfile based on jupyter/datascience-notebook:
Build and run the Docker container
docker build -t my-datascience-notebook .
docker run -it my-datascience-notebook
Inside the container, clone GenoML repository
git clone https://github.com/GenoML/genoml2.git
cd genoml2
Modified requirements
Had to remove version restriction for xgboost in requirements.txt due to Python version compatibility issues
Original requirement: xgboost==2.0.3
Modified to: xgboost
Install requirements
pip install -r requirements.txt
Using GenoML's example data
genoml discrete supervised munge \
  --prefix outputs \
  --geno examples/discrete/training \
  --pheno examples/discrete/training_pheno.csv \
  --addit examples/discrete/training_addit.csv
Error message
Pipeline successfully completes PLINK dependency check
Successfully exports genotype data
Completes SNP pruning (12 of 500 variants removed)
Fails during the final data munging step with TypeError

TypeError: Cannot convert [['sample81' 'sample158' 'sample216' ...]] to numeric

The error occurs in the following sequence:
  File ".../genoml/cli/munging.py", line 75
    df = munger.plink_inputs()
  File ".../genoml/preprocessing/munging.py", line 213
    raw_df = raw_df.fillna(raw_df.median())
The pipeline should recognize sample ID columns as identifiers and exclude them from numeric operations like median imputation. This appears to be an issue with column handling during the munging process rather than with the input data format.
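For illustration, here is a minimal sketch of the kind of column handling I would expect. It is not GenoML's actual code, and the column names (ID, snp1, snp2) are made up for the example. It assumes the failure comes from calling median() on a frame that still contains the string ID column. In pandas 2.0 and later, DataFrame.median() defaults to numeric_only=False, so non-numeric columns are no longer dropped silently. That may be what the unpinned environment hits, though I have not confirmed the installed pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the merged genotype table; the real frame is
# built inside GenoML's munging code and has different column names.
raw_df = pd.DataFrame(
    {
        "ID": ["sample81", "sample158", "sample216"],
        "snp1": [0.0, np.nan, 2.0],
        "snp2": [1.0, 1.0, np.nan],
    }
)

# Compute medians over numeric columns only, so the string ID column
# never reaches the numeric reduction.
medians = raw_df.median(numeric_only=True)

# fillna with a Series fills each column by matching label; columns that
# are absent from `medians` (here, ID) are left untouched.
imputed_df = raw_df.fillna(medians)

print(imputed_df)
```

Setting the ID column as the index before imputing would be another way to keep it out of numeric operations; which fits better depends on how the rest of the munging code uses that column.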