Training the ML model: [Failed]

eugeneychsiao commented 5 years ago

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux
GenoML installed from (source or binary): installed through pip
Python version: 3.7.4

Describe the current behavior

I've tried running the model on three separate datasets (including the provided sample dataset), and all three runs failed with this error:

The main failure points are simply hardware related at this phase of work. Is your data way too big for your computer? Also, the code implemented here occasionally bugs if you have too many zero variance predictors in the dataset, but you probably already removed those before starting your analyses, right?

All dependencies have been installed successfully.
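The failure message quoted above mentions zero-variance predictors. Here is a minimal sketch of one way to look for them in a PLINK `--recode A` output file (a `.raw` file) before training, using only the standard library. The function name `zero_variance_columns` and the sample rows are made up for illustration. This is not part of GenoML, and it assumes the standard `.raw` layout of six leading columns (FID, IID, PAT, MAT, SEX, PHENOTYPE) with `NA` marking a missing genotype.

```python
from typing import List

# Assumption: PLINK's --recode A writes these six columns before the variants:
# FID IID PAT MAT SEX PHENOTYPE
RAW_META_COLUMNS = 6


def zero_variance_columns(raw_text: str) -> List[str]:
    """Return variant columns with at most one distinct non-missing value."""
    lines = [line.split() for line in raw_text.splitlines() if line.strip()]
    if not lines:
        return []
    header, rows = lines[0], lines[1:]
    flagged = []
    for idx in range(RAW_META_COLUMNS, len(header)):
        # Collect the distinct genotype codes, ignoring missing calls ("NA").
        values = {row[idx] for row in rows if idx < len(row) and row[idx] != "NA"}
        if len(values) <= 1:
            flagged.append(header[idx])
    return flagged


if __name__ == "__main__":
    # Made-up sample in .raw layout: rs1_A is constant, rs3_T is constant
    # once the missing call is ignored, rs2_G varies.
    sample = (
        "FID IID PAT MAT SEX PHENOTYPE rs1_A rs2_G rs3_T\n"
        "f1 i1 0 0 1 2 0 1 2\n"
        "f2 i2 0 0 2 1 0 2 NA\n"
        "f3 i3 0 0 1 2 0 0 2\n"
    )
    print(zero_variance_columns(sample))
```

To run it on real data, read the `.raw` file into a string and pass it in. An empty result would suggest that zero-variance predictors are not the cause of the failure.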

mikeDTI commented 5 years ago

Hi. Please include your command. A verbose log would be helpful as well (-v or -vvv options). Also, as a note, we are currently refactoring toward the newer full Python version, which should be out in Nov '19. Thanks.

eugeneychsiao commented 5 years ago

I ran the most basic model training command, i.e.:

genoml-train --geno-prefix=./training --pheno-file=./training.pheno --model-file=./testModel

Data pruning works on its own (genome-cli data-prune); the problem occurs while training the model. Here is the output with the verbose log:

====> Automated Machine Learning for Genomic
========> Dependency Check
============> Checking PRSice
============> Checking PRSice: [Done]

============> Checking GCTA
============> Checking GCTA: [Done]

============> Checking plink
============> Checking plink: [Done]

============> Checking R
================> Checking R Packages
================> Checking R Packages: [Done]

============> Checking R: [Done]

========> Dependency Check: [Done]

/tmp/tmpkgza42_z
{'--addit-file': None, '--best-model-name': 'best_model', '--cov-file': None, '--cv-reps': '5', '--geno-prefix': './training', '--grid-search': '10', '--gwas-file': None, '--help': False, '--herit': None, '--impute-data': 'median', '--model-dir': None, '--model-file': './testModel', '--n-cores': '1', '--no-tune': False, '--pheno-file': './training.pheno', '--prune-prefix': '/tmp/tmpkgza42_z/model', '--train-speed': 'BOOSTED', '--version': False, '-v': 3}
========> Pruning the SNPs

============> Checking genotype file
Checking the genotype file and calculate stats

============> Checking genotype file: [Done]

============> Checking Input Files
Checking if all the input files are available

Mapping files: 100%|#############################| 3/3 [00:00<00:00, 137.89it/s]
============> Checking Input Files: [Done]

============> Pairwise SNP pruning
Pruning SNPs to a minimal set by removing correlated SNPs within a sliding window. This step speeds up the ML model training and reduces possible bias due to overfitting of the subsequent models.

PLINK v1.90b6.7 64-bit (2 Dec 2018) www.cog-genomics.org/plink/1.9/
(C) 2005-2018 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to /tmp/tmpkgza42_z/model.temp.log.
Options in effect:
--bfile ./training
--indep-pairwise 10000 1 0.1
--out /tmp/tmpkgza42_z/model.temp

354362 MB RAM detected; reserving 177181 MB for main workspace.
500 variants loaded from .bim file.
500 people (331 males, 169 females) loaded from .fam.
500 phenotype values loaded from .fam.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 500 founders and 0 nonfounders present.
Calculating allele frequencies... done.
500 variants and 500 people pass filters and QC.
Among remaining phenotypes, 342 are cases and 158 are controls.
Pruned 2 variants from chromosome 1, leaving 46.
Pruned 4 variants from chromosome 2, leaving 37.
Pruned 3 variants from chromosome 3, leaving 43.
Pruned 5 variants from chromosome 4, leaving 35.
Pruned 1 variant from chromosome 5, leaving 29.
Pruned 2 variants from chromosome 6, leaving 38.
Pruned 0 variants from chromosome 7, leaving 14.
Pruned 1 variant from chromosome 8, leaving 13.
Pruned 1 variant from chromosome 9, leaving 16.
Pruned 2 variants from chromosome 10, leaving 27.
Pruned 3 variants from chromosome 11, leaving 27.
Pruned 3 variants from chromosome 12, leaving 21.
Pruned 0 variants from chromosome 13, leaving 9.
Pruned 3 variants from chromosome 14, leaving 16.
Pruned 0 variants from chromosome 15, leaving 6.
Pruned 0 variants from chromosome 16, leaving 16.
Pruned 0 variants from chromosome 17, leaving 26.
Pruned 2 variants from chromosome 18, leaving 7.
Pruned 0 variants from chromosome 19, leaving 12.
Pruned 0 variants from chromosome 20, leaving 11.
Pruned 0 variants from chromosome 21, leaving 6.
Pruned 1 variant from chromosome 22, leaving 12.
Pruning complete. 33 of 500 variants removed.
Marker lists written to /tmp/tmpkgza42_z/model.temp.prune.in and /tmp/tmpkgza42_z/model.temp.prune.out .

PLINK v1.90b6.7 64-bit (2 Dec 2018) www.cog-genomics.org/plink/1.9/
(C) 2005-2018 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to /tmp/tmpkgza42_z/model.reduced_genos.log.
Options in effect:
--bfile ./training
--extract /tmp/tmpkgza42_z/model.temp.prune.in
--out /tmp/tmpkgza42_z/model.reduced_genos
--recode A

354362 MB RAM detected; reserving 177181 MB for main workspace.
500 variants loaded from .bim file.
500 people (331 males, 169 females) loaded from .fam.
500 phenotype values loaded from .fam.
--extract: 467 variants remaining.
Using 1 thread (no multithreaded calculations invoked).
Before main variant filters, 500 founders and 0 nonfounders present.
Calculating allele frequencies... done.
467 variants and 500 people pass filters and QC.
Among remaining phenotypes, 342 are cases and 158 are controls.
--recode A to /tmp/tmpkgza42_z/model.reduced_genos.raw ... done.
============> Pairwise SNP pruning: [Done]

============> Merging datasets for model training
Merging all datasets with individual level data. Only individuals specified in all files will be retained.

[1] "/usr/lib/R/bin/exec/R"
[2] "--slave"
[3] "--no-restore"
[4] "--file=/usr/local/lib/python3.6/dist-packages/genoml/misc/R/mergeForGenoML.R"
[5] "--args"
[6] "./training"
[7] "./training.pheno"
[8] "NA"
[9] "NA"
[10] "/tmp/tmpkgza42_z/model"
[1] "/tmp/tmpkgza42_z/model"
============> Merging datasets for model training: [Done]

========> Pruning the SNPs: [Done]

========> Training the ML model
Training and selecting the best ML model based on the best mean cross-validation performance

========> Training the ML model: [Failed]
The main failure points are simply hardware related at this phase of work. Is your data way too big for your computer? Also, the code implemented here occasionally bugs if you have too many zero variance predictors in the dataset, but you probably already removed those before starting your analyses, right?

Traceback (most recent call last):
IndexError: list index out of range
====> Automated Machine Learning for Genomic: [Failed]

Traceback (most recent call last):
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

A few things to note: I'm using cffi version 1.12.3, since the newest version, 1.13.1, causes an error with rpy2. The download link for GCTA version 1.91.7 is no longer available, so I'm using version 1.92.4 instead.

Is it possible that these are causing the errors?
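For readers puzzling over the log: the line "During handling of the above exception, another exception occurred:" is what Python prints when a second exception is raised inside an `except` block that is handling the first. The pasted traceback is cut off, so the actual failing line is unknown. The sketch below only reproduces the general pattern with hypothetical function names (`train_model`, `run_step`, `format_failure`) and is not GenoML's code.

```python
import traceback


def train_model(columns):
    # Hypothetical stand-in for a training step: indexing an empty list
    # raises "IndexError: list index out of range".
    return columns[0]


def run_step(columns):
    try:
        return train_model(columns)
    except IndexError:
        # Raising inside an except block chains the new error to the original,
        # which is what produces the "During handling of..." line.
        raise RuntimeError("Training the ML model: [Failed]")


def format_failure(columns):
    """Return the full chained traceback as text, or "" if nothing failed."""
    try:
        run_step(columns)
    except RuntimeError as err:
        return "".join(
            traceback.format_exception(type(err), err, err.__traceback__)
        )
    return ""


if __name__ == "__main__":
    print(format_failure([]))
```

The practical point is that the first traceback in such a chain (here the `IndexError`) usually names the real cause, so posting that part of the log in full is the most useful thing for debugging.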

mikeDTI commented 5 years ago

Hi Eugene, apparently it is an issue with an update to the training data. We are about to pull the R version and replace it with a much improved Python version. I can send you a draft of those scripts if you'd like, just let me know. That way your work will be compatible with next month's supported release (LTS). Thanks!

GenoML / genoml

Training the ML model: [Failed] #12