I am running the seed_gene_discovery pipeline on data that was processed beforehand by the preprocessing and annotations pipelines. I created phenotypes.parquet with the samples in the same order as in the genotypes.h5 file.
I get an error in rule association_dataset related to sample naming. I think the issue is that string sample names are not supported, because the code casts sample names to int:
Traceback (most recent call last):
File "PATH/miniconda3/envs/deeprvat/bin/seed_gene_pipeline", line 33, in <module>
sys.exit(load_entry_point('deeprvat', 'console_scripts', 'seed_gene_pipeline')())
File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "PATH/deeprvat/seed_gene_discovery/seed_gene_discovery.py", line 592, in make_dataset
_, ds = make_dataset_(
File "PATH/deeprvat/seed_gene_discovery/seed_gene_discovery.py", line 543, in make_dataset_
dataset = DenseGTDataset(
File "PATH/deeprvat/data/dense_gt.py", line 160, in __init__
self.setup_phenotypes(
File "PATH/deeprvat/data/dense_gt.py", line 365, in setup_phenotypes
samples_gt.astype(int), self.samples.astype(int)
ValueError: invalid literal for int() with base 10: 'Samplename_str'
I don't think it would be viable to simply state in the docs that all sample names must be converted to int before running the pipeline. If ints are necessary during processing, maybe a mapping from sample names to ints in the affected code sections would be a possible solution?
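To illustrate what I mean by a mapping, here is a minimal sketch. It is not DeepRVAT code: `shared_int_codes` is a hypothetical helper name, and I am assuming both sample ID collections are 1-D and that the same sample is spelled identically in both files. It assigns every distinct sample name one integer code shared across both arrays, so any later comparison could work on ints without requiring int-like names.

```python
import numpy as np


def shared_int_codes(samples_gt, samples_pheno):
    """Map arbitrary sample IDs to integer codes shared by both inputs.

    Hypothetical helper, not part of DeepRVAT. Equal IDs get equal codes,
    so integer comparisons on the codes match string comparisons on the IDs.
    """
    gt = np.asarray(samples_gt).astype(str)
    pheno = np.asarray(samples_pheno).astype(str)
    # np.unique sorts the distinct IDs; return_inverse gives, for every
    # element, its position in that sorted array, which serves as its code.
    _, codes = np.unique(np.concatenate([gt, pheno]), return_inverse=True)
    return codes[: len(gt)], codes[len(gt):]


# Example with string IDs that int() would reject
gt_codes, pheno_codes = shared_int_codes(
    ["Sample_b", "Sample_a", "10"],
    ["Sample_a", "Sample_b", "10"],
)
```

The codes are only meaningful within one call, so they would have to be computed at the point where both sample lists are available rather than stored in the files.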
Looking forward to your opinions and suggestions!