PMBio / deeprvat

Str sample names in seed gene discovery pipeline lead to errors. #70

Closed by Jonas-B-Frank 1 month ago

Jonas-B-Frank commented 2 months ago

I am running the seed_gene_discovery pipeline on data that was processed beforehand by the preprocessing and annotations pipelines. I created phenotypes.parquet with the samples in the same order as in the genotypes.h5 file.

I get an error in the rule association_dataset related to the naming conventions for samples. I think the issue is that str values are not accepted as sample names.

Traceback (most recent call last):
  File "PATH/miniconda3/envs/deeprvat/bin/seed_gene_pipeline", line 33, in <module>
    sys.exit(load_entry_point('deeprvat', 'console_scripts', 'seed_gene_pipeline')())
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "PATH/miniconda3/envs/deeprvat/lib/python3.8/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "PATH/deeprvat/seed_gene_discovery/seed_gene_discovery.py", line 592, in make_dataset
    _, ds = make_dataset_(
  File "PATH/deeprvat/seed_gene_discovery/seed_gene_discovery.py", line 543, in make_dataset_
    dataset = DenseGTDataset(
  File "PATH/deeprvat/data/dense_gt.py", line 160, in __init__
    self.setup_phenotypes(
  File "PATH/deeprvat/data/dense_gt.py", line 365, in setup_phenotypes
    samples_gt.astype(int), self.samples.astype(int)
ValueError: invalid literal for int() with base 10: 'Samplename_str'
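
The failing cast can be reproduced outside the pipeline. This is a minimal sketch, assuming the sample names end up in a NumPy string array; the names below are made up and only stand in for the real IDs from genotypes.h5 and phenotypes.parquet.

```python
import numpy as np

# Hypothetical sample IDs, not taken from any real dataset.
numeric_like = np.array(["1001", "1002"])
string_names = np.array(["Samplename_str", "Samplename_str2"])

# Numeric-looking strings cast to int without problems.
as_int = numeric_like.astype(int)

# Non-numeric names raise the same kind of ValueError as in the traceback.
raised = False
try:
    string_names.astype(int)
except ValueError as err:
    raised = True
    print(f"ValueError: {err}")
```

So the cast only works when every sample name happens to look like an integer.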

I don't think it would be viable to add a note to the docs saying that all sample names must be converted to int before running the pipeline. If ints are necessary during processing, maybe a mapping in the affected code sections would be a possible solution? Looking forward to your opinions and suggestions!
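
To make the mapping idea concrete, here is a rough sketch. I am assuming the int cast is only there to line up the phenotype samples with the genotype samples; I have not checked what setup_phenotypes actually needs. The helper name `sample_index_map` is hypothetical and not part of deeprvat.

```python
import numpy as np


def sample_index_map(samples_gt, samples_pheno):
    """For each phenotype sample, return its position in the genotype sample list.

    Names are compared as strings, so both "1001" and "Samplename_str" work.
    """
    samples_gt = np.asarray(samples_gt).astype(str)
    samples_pheno = np.asarray(samples_pheno).astype(str)

    # Map each genotype sample name to its row index.
    index_of = {name: i for i, name in enumerate(samples_gt)}

    missing = [name for name in samples_pheno if name not in index_of]
    if missing:
        raise KeyError(f"samples missing from genotypes: {missing}")

    return np.array([index_of[name] for name in samples_pheno], dtype=int)


# Made-up sample names, mixing str-like and int-like IDs.
idx = sample_index_map(["Samplename_a", "Samplename_b", "1003"],
                       ["1003", "Samplename_a"])
print(idx.tolist())
```

The integer positions could then be used internally wherever ints are required, while the original str names stay untouched in the input files.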