akikuno / DAJIN2

🔬 Genotyping tool for genome-edited samples, utilizing nanopore sequencer target sequencing
MIT License

An error occurs if there is an "_" in the FASTA header name. #39

Closed akikuno closed 6 months ago

akikuno commented 6 months ago

Describe the bug

If an underscore _ (e.g., >1_hoge) is included in the header name of the input FASTA file for DAJIN2, the following error occurs:

2024-05-29 17:42:33, INFO, 🏃 Start running DAJIN2 version 0.4.6
2024-05-29 17:42:33, INFO, example_single/control is now processing...
2024-05-29 17:42:33, INFO, Preprocess example_single/control...
2024-05-29 17:42:57, INFO, Output BAM files of example_single/control...
2024-05-29 17:42:57, INFO, 🍵 example_single/control is finished!
2024-05-29 17:42:57, INFO, example_single/sample is now processing...
2024-05-29 17:42:57, INFO, Preprocess example_single/sample...
2024-05-29 17:43:30, INFO, Classify example_single/sample...
2024-05-29 17:43:33, INFO, Clustering example_single/sample...
2024-05-29 17:43:41, INFO, Consensus calling of example_single/sample...
2024-05-29 17:43:41, ERROR, Catch an Exception. Traceback:
Traceback (most recent call last):
  File "/home/kuno/miniconda/envs/env-dajin2/bin/DAJIN2", line 10, in <module>
    sys.exit(execute())
  File "/home/kuno/miniconda/envs/env-dajin2/lib/python3.10/site-packages/DAJIN2/main.py", line 236, in execute
    execute_single_mode(arguments)
  File "/home/kuno/miniconda/envs/env-dajin2/lib/python3.10/site-packages/DAJIN2/main.py", line 46, in execute_single_mode
    core.execute_sample(arguments)
  File "/home/kuno/miniconda/envs/env-dajin2/lib/python3.10/site-packages/DAJIN2/core/core.py", line 187, in execute_sample
    consensus.cache_mutation_loci(ARGS, clust_subset_sample)
  File "/home/kuno/miniconda/envs/env-dajin2/lib/python3.10/site-packages/DAJIN2/core/consensus/mutation_extractor.py", line 101, in cache_mutation_loci
    cache_normalized_indels(ARGS, path_midsv_sample)
  File "/home/kuno/miniconda/envs/env-dajin2/lib/python3.10/site-packages/DAJIN2/core/consensus/mutation_extractor.py", line 73, in cache_normalized_indels
    sequence = ARGS.fasta_alleles[allele]
KeyError: '1'

Solutions

The error is caused by the frequent use of split("_") on file paths and read names, which does not account for underscores in the header name. DAJIN2 appends various annotations to the header name using _ as the delimiter, so if the user-specified FASTA header name contains _, the fields shift and the split returns the wrong values. In the example above, 1_hoge is truncated to 1, which is not a key of ARGS.fasta_alleles, hence the KeyError.
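The failure can be reproduced in isolation with a minimal sketch. The file names, the label "001", and the sequences below are made up for illustration; only the split pattern itself is taken from mutation_extractor.py.

```python
from pathlib import Path

# Hypothetical allele table keyed by FASTA header name (sequences are made up).
fasta_alleles = {"control": "ACGT", "1_hoge": "ACGA"}


def parse_stem(path: Path) -> tuple[str, str]:
    # Same pattern as in the traceback: the first "_"-separated field
    # is assumed to be the allele name.
    allele, label, *_ = path.stem.split("_")
    return allele, label


# Works when the allele name has no underscore.
print(parse_stem(Path("control_001.jsonl")))

# Breaks when it does: "1_hoge" is cut at its own underscore.
allele, label = parse_stem(Path("1_hoge_001.jsonl"))
print(allele, label)
print(allele in fasta_alleles)  # the lookup that raises KeyError in DAJIN2
```

With the assumed file name 1_hoge_001.jsonl, the allele comes back as "1" and the label as "hoge", which matches the KeyError: '1' in the log.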

To handle header names that contain "_", the FASTA header name should be removed from the string before the remainder is split on "_".

The following files contain hard-coded instances of this pattern, which need to be corrected.

DAJIN2/src/DAJIN2/core/consensus/mutation_extractor.py:    allele, label, *_ = path_midsv_sample.stem.split("_")
DAJIN2/src/DAJIN2/core/consensus/mutation_extractor.py:        allele, label, *_ = path_indels_normalized_sample.stem.split("_")
DAJIN2/src/DAJIN2/core/consensus/similarity_searcher.py:    allele, label, *_ = Path(path_midsv_sample).stem.split("_")
DAJIN2/src/DAJIN2/core/preprocess/midsv_caller.py:        preset = path.stem.split("_")[0]
DAJIN2/src/DAJIN2/core/preprocess/midsv_caller.py:        preset = path.stem.split("_")[0]
DAJIN2/src/DAJIN2/core/report/sequence_exporter.py:    allele = header.split("_")[1]
DAJIN2/src/DAJIN2/utils/report_generator.py:        label, allele, type_, *_ = reads["NAME"].split("_")
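One way to apply the "remove the header name before splitting" idea to the allele-first call sites is sketched below. This is an illustration, not the patch that was merged: the helper name split_allele_and_label is hypothetical, and it assumes the stem has the form "<allele>_<label>_..." and that the set of valid allele names (for example the keys of ARGS.fasta_alleles) is available.

```python
from collections.abc import Iterable


def split_allele_and_label(stem: str, allele_names: Iterable[str]) -> tuple[str, str]:
    """Strip a known allele name from the front of `stem`, then split the rest."""
    # Try longer names first so "1_hoge" wins over a shorter allele such as "1".
    for allele in sorted(allele_names, key=len, reverse=True):
        prefix = allele + "_"
        if stem.startswith(prefix):
            # Only the remainder is split, so underscores in the allele are safe.
            label, *_ = stem[len(prefix):].split("_")
            return allele, label
    raise ValueError(f"No known allele name is a prefix of {stem!r}")


alleles = ["control", "1_hoge", "1"]
print(split_allele_and_label("1_hoge_001", alleles))
print(split_allele_and_label("control_002_extra", alleles))
```

Call sites where the allele is not the first field, such as report_generator.py and sequence_exporter.py, would need the same idea applied at a different position in the string.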

Steps/Code to Reproduce

Operating System

WSL2

Python version

3.10

DAJIN2 version

0.4.6

Additional context

Thank you @geedrn for reporting the issue!!

akikuno commented 6 months ago

Modified the system to separate intermediate files using a directory structure instead of underscores ("_"), ensuring that no errors occur even if users use allele names containing underscores ("_").

The implementation will be reflected in DAJIN2 v0.5.0.
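The directory-based approach can be sketched as follows. The layout (base / allele / label.jsonl), the base directory name, and the helper names are assumptions for illustration and are not taken from the v0.5.0 source; the point is only that the allele and label are recovered from path components rather than from a delimiter inside the file name.

```python
from pathlib import Path


def build_path(base: Path, allele: str, label: str) -> Path:
    # Hypothetical layout: one directory per allele, one file per label.
    return base / allele / f"{label}.jsonl"


def parse_path(path: Path) -> tuple[str, str]:
    # The allele is the parent directory name and the label is the file stem,
    # so no split("_") is needed.
    return path.parent.name, path.stem


path = build_path(Path("midsv"), "1_hoge", "001")
print(path.as_posix())
print(parse_path(path))
```

Under this layout an underscore in the allele name never interacts with the parsing. An allele name containing a path separator would still need separate handling.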