akikuno / DAJIN2

🔬 Genotyping tool for genome-edited samples, utilizing nanopore sequencer target sequencing
MIT License

genome_fetcher.py error #26

Open takeiga opened 3 months ago

takeiga commented 3 months ago

After updating to 0.4.3, genome_fetcher.py reported an error:

DAJIN2 --control barcode01 --sample barcode02 --allele actc1L_cont_knockin.fa --name 02 --genome xenLae2 --threads 8
2024-03-29 11:48:17, INFO, barcode01 is now processing...
2024-03-29 11:48:19, ERROR, Catch an Exception. Traceback:
Traceback (most recent call last):
  File "/home/igawa/miniconda3/envs/dajin2/bin/DAJIN2", line 8, in <module>
    sys.exit(execute())
  File "/home/igawa/miniconda3/envs/dajin2/lib/python3.10/site-packages/DAJIN2/main.py", line 236, in execute
    execute_single_mode(arguments)
  File "/home/igawa/miniconda3/envs/dajin2/lib/python3.10/site-packages/DAJIN2/main.py", line 47, in execute_single_mode
    core.execute_control(arguments)
  File "/home/igawa/miniconda3/envs/dajin2/lib/python3.10/site-packages/DAJIN2/core/core.py", line 26, in execute_control
    ARGS: FormattedInputs = preprocess.format_inputs(arguments)
  File "/home/igawa/miniconda3/envs/dajin2/lib/python3.10/site-packages/DAJIN2/core/preprocess/input_formatter.py", line 96, in format_inputs
    genome_coordinates = get_genome_coordinates(genome_urls, fasta_alleles, is_cache_genome, tempdir)
  File "/home/igawa/miniconda3/envs/dajin2/lib/python3.10/site-packages/DAJIN2/core/preprocess/input_formatter.py", line 67, in get_genome_coordinates
    genome_coordinates = preprocess.fetch_coordinates(genome_coordinates, genome_urls, fasta_alleles["control"])
  File "/home/igawa/miniconda3/envs/dajin2/lib/python3.10/site-packages/DAJIN2/core/preprocess/genome_fetcher.py", line 29, in fetch_coordinates
    coordinate_start = fetch_seq_coordinates(genome, blat_url, seq_start)
  File "/home/igawa/miniconda3/envs/dajin2/lib/python3.10/site-packages/DAJIN2/core/preprocess/genome_fetcher.py", line 18, in fetch_seq_coordinates
    raise ValueError(f"{seq[:60]}... is not found in {genome}")
ValueError: TTATAATTCAGCATCTAGACAGCAGCAACAAGCATTACCCTGGAATGGTTCATAATATGC... is not found in xenLae2

I confirmed that the run completes successfully when I swap in the older genome_fetcher.py, so the error may come from the updated version. Thank you for your effort, anyway!

akikuno commented 3 months ago

I appreciate your reports!

As you mentioned, I updated genome_fetcher.py to require a perfect match between the control sequence and the reference genome. Consequently, the control sequence in actc1L_cont_knockin.fa may not match the reference exactly. Could you share the control sequence from actc1L_cont_knockin.fa? If possible, I would like to use it to investigate the cause of the error.
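To illustrate what a perfect-match requirement implies, here is a minimal local sketch. It is not the code in genome_fetcher.py; the helper names `reverse_complement` and `find_exact_match` are hypothetical, and the reference is assumed to be a plain string already loaded in memory. A single differing base is enough for the lookup to fail.

```python
from typing import Optional, Tuple


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def find_exact_match(reference: str, query: str) -> Optional[Tuple[int, str]]:
    """Hypothetical helper: locate `query` in `reference` with zero mismatches.

    Returns (0-based start on the forward strand, strand), or None if the
    query does not occur exactly on either strand.
    """
    ref = reference.upper()
    q = query.upper()

    # Try the forward strand first.
    pos = ref.find(q)
    if pos != -1:
        return pos, "+"

    # Then try the reverse strand by searching for the reverse complement.
    pos = ref.find(reverse_complement(q))
    if pos != -1:
        return pos, "-"

    # No exact occurrence: even one substituted base ends up here.
    return None
```

With a toy reference such as `"TTGACCATGCAA"`, the query `"ACCATG"` is found on the forward strand, while `"ACCTTG"` (one base changed) is not found at all, which is the kind of situation that produces an "is not found in" error.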

> I confirmed run completed successfully when I replaced older genome_fetcher.py, so it may come from updated one.

I am pleased to hear this. Your feedback is greatly appreciated! Thank you for your valuable contribution.

takeiga commented 3 months ago

Our control sequence is derived from the latest NCBI Xenopus laevis reference genome (Xenopus_laevis_v10.1). However, I specified the older UCSC assembly, xenLae2, as the genome for DAJIN2, and the two assemblies contain nucleotide differences. I now understand why genome_fetcher.py raised the error, and I will use the UCSC sequence when needed. If possible, I hope a future update of DAJIN2 will allow NCBI genomes to be used as the reference.
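For anyone who wants to see where two versions of the same locus differ, a minimal sketch follows. It assumes both sequences have already been extracted and have the same length, so only substitutions are reported; the function name `list_mismatches` is hypothetical, and insertions or deletions between assemblies would need a real aligner instead.

```python
from typing import List, Tuple


def list_mismatches(seq_a: str, seq_b: str) -> List[Tuple[int, str, str]]:
    """Hypothetical helper: report substitutions between two equal-length sequences.

    Returns a list of (0-based position, base in seq_a, base in seq_b).
    """
    if len(seq_a) != len(seq_b):
        # Unequal lengths imply indels, which this position-wise check cannot handle.
        raise ValueError("Sequences differ in length; use an aligner instead.")

    return [
        (i, a, b)
        for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()))
        if a != b
    ]
```

An empty result means the two sequences are identical, which is the condition an exact-match lookup against the chosen assembly needs.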

akikuno commented 2 months ago

Thank you for the explanation! I'll rethink the method for obtaining genome coordinates. It might take some time, but I'll post here once it's updated.