MicrobialDarkMatter / nanomotif

Nanomotif - a tool for identifying methylated motifs in metagenomic samples
MIT License
16 stars 0 forks source link

AssertionError: DNA sequence must be a nucleotide sequence of ATGCRYSWKMBDHVN #56

Open brambloemen opened 3 weeks ago

brambloemen commented 3 weeks ago

Hello,

Thanks for the wonderful tool!

It seems to run fine on many datasets, but for me it fails for one specific dataset, with the following error report:

Activating conda environment: .snakemake/conda/1c99e828952b31ca3eef31b2ad43ef5c_
WARNING:root:Output directory results/F6D39/NanoMotif/ already exists
Traceback (most recent call last):
  File "/scratch/brbloemen/FLUPOUL/FLUPOULmeta/workflow/.snakemake/conda/1c99e828952b31ca3eef31b2ad43ef5c_/bin/nanomotif", line 10, in <module>
    sys.exit(main())
  File "/scratch/brbloemen/FLUPOUL/FLUPOULmeta/workflow/.snakemake/conda/1c99e828952b31ca3eef31b2ad43ef5c_/lib/python3.9/site-packages/nanomotif/main.py", line 514, in main
    motif_discovery(args)
  File "/scratch/brbloemen/FLUPOUL/FLUPOULmeta/workflow/.snakemake/conda/1c99e828952b31ca3eef31b2ad43ef5c_/lib/python3.9/site-packages/nanomotif/main.py", line 299, in motif_discovery
    assembly = nm.load_assembly(args.assembly)
  File "/scratch/brbloemen/FLUPOUL/FLUPOULmeta/workflow/.snakemake/conda/1c99e828952b31ca3eef31b2ad43ef5c_/lib/python3.9/site-packages/nanomotif/dataload.py", line 46, in load_assembly
    return Assembly(load_fasta(path))
  File "/scratch/brbloemen/FLUPOUL/FLUPOULmeta/workflow/.snakemake/conda/1c99e828952b31ca3eef31b2ad43ef5c_/lib/python3.9/site-packages/nanomotif/seq.py", line 13, in __init__
    self.assembly = {name:DNAsequence(seq) for name, seq in assembly.items()}
  File "/scratch/brbloemen/FLUPOUL/FLUPOULmeta/workflow/.snakemake/conda/1c99e828952b31ca3eef31b2ad43ef5c_/lib/python3.9/site-packages/nanomotif/seq.py", line 13, in <dictcomp>
    self.assembly = {name:DNAsequence(seq) for name, seq in assembly.items()}
  File "/scratch/brbloemen/FLUPOUL/FLUPOULmeta/workflow/.snakemake/conda/1c99e828952b31ca3eef31b2ad43ef5c_/lib/python3.9/site-packages/nanomotif/seq.py", line 54, in __init__
    self._check_sequence(sequence)
  File "/scratch/brbloemen/FLUPOUL/FLUPOULmeta/workflow/.snakemake/conda/1c99e828952b31ca3eef31b2ad43ef5c_/lib/python3.9/site-packages/nanomotif/seq.py", line 71, in _check_sequence
    assert all(letter in self.iupac for letter in list(set(sequence))), f"DNA sequence must be a nucleotide sequence of {''.join(self.iupac)}"
AssertionError: DNA sequence must be a nucleotide sequence of ATGCRYSWKMBDHVN

The fastq files with methylation data were created as follows with dorado v0.7.0:

model=/scratch/brbloemen/dorado_models/dna_r10.4.1_e8.2_400bps_sup@v5.0.0
dorado basecaller $model "$INPUT_POD5_PATH" --device cuda:all -b 320 --modified-bases 4mC_5mC 6mA > "basecalling_dor7mod5_modified/${filename}_mods.bam"

samtools fastq "basecalling_dor7mod5_modified/${filename}_mods.bam" -T '*' | pigz -p9 > "basecalling_dor7mod5_modified/${filename}_mods.fastq.gz"

For the assembly I used Flye v2.9.3, using the --meta and --nano-hq parameters.

I aligned the fastq to the assembly fasta with minimap 2.26:

minimap2 -ax map-ont -y -t {threads} {input.assembly_fasta} {input.fastq} > {output.sam}

and filtered, sorted and indexed with samtools:

samtools view -hb -F0x904 -@{threads} {input.sam} | samtools sort -@{threads} > {output.bam}
samtools index -@{threads} {output.bam} -o {output.bai}
samtools faidx {input.assembly} -o {output.fai}
SorenHeidelbach commented 1 week ago

The problem is loading the contigs in the assembly file. It seems there are characters that are not on the IUPAC form in a contig sequence. If you could check if there are characters that are not one of ATGCRYSWKMBDHVN