Saskia-Oosterbroek / decona

fastq to polished sequenses: pipeline suitable for mixed samples and long (Nanopore) reads
MIT License
43 stars 12 forks source link

Fastq files filtered for quality but then no longer found #53

Closed sjbald closed 3 months ago

sjbald commented 4 months ago

Hi Saskia!

Hello from Canada!

I am using Decona to genotype some amplicon data and I have run into some trouble.

First, I was able to run the example data with no problem.

But, with, my own data, in the output I get no sequences: "total raw sequences = 0 total filtered sequences = 0"

In the output files, I get the original fastq file, the quality filtered fastq files (which is slightly smaller, so some sequences were removed), but then the fasta file is empty.

The pipeline progresses through the rest of the steps but with no sequences.

Do you have any idea what might be happening? The fastq files are good (high quality scored from fastQC) and I have done alignments and assembled them to a reference genome and they look good.

Any ideas at all will be so helpful!

Best, Sarah Baldwin

sjbald commented 4 months ago

I just wanted to add my commands. This problem happens regardless of if I use: decona -f -T 10 -c 0.95 -n 100 -M -g 1 -r -i -s -q 10 or the simplified: decona -T 10 -d 0

sjbald commented 4 months ago

I figured out the issue. In the decona script file the fastq files are converted to fasta files like this:

cat "$file" | grep -A 1 'runid' | sed '/^--$/d' | sed 's/^@/>/' | awk '{print $1}' > "${file%%.*}.fasta"

My files did not have 'runid' in their headers, so I needed to figure out what was always found in the headers and I replaced that with 'runid' using sed:

sed -i 's/mytext/runid/g' filename.fastq

Saskia-Oosterbroek commented 3 months ago

Hi Sarah,

Thanks for the comments. I assume you had Dorado basecalled data, or at least something other than Guppy. The 'runid' was always in the Guppy output. I changed this in the new Decona version 1.4, so it works irrelevant of the basecaller used :)

Best, Saskia

sjbald commented 3 months ago

Thanks Saskia!

Excellent! Yes, I used the newest version of Dorado (v0.7.2). Thank you for your reply.

Now I'm struggling with extracting information from the output data. My code is:

decona -T 10 -q 20 -i . -l 2000 -m 2300 -c 0.99 -s -n 100

This could be completely my mistake, but it seems that -i now specifies the path to the input files instead of giving info about % sequences assigned to clusters. Is that true? If so, is there any way to extract the percentage of reads assigned to cluster as well as the corresponding consensus sequence just for the larger clusters? I am extracting genotype data for alleles (where there may be 1-4 alleles per individual). I don't think that -s is giving me the SNP call output that I would have expected from Medaka either. Finally, my output seems to contain all clusters, not just clusters with a minimum of 100 reads. Is this something buggy on my end?

In the output folder I get subfolders for all of my individuals as well as these summary files: combined2.txt combined.txt Decona_logfile.txt percentage2.txt percentage_per_barcode.txt

And within the individual folders I have: Individual1.trimmed.fastq Individual1.fastq Individual1.fasta cluster_representatives.clstr cluster_representatives.fa multi-seq report_Individual1.txt size_report_Individual1.fasta.txt

I currently have to search for each sequence name from cluster_representatives.fa in cluster_representatives.clstr to name the consensus sequences and then search for which clusters in cluster_representatives.clstr have the most sequences and extract them and their line numbers. I feel like there must be doing this in the least efficient way. Do you have any suggestions for how I can approach this in a smarter way?

Thank you for your time! Best, Sarah Baldwin