edgardomortiz / Captus

Assembly of Phylogenomic Datasets from High-Throughput Sequencing data
https://edgardomortiz.github.io/captus.docs/
GNU General Public License v3.0

Number of reads per contig? #14

Open Zachary-Muscavitch opened 1 month ago

Zachary-Muscavitch commented 1 month ago

I'm wondering whether it would be possible to indicate the number of reads assembled into each contig. I am doing target capture on metagenomic extractions that contain multiple organisms, some of which are congeners. As a result, when extracting and aligning sequences, the sequences flagged as paralogs aren't necessarily paralogs and may instead be homologs from different organisms.

Often, these homologs are present in vastly different concentrations in the source material, and this is reflected in the number of reads for each homolog. It would be nice to have an option in the paralog filtering step to filter by the number of reads per contig, as I usually want the contig with the greatest read depth rather than the one most similar to the reference.

Maybe this is already somewhere in the output files.

edgardomortiz commented 1 month ago

Dear @Zachary-Muscavitch ,

This kind of filtering is exactly what we are working on now. We will use Salmon to determine the real coverage of each contig, and we will also add an option to recover only targets that were assembled in a single contig.

In the meantime, all of this information is in the stats.tsv file inside the extraction folder: the contig names include the MEGAHIT coverage estimate (cov_x.xxx).
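As a rough illustration of that parsing, here is a minimal Python sketch that pulls the `cov_x.xxx` value out of a contig name and keeps the highest-coverage contig per locus. The function names, the `(locus, contig_name)` input shape, and the example contig names are hypothetical; the only thing taken from the description above is that the name contains a `cov_` fragment followed by a number. Reading the right columns out of stats.tsv is left out because the exact headers are not given here.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

# Matches the "cov_x.xxx" fragment of a contig name; nothing else about
# the name format is assumed.
COV_PATTERN = re.compile(r"cov_(\d+(?:\.\d+)?)")


def parse_coverage(contig_name: str) -> Optional[float]:
    """Return the coverage estimate embedded in a contig name, or None."""
    match = COV_PATTERN.search(contig_name)
    return float(match.group(1)) if match else None


def deepest_contig_per_locus(
    hits: Iterable[Tuple[str, str]],
) -> Dict[str, Tuple[str, float]]:
    """Keep, for each locus, the contig with the highest coverage estimate.

    `hits` is a hypothetical iterable of (locus, contig_name) pairs, e.g.
    built from the relevant columns of stats.tsv. Contigs whose names carry
    no coverage value are skipped.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for locus, contig_name in hits:
        cov = parse_coverage(contig_name)
        if cov is None:
            continue
        if locus not in best or cov > best[locus][1]:
            best[locus] = (contig_name, cov)
    return best


# Made-up example names, only to show the call shape.
example_hits = [
    ("locusA", "contig_1_cov_3.500"),
    ("locusA", "contig_2_cov_12.250"),
    ("locusB", "contig_3_cov_1.000"),
    ("locusB", "contig_4"),
]
result = deepest_contig_per_locus(example_hits)
```

Note that this ranks contigs by the assembler's coverage estimate, not by an actual read count, so it is only a stand-in until per-contig read mapping is available.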

I know that right now this means a lot of parsing on your side, but I hope it helps until I finish the new filters.

Best,

Edgardo