Closed bsmith89 closed 6 months ago
One piece of data regarding the unintuitive flags I'm using with samtools depth: at least for one (reasonably large) example BAM, the -g SECONDARY option has no effect on the total number of mapped bases counted up by samtools depth.
samtools depth -g SECONDARY --min-MQ 0 <sorted_bam_output> | awk '{tally+=$3}END{print tally}' > tally.with_secondary.txt
samtools depth --min-MQ 0 <sorted_bam_output> | awk '{tally+=$3}END{print tally}' > tally.no_secondary.txt
I therefore think that flag can be dropped. This suggests my BAMs do not include any secondary alignments, which may be either due to bowtie2 or my downstream processing.
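For anyone repeating this check without awk, here is a minimal Python sketch of the same tally. It assumes the default tab-separated `samtools depth` output for a single BAM (reference name, 1-based position, depth), so the depth is the third column. The function name `total_mapped_bases` and the example rows are made up for illustration.

```python
import io


def total_mapped_bases(depth_lines):
    """Sum the depth column of `samtools depth` text output (single BAM)."""
    total = 0
    for line in depth_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumed columns: reference name, 1-based position, depth.
        _ref, _pos, depth = line.split("\t")[:3]
        total += int(depth)
    return total


# Made-up stand-ins for the two `samtools depth` outputs being compared.
with_secondary = io.StringIO("geneA\t1\t3\ngeneA\t2\t4\ngeneB\t7\t1\n")
no_secondary = io.StringIO("geneA\t1\t3\ngeneA\t2\t4\ngeneB\t7\t1\n")

tally_with = total_mapped_bases(with_secondary)
tally_without = total_mapped_bases(no_secondary)
print(tally_with, tally_without, tally_with == tally_without)
```

If the two tallies match on a real BAM, that is consistent with the file containing no secondary alignments, but it does not by itself prove it.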
Bowtie2 index is built on all centroid99 sequences (after filtering) from a list of species. This is already implemented by midas2 run_genes --species_list, where the user can build one bowtie2 pangenome database for the given list of species. Bowtie2 is run with the following flags:
bowtie2 -x <centroid99_bowtie_db> --no-unal --mm -U <input_r1> -U <input_r2> --seed <random_seed> --local --ignore-quals --end-to-end --very-sensitive
[ ] I can add the flexibility to pass optional bowtie2 aligner parameters to MIDAS2.
The post-alignment filter options can be customized to apply no filter.
Total number of bases mapping to each gene is summed up across all positions; this is already implemented in MIDAS2 run_genes. Divide the total number of mapped bases by the centroid99 length. Sum all of these c99 depths in each c95 (or c90/85/80/75, depending on desired clustering).
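The per-gene mean depth described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not MIDAS2's implementation: the names `mean_depth_per_gene`, `depth_records`, and `gene_lengths` are hypothetical, and the records stand in for parsed (gene, position, depth) rows.

```python
from collections import defaultdict


def mean_depth_per_gene(depth_records, gene_lengths):
    """Mean depth = total mapped bases per centroid99 / centroid99 length."""
    total_bases = defaultdict(int)
    for gene, _pos, depth in depth_records:
        total_bases[gene] += depth
    # Genes with no reported positions get a mean depth of 0.0.
    return {gene: total_bases[gene] / length
            for gene, length in gene_lengths.items()}


# Made-up example: (gene, position, depth) rows and gene lengths in bases.
records = [("c99_a", 1, 2), ("c99_a", 2, 4), ("c99_b", 1, 3)]
lengths = {"c99_a": 3, "c99_b": 6, "c99_c": 10}

print(mean_depth_per_gene(records, lengths))
```

Dividing by the full gene length, rather than by the number of covered positions, is what makes this a mean "vertical coverage" over the whole centroid.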
By "sum all of these c99 depths", do you mean the coverage (defined as total number of bases mapped per centroid_99 divided by centroid99 length)?
Here is what I proposed for calculating the copy number per centroid_99:
clean imported genomes.
Report the prevalence of centroid_xx for select list of genomes to genes_summary.tsv.
Add the option of choosing the denominator for the copy number computation per centroid_99:
I need to think about the potential use case for this update. At the least, you can re-compute the copy number in your script if MIDAS2 provides the centroid_xx prevalence information.
Please let me know if I missed anything. Thanks.
Chunyu
By "sum all of these c99 depths", do you mean the coverage (defined as total number of bases mapped per centroid_99 divided by centroid99 length)?
Yes. I'm using depth throughout to mean mean "vertical coverage" and agree with your definition. I just don't like the word "coverage" because its meaning can be ambiguous. "Mean depth" seems unambiguous.
This is already implemented by midas2 run_genes --species_list, where the user can build one bowtie2 pangenome database for the given list of species. Total number of bases is already implemented in MIDAS2 run_genes.
Awesome!
I can add the flexibility to pass optional bowtie2 aligner parameters to MIDAS2. The post-alignment filter options can be customized to apply no filter.
Great! I know that I found some non-intuitive things about how samtools depth calculates depth (vertical coverage). So we should probably double check that it really is no filter.
I will add the flexibility to compute the coverage at either the centroid99 or centroid95 level.
This shouldn't actually matter. As long as c95 depth (vertical coverage) is the sum of all c99 depths, I can sum it up at any level and get the same answer. I do all my downstream work at c75.
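A small sketch of why the aggregation level should not matter, assuming the clusters nest (every c99 belongs to exactly one c95, and every c95 to exactly one c75). The gene IDs, cluster maps, and the helper `sum_depth_by_cluster` are all hypothetical, not MIDAS2 names.

```python
from collections import defaultdict


def sum_depth_by_cluster(depth_by_gene, cluster_of):
    """Sum member depths within each cluster."""
    out = defaultdict(float)
    for gene, depth in depth_by_gene.items():
        out[cluster_of[gene]] += depth
    return dict(out)


# Made-up c99 mean depths and nested cluster assignments.
c99_depth = {"g1": 2.0, "g2": 0.5, "g3": 1.5, "g4": 4.0}
c99_to_c95 = {"g1": "A", "g2": "A", "g3": "B", "g4": "C"}
c95_to_c75 = {"A": "X", "B": "X", "C": "Y"}
c99_to_c75 = {gene: c95_to_c75[c95] for gene, c95 in c99_to_c95.items()}

# Route 1: sum c99 depths straight into c75 clusters.
direct = sum_depth_by_cluster(c99_depth, c99_to_c75)
# Route 2: sum into c95 first, then sum the c95 totals into c75.
via_c95 = sum_depth_by_cluster(
    sum_depth_by_cluster(c99_depth, c99_to_c95), c95_to_c75
)

print(direct == via_c95)
```

With real floating-point depths the two routes can differ by rounding error, so a tolerance-based comparison is safer than strict equality.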
Here is what I proposed for calculating the copy number per centroid_99...
Yeah, let's have a more involved discussion about this one. I'm not sure what's best for the various audiences.
All of the above-mentioned changes have been implemented in version 2.0.0.rc2.
Based on the StrainPGC workflow, here is a different way to estimate gene depth:
Bowtie2 index is built on all centroid99 sequences (after filtering) from a list of species (see L1-17 and L59-95)
Bowtie2 is run with the following flags (I'm not sure how sensitive the results are to these): (see L100-146)
(Note: sort_bam.sh is meant as a placeholder for my implementation of BAM sorting.)
(Note: sum_by_genes_script.awk is also a placeholder for my hack-y implementation.)
(...We should also discuss the -g SECONDARY and --min-MQ 0 flags. I'm not remembering the details of why I do that, but I vaguely remember there being something very unintuitive about how samtools depth filters reads... 🤔)
This step is easiest to do for each species individually, while also merging all samples together into one file. (see L212-246 and the relevant script: merge_pangenomes_depth.py)
An update to MIDAS's run_genes and/or merge_genes that accomplishes all of the above would be ideal. Alternatively, I would also welcome including any subset of those steps.
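To make the per-species merge step concrete, here is a minimal sketch of combining per-sample gene depths into one gene-by-sample table. It is not the merge_pangenomes_depth.py script referenced above; the function `merge_sample_depths`, the sample and gene names, and the choice to fill missing genes with 0.0 are all assumptions made for illustration.

```python
def merge_sample_depths(depth_by_sample):
    """Build a gene-by-sample table from {sample: {gene: mean_depth}}."""
    samples = sorted(depth_by_sample)
    genes = sorted({gene
                    for depths in depth_by_sample.values()
                    for gene in depths})
    header = ["gene_id"] + samples
    # Assumption: a gene absent from a sample's table has depth 0.0 there.
    rows = [[gene] + [depth_by_sample[sample].get(gene, 0.0)
                      for sample in samples]
            for gene in genes]
    return header, rows


# Made-up per-sample depths for one species.
per_sample = {
    "sample1": {"g1": 2.0, "g2": 0.5},
    "sample2": {"g2": 1.0},
}

header, rows = merge_sample_depths(per_sample)
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```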