Feature request: Change how gene depth is estimated

bsmith89 commented 8 months ago

Based on the StrainPGC workflow, here is a different way to estimate gene depth:

Bowtie2 index is built on all centroid99 sequences (after filtering) from a list of species (see L1-17 and L59-95)
Bowtie2 is run with the following flags (I'm not sure how sensitive the results are to these): (see L100-146)

bowtie2 -x <centroid99_bowtie_db> --no-unal --mm -U <input_r1> -U <input_r2> --seed <random_seed> --local --ignore-quals --end-to-end --very-sensitive | sort_bam.sh > <sorted_bam_output>

(Note: sort_bam.sh is meant as a placeholder for my implementation of BAM sorting.)

Total number of bases mapping to each gene is summed up across all positions, e.g. using samtools depth: (see L186-209)

samtools depth <sorted_bam_output>  -g SECONDARY --min-MQ 0 | sum_by_genes_script.awk > <total_bases_mapping_by_gene>

(Note: sum_by_genes_script.awk is also a placeholder for my hack-y implementation.)

(...We should also discuss the -g SECONDARY and --min-MQ 0 flags. I'm not remembering the details of why I do that, but I vaguely remember there being something very unintuitive about how samtools depth filters reads... 🤔)

Divide the total number of mapped bases by the centroid99 length. Sum all of these c99 depths in each c95 (or c90/85/80/75, depending on desired clustering)

This step is easiest to do for each species individually, while also merging all samples together into one file. (see L212-246 and the relevant script: merge_pangenomes_depth.py)

An update to MIDAS's run_genes and/or merge_genes that accomplishes all of the above would be ideal. Alternatively, I would also welcome including any subset of those steps.

bsmith89 commented 8 months ago

One piece of data regarding the unintuitive flags I'm using with samtools depth:

At least for one (reasonably large) example BAM, the -g SECONDARY option has no effect on the total number of mapped bases counted up by samtools depth.

samtools depth -g SECONDARY --min-MQ 0 <sorted_bam_output> | awk '{tally+=$2}END{print tally}' > tally.with_secondary.txt
samtools depth              --min-MQ 0 <sorted_bam_output> | awk '{tally+=$2}END{print tally}' > tally.no_secondary.txt

I therefore think that flag can be dropped. This suggests my BAMs do not include any secondary alignments, which may be either due to bowtie2 or my downstream processing.

zhaoc1 commented 7 months ago

Bowtie2 index is built on all centroid99 sequences (after filtering) from a list of species

This has already implemented by midas2 run_genes --species_list where the user can build one bowtie2 pangenome database for the given list of species.

Bowtie2 is run with the following flags bowtie2 -x <centroid99_bowtie_db> --no-unal --mm -U <input_r1> -U <input_r2> --seed <random_seed> --local --ignore-quals --end-to-end --very-sensitive

[ ] I can add the flexible of adding optional bowtie2 aligner parameters to MIDAS2.
The post-alignment filter options can be customized to no filter.

Total number of bases mapping to each gene is summed up across all positions. Sum all of these c99 depths in each c95 (or c90/85/80/75, depending on desired clustering)

Total number of bases is already implemented in MIDAS2 run_genes.
[ ] I will add the flexibility of compute the coverage either on centroid99 or centroid95 level.

Divide the total number of mapped bases by the centroid99 length.

By "sum all of these c99 depths", do you mean the coverage (defined as total number of bases mapped per centroid_99 divided by centroid99 length)?

Here is what I proposed for calculating the copy number per centroid_99:

Add centroid_XX prevalence for a given list of genomes per species

[ ] Choice of genome lists: custom list or genomes list filtered by checkM's completeness and/or contamination.
[ ] Might need to run checkM for all clean imported genomes.

Report the prevalence of centroid_xx for select list of genomes to genes_summary.tsv.
Add the options of choosing the denominator of copy number compute per centroid_99:
- Original MIDAS2 implementation, centroid_99s corresponding to the 15 university single copy marker genes, OR
- list of centroid_99s based on the centroid_99s prevalence filter (the user can set up the cutoff).

I need to think about the potential user case of this update. For the least, you can re-compute the copy number in your script if MIDAS2 provide the centroid_xx prevalence information.

The denominator will be the mean or median of the list of centroid_99s from step 4.

Please let me know if I miss out anything. Thanks.

Chunyu

bsmith89 commented 7 months ago

By "sum all of these c99 depths", do you mean the coverage (defined as total number of bases mapped per centroid_99 divided by centroid99 length)?

Yes. I'm using depth throughout to mean mean "vertical coverage" and agree with your definition. I just don't like the word "coverage" because it's meaning can be ambiguous. "Mean depth" seems unambiguous.

This has already implemented by midas2 run_genes --species_list where the user can build one bowtie2 pangenome database for the given list of species. Total number of bases is already implemented in MIDAS2 run_genes.

Awesome!

I can add the flexible of adding optional bowtie2 aligner parameters to MIDAS2. The post-alignment filter options can be customized to no filter.

Great! I know that I found some non-intuitive things about how samtools depth calculates depth (vertical coverage) So we should probably double check that it really is no filter.

I will add the flexibility of compute the coverage either on centroid99 or centroid95 level.

This shouldn't actually matter. As long as c95 depth (vertical coverage) is the sum of all c99 depths, I can sum it up at any level and get the same answer. I do all my downstream work at c75.

Here is what I proposed for calculating the copy number per centroid_99...

Yeah, let's have a more involved discussion about this one. I'm not sure what's best for the various audiences.

zhaoc1 commented 6 months ago

All the above mentioned changes have been implemented in version 2.0.0.rc2

czbiohub-sf / MIDAS

Feature request: Change how gene depth is estimated #127