KwanLab / Autometa

Autometa: Automated Extraction of Genomes from Shotgun Metagenomes
https://autometa.readthedocs.io
Other
40 stars 15 forks source link

Calculation of coverage #239

Closed chtsai0105 closed 2 years ago

chtsai0105 commented 2 years ago

User checklist

Description

I was using the script autometa-bedtools-genomecov to calculate the coverage of contigs. But I found that the length of the contig in the output does not consistent with the one I calculated from the fasta. For example, I have a contig which is 516081 bp in length (calculated from fasta directly):

>NODE_1_length_516081_cov_4.430746
516081

But the output of autometa-bedtools-genomecov shows that it's only 511337 bp in length:

contig  depth_product   bases   coverage
NODE_1_length_516081_cov_4.430746       3029431 511337  5.924529224366709

I then checked the code and found that it seems like those uncovered regions did not count in bases in the coverage output: https://github.com/KwanLab/Autometa/blob/13591f069caf1d6a3c1f6dd91bfbadb3b8d356d0/autometa/common/external/bedtools.py#L105-L108 And from the bed I found there is 4744 bp covered 0 times

NODE_1_length_516081_cov_4.430746       0       4744    516081  0.00919236

516081-4744=511337. Exactly the number showed in the autometa-bedtools-genomecov output.

I just wonder is that is intended to ignore those uncovered regions? Since I'm also try on the Metabat2 pipeline and the script they use to calculate the contig coverage (jgi_summarize_bam_contig_depths) did include those uncovered regions.

contigName      contigLen       totalAvgDepth   sim_3.bam       sim_3.bam-var
NODE_1_length_516081_cov_4.430746       516081  5.818   5.818   9.50315

And I also want to know why to ignore those regions and what the difference between both calculation methods.

Expected Behavior

The contig length should be consistent between the output from autometa-bedtools-genomecov and fasta.

evanroyrees commented 2 years ago

Hello @chtsai0105 thank you for reaching out! You are quite right, this is not the intended behavior.

It appears you have found a bug! πŸ› πŸ™ˆ πŸ™Š

I have removed the df.depth != 0 line. The lengths should now match.

This was recently merged into main. I have also created another release 2.0.1 which has this fix πŸ‘

Great catch! πŸ‘€ πŸ”