DRL / blobtools

Modular command-line solution for visualisation, quality control and taxonomic partitioning of genome datasets
GNU General Public License v3.0
184 stars 44 forks source link

bam2cov question #24

Closed JohnUrban closed 8 years ago

JohnUrban commented 8 years ago

Hi,

Thanks for the great tools.

I have used blobtools bam2cov to get the coverage values from a different set of reads than used to make an assembly.

The output, as you know, looks something like this:

# Total Reads = 20745678
# Mapped Reads = 19921290
# Unmapped Reads = 824388
# Parameters : MQ = 0, No_base_cov_flag = False
# contig_id     read_cov        base_cov
seq656447_len82_cov114  8       4.87804878049
seq7374_len1367_cov82   47      1.71909290417
seq411348_len309_cov55  19      3.07443365696
seq246648_len163_cov164 11      3.37423312883
seq679521_len1744_cov77 137     3.92717889908
.
.
.
.

It looks like base_cov is simply read_length*read_cov/contig_length. Let me know I am wrong in that.

Is it the 3rd column (base coverage) that makes more sense to use with blobtools create since it normalizes the read count by contig length? If so, then I guess one needs to take columns 1 and 3 into a new file to be able to pass to blobtools create using --cov ... correct?

Thanks for your time and thoughts.

--John

JohnUrban commented 8 years ago

While on the topic -- does it make more sense to just skip bam2cov and use --bam with blobtools create ?

DRL commented 8 years ago

Hi John,

sorry for not replying earlier.

blobtools bam2cov was meant for when one has many BAM files and doesn't want them to be parsed serially by blobtools create. One would run

parallel -j X 'blobtools bam2cov ...' ::: *.bam

to generate COV files which contain all the necessary information from a BAM file.

Previously we simply had a COV format which was "contig\tbasecov". But some users wanted also the readplot figures for read dataset QC (for which total number of reads/number of mapped reads and read-coverage is needed), so I implemented the new COV format which has a header and shows read and base-coverage. Read coverage is simply the number of mapped reads for each contig. Base coverage is the number of mapping bases (digit before M's in CIGAR string) divided by the number of AGCT's in the contig.

Blobtools create knows about both COV formats, so no need to modify it. Parsing COV is faster than BAM.

Hope this helps.

cheers,

dom