DRL / blobtools

Modular command-line solution for visualisation, quality control and taxonomic partitioning of genome datasets
GNU General Public License v3.0

Distinguishing 'BLASTed and no match' and 'not blasted' in blob plots #54

Closed: pinin4fjords closed this issue 7 years ago

pinin4fjords commented 7 years ago

Hi,

To conserve resources we don't BLAST short contigs from our de novo assemblies. When we then use BAM files with reads mapped against a reference that does contain those contigs, the blob plots show grey clouds that mix contigs that weren't BLASTed with contigs that were BLASTed but produced no hits (and the ReadCovPlot has uninformative bars).

My proposal to deal with this is an additional parameter to specify the contigs we actually BLASTed, producing plots with separate BLASTed and non-BLASTed no-hit categories. I'm happy to have a go at coding that if necessary. But is there a better way?

Thanks,

Jon

DRL commented 7 years ago

Hi Jon,

I believe there is. There are actually two ways to deal with this:

a) You can make a file of all contig_ids that have not been blasted (see below) and provide that file with the --catcolour option to blobtools plot or covplot:

contig_2,not_blasted
contig_10,not_blasted
contig_15,not_blasted
...
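As a rough illustration of option a), the catcolour file could be produced with a few lines of Python. This is only a sketch under assumptions: the helper names (fasta_ids, catcolour_lines), the label not_blasted and the tiny in-memory assembly are made up for the example, and it assumes you already have the list of contig IDs that were BLASTed.

```python
import io


def fasta_ids(handle):
    """Yield the sequence ID (first word after '>') of each FASTA record."""
    for line in handle:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


def catcolour_lines(all_ids, blasted_ids, label="not_blasted"):
    """Return 'contig_id,label' lines for every contig that was not BLASTed."""
    blasted = set(blasted_ids)
    return [f"{cid},{label}" for cid in all_ids if cid not in blasted]


# Hypothetical assembly held in memory; in practice open the real FASTA file.
assembly = io.StringIO(
    ">contig_1 len=5000\nACGT\n>contig_2 len=150\nAC\n>contig_10\nGG\n"
)
lines = catcolour_lines(fasta_ids(assembly), blasted_ids=["contig_1"])
print("\n".join(lines))
# contig_2,not_blasted
# contig_10,not_blasted
```

The printed lines would be saved to a file and passed with --catcolour as described above.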

b) Just generate a 'hits' file with the contig_ids you did not blast. As TaxID you can specify '0', which is 'root' in the NCBI taxonomy (and will result in all taxonomic ranks being set to 'undef'), together with a score. E.g.:

contig_2\t0\t1000
contig_10\t0\t1000
contig_15\t0\t1000
...

You can then provide that file as an additional 'hits' file. When doing blobtools view -i blobDB --hits -r all you can recognise those contigs by their being 'undef' at all ranks, or by the fact that they only got hits from that 'hits' file. However, if you want to plot them you still have to give it a catcolour file, since otherwise they will be binned with other contigs that are annotated as 'undef' because of the NCBI taxonomy (that happens if a taxon has no taxonomy at a given rank).
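For option b), the placeholder 'hits' file could be written along these lines. Again this is just a sketch: the function name placeholder_hits is made up, the three contig IDs are the ones from the example above, and the TaxID 0 and score 1000 are simply the values suggested in this comment.

```python
import sys


def placeholder_hits(contig_ids, taxid=0, score=1000):
    """Return tab-separated 'contig_id<TAB>taxid<TAB>score' rows."""
    return [f"{cid}\t{taxid}\t{score}" for cid in contig_ids]


# Contig IDs that were deliberately not BLASTed (illustrative values).
not_blasted = ["contig_2", "contig_10", "contig_15"]
rows = placeholder_hits(not_blasted)

# Write to stdout here; redirect to a file to get the extra 'hits' file.
sys.stdout.write("\n".join(rows) + "\n")
```

Keeping the same list of IDs around is handy, because the catcolour file needed for plotting can be built from it as well.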

Let me know if this helps.

cheers,

dom

pinin4fjords commented 7 years ago

Right now we only care about the plots (blob and read coverage), so it'll be option a), I think. Thanks for the quick response.

Jon