dib-lab / charcoal

Remove contaminated contigs from genomes using k-mers and taxonomies.

Requested output format for Charcoal report html #13

Open taylorreiter opened 4 years ago

taylorreiter commented 4 years ago

Tools like FastQC generate an html report that summarizes the quality of a single sample, while tools like MultiQC summarize FastQC into one report that includes metrics for all samples. @ctb and I think this would be useful for charcoal. Our current thought is to generate this report using R & RMarkdown. R has many libraries that I think would be good for visualizing contamination, while RMarkdown knits to html which will allow interactive plots.

This issue outlines suggested visualizations, as well as the output that charcoal would need to produce to support these visualizations.

Charcoal output formats

It would be helpful if charcoal could output three summary files, outlined below:

  1. Contig summary csv: a csv file with columns
     a. genomefile: the file name/path of the genome analyzed by charcoal
     b. contig_name: the name of the contig
     c. clean_or_dirty: whether the contig is clean or dirty, i.e. whether the contig is likely a contaminant
     d. contig_length_bp: length of the contig in basepairs
     e. reason: reason a contig is specified as dirty (0-3)
     f. n_hashes_matched: number of hashes that matched to the lineage
     g. lineage: full lineage of the contig from root to species/strain
     h. chimeric: whether the contig contains two or more lineages, reported as yes/no. There might be a better way to handle this information, but I'm not sure.
  2. sourmash compare matrix (csv) of angular distance between contig signatures, computed with abundance tracking (--track-abundance on) and a k-mer size of 4.
  3. sourmash compare matrix (csv) of jaccard distance between contig signatures, computed without abundance tracking (--track-abundance off) and a k-mer size of 21.
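
To make the contig summary csv concrete, here is a minimal Python sketch of writing it. The column names come from the list above; the function name `write_contig_summary`, the example rows, the lineage string format, and the use of reason 0 for a clean contig are all assumptions for illustration, not charcoal's actual behavior.

```python
import csv
import io

# Column names taken from the contig summary description above.
CONTIG_SUMMARY_COLUMNS = [
    "genomefile", "contig_name", "clean_or_dirty", "contig_length_bp",
    "reason", "n_hashes_matched", "lineage", "chimeric",
]


def write_contig_summary(rows, fp):
    """Write one csv row per contig; `rows` is an iterable of dicts."""
    writer = csv.DictWriter(fp, fieldnames=CONTIG_SUMMARY_COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Hypothetical rows: every value here is made up for illustration.
example_rows = [
    {"genomefile": "genomes/example.fa", "contig_name": "contig_1",
     "clean_or_dirty": "clean", "contig_length_bp": 52310,
     "reason": 0,  # assumption: 0 is used when the contig is clean
     "n_hashes_matched": 48,
     "lineage": "Bacteria;Firmicutes", "chimeric": "no"},
    {"genomefile": "genomes/example.fa", "contig_name": "contig_2",
     "clean_or_dirty": "dirty", "contig_length_bp": 8120,
     "reason": 2, "n_hashes_matched": 7,
     "lineage": "Bacteria;Proteobacteria", "chimeric": "no"},
]

buf = io.StringIO()
write_contig_summary(example_rows, buf)
print(buf.getvalue())
```

One row per contig keeps the file easy to load in R or pandas for the plots suggested below.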

Suggested visualizations for a single sample

  1. Metacoder plot of all lineages detected in the sample, colored by number of contigs with that lineage, or by number of hashes that matched to that lineage.
  2. Density plot of contig length, with basepairs on the x axis, colored by clean/dirty.
  3. tSNE on the two compare matrices, colored by clean/dirty
  4. Table summarizing the number of contaminated contigs and the reasons they are contaminated, number of uncontaminated contigs.
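
As a rough illustration of the summary table in item 4, the counts could be derived directly from the contig summary csv. This is only a sketch in Python (the report itself is proposed in R/RMarkdown); `summarize_contigs` and the example data are hypothetical, and it assumes `clean_or_dirty` holds the literal strings "clean" and "dirty".

```python
import csv
import io
from collections import Counter


def summarize_contigs(fp):
    """Return (number of clean contigs, Counter of dirty contigs per reason)."""
    n_clean = 0
    dirty_by_reason = Counter()
    for row in csv.DictReader(fp):
        if row["clean_or_dirty"] == "dirty":
            dirty_by_reason[row["reason"]] += 1
        else:
            n_clean += 1
    return n_clean, dirty_by_reason


# Hypothetical contig summary, reduced to the columns used here.
example_csv = io.StringIO(
    "genomefile,contig_name,clean_or_dirty,contig_length_bp,reason\n"
    "example.fa,contig_1,clean,52310,0\n"
    "example.fa,contig_2,dirty,8120,2\n"
    "example.fa,contig_3,dirty,4011,2\n"
    "example.fa,contig_4,dirty,1500,1\n"
)

n_clean, dirty_by_reason = summarize_contigs(example_csv)
print(f"clean contigs: {n_clean}")
print(f"dirty contigs: {sum(dirty_by_reason.values())}")
for reason, count in sorted(dirty_by_reason.items()):
    print(f"  reason {reason}: {count}")
```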

Suggested visualization summarizing all samples

  1. heatmap of number of hashes or basepairs detected per lineage per sample. The example is black and white, but a color scale could be added to convey the amount of contamination. The y axis would also be labeled by lineage. [example heatmap image: chap11-heatmapCCNB1-1]
  2. bar chart of number of hashes or base pairs detected as contamination or not contamination for each sample.
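
The heatmap input could be built by summing `n_hashes_matched` per lineage per sample across the contig summary files. A minimal Python sketch, assuming the per-sample csvs have been concatenated and that `genomefile` identifies the sample; the function name and example values are hypothetical.

```python
import csv
import io
from collections import defaultdict


def hashes_per_lineage_per_sample(fp):
    """Sum n_hashes_matched into a lineage x sample matrix (rows = lineages)."""
    counts = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(fp):
        counts[row["lineage"]][row["genomefile"]] += int(row["n_hashes_matched"])
    lineages = sorted(counts)
    samples = sorted({s for per_sample in counts.values() for s in per_sample})
    # Lineages not seen in a sample get 0 hashes.
    matrix = [[counts[lin].get(s, 0) for s in samples] for lin in lineages]
    return lineages, samples, matrix


# Hypothetical combined contig summary, reduced to the columns used here.
example_csv = io.StringIO(
    "genomefile,lineage,n_hashes_matched\n"
    "a.fa,Bacteria;Firmicutes,40\n"
    "a.fa,Bacteria;Proteobacteria,5\n"
    "b.fa,Bacteria;Firmicutes,12\n"
    "a.fa,Bacteria;Firmicutes,8\n"
)

lineages, samples, matrix = hashes_per_lineage_per_sample(example_csv)
print("lineage", *samples, sep="\t")
for lineage, row in zip(lineages, matrix):
    print(lineage, *row, sep="\t")
```

The same aggregation with `contig_length_bp` in place of `n_hashes_matched` would give the basepair version, and summing the rows by clean/dirty would feed the bar chart.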

I think there are more useful visualizations that I could come up with using the outputs outlined above. I can add to this issue as I think of more!

taylorreiter commented 4 years ago

Notes on running rmarkdown from snakemake:

taylorreiter commented 4 years ago

Some good summary stats:

637 of the 2000 MAGs I ran have 0 bp of contamination 
The average f_major was 95.1% 
The average contaminant contig is 10,731 bp (sd 16,838bp) long (dirty_bp/dirty_n)
70 bins had an f_major less than 70%
10 bins had too few identifiable hashes, with less than 20% identifiable
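
For the report, stats like these could be computed from a per-genome table. The sketch below is Python with made-up numbers; it assumes one record per genome with `f_major`, `dirty_bp`, and `dirty_n` fields (the names used above), and it assumes the contaminant contig length is averaged over per-genome `dirty_bp/dirty_n` ratios, which may not be how the figures above were derived.

```python
from statistics import mean, stdev

# Hypothetical per-genome records; values are invented for illustration.
genomes = [
    {"genome": "mag_1", "f_major": 0.99, "dirty_bp": 0, "dirty_n": 0},
    {"genome": "mag_2", "f_major": 0.65, "dirty_bp": 30000, "dirty_n": 3},
    {"genome": "mag_3", "f_major": 0.90, "dirty_bp": 8000, "dirty_n": 2},
]

# Genomes with 0 bp of contamination.
n_uncontaminated = sum(1 for g in genomes if g["dirty_bp"] == 0)

# Average f_major, and genomes with f_major below 70%.
mean_f_major = mean(g["f_major"] for g in genomes)
n_low_f_major = sum(1 for g in genomes if g["f_major"] < 0.70)

# Mean contaminant contig length per genome (dirty_bp / dirty_n),
# skipping genomes with no dirty contigs to avoid dividing by zero.
contig_lengths = [g["dirty_bp"] / g["dirty_n"] for g in genomes if g["dirty_n"] > 0]
mean_len = mean(contig_lengths)
sd_len = stdev(contig_lengths)

print(f"{n_uncontaminated} of {len(genomes)} genomes have 0 bp of contamination")
print(f"average f_major: {mean_f_major:.1%}")
print(f"average contaminant contig: {mean_len:,.0f} bp (sd {sd_len:,.0f} bp)")
print(f"{n_low_f_major} genomes had an f_major less than 70%")
```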

ctb commented 4 years ago

we had also talked, I think, about doing sourmash lca summarize --singleton for all the contigs.