Open taylorreiter opened 4 years ago
Notes on running rmarkdown from snakemake:
Some good summary stats:
637 of the 2000 MAGs I ran have 0 bp of contamination
The average f_major was 95.1%
The average contaminant contig is 10,731 bp (sd 16,838 bp) long (dirty_bp/dirty_n)
70 bins had an f_major less than 70%
10 bins had too few identifiable hashes, with less than 20% identifiable
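Numbers like these could be recomputed from a per-genome table. Below is a minimal Python sketch, assuming one record per genome with the fields f_major, dirty_bp and dirty_n mentioned above. The record layout, the `summarize` helper and the toy values are hypothetical, and the mean contaminant contig length is pooled across genomes (total dirty_bp over total dirty_n), which may not be how the figure above was computed.

```python
from statistics import mean

# Toy per-genome records; the values are invented for illustration only.
genomes = [
    {"genome": "bin1", "f_major": 1.00, "dirty_bp": 0, "dirty_n": 0},
    {"genome": "bin2", "f_major": 0.90, "dirty_bp": 30000, "dirty_n": 3},
    {"genome": "bin3", "f_major": 0.65, "dirty_bp": 50000, "dirty_n": 5},
]


def summarize(records, f_major_cutoff=0.70):
    """Summarize contamination across genomes (hypothetical helper)."""
    total_dirty_bp = sum(r["dirty_bp"] for r in records)
    total_dirty_n = sum(r["dirty_n"] for r in records)
    return {
        "n_genomes": len(records),
        # Genomes with no contaminant base pairs at all.
        "n_uncontaminated": sum(1 for r in records if r["dirty_bp"] == 0),
        "mean_f_major": mean(r["f_major"] for r in records),
        # Pooled mean contaminant contig length: dirty_bp / dirty_n.
        "mean_dirty_contig_bp": (
            total_dirty_bp / total_dirty_n if total_dirty_n else 0.0
        ),
        # Genomes whose f_major falls below the cutoff.
        "n_low_f_major": sum(
            1 for r in records if r["f_major"] < f_major_cutoff
        ),
    }


summary = summarize(genomes)
print(summary)
```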
We had also talked, I think, about doing sourmash lca summarize --singleton for all the contigs.
Tools like FastQC generate an HTML report that summarizes the quality of a single sample, while tools like MultiQC summarize FastQC output into one report that includes metrics for all samples. @ctb and I think this would be useful for charcoal. Our current thought is to generate this report using R & RMarkdown. R has many libraries that I think would be good for visualizing contamination, and RMarkdown knits to HTML, which will allow interactive plots.
This issue outlines suggested visualizations, as well as the output that charcoal would need to produce to support these visualizations.
Charcoal output formats
It would be helpful if charcoal could output three summary files, outlined below:
1. A per-contig table with the following columns:
   a. genomefile: the file name/path of the genome analyzed by charcoal
   b. contig_name: the name of the contig
   c. clean_or_dirty: whether the contig is clean or dirty, i.e. whether the contig is likely a contaminant
   d. contig_length_bp: length of the contig in base pairs
   e. reason: reason a contig is specified as dirty (0-3)
   f. n_hashes_matched: number of hashes that matched to the lineage
   g. lineage: full lineage of the contig from root to species/strain
   h. chimeric: whether the contig contains two or more lineages, reported as yes/no. There might be a better way to handle this information, but I'm not sure.
2. sourmash compare matrix (csv) of angular distance (--track-abundance on): compare signatures of contigs using a k-mer size of 4.
3. sourmash compare matrix (csv) of Jaccard distance (--track-abundance off): compare signatures of contigs using a k-mer size of 21.

Suggested visualizations for a single sample
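As a starting point for single-sample plots, the per-contig table proposed above could be reduced to clean vs. dirty base pairs for one genome. This is only a sketch in Python: the column names come from this issue, but the CSV layout, the `bp_by_status` helper, the lineage string format and the toy rows are assumptions, and the reason field is left empty for clean contigs because the issue does not say what it should hold there.

```python
import csv
import io
from collections import defaultdict

# Toy table using the proposed column names; all rows are invented.
TABLE = """genomefile,contig_name,clean_or_dirty,contig_length_bp,reason,n_hashes_matched,lineage,chimeric
bin1.fa,c1,clean,50000,,12,Bacteria;Firmicutes,no
bin1.fa,c2,dirty,8000,2,3,Bacteria;Proteobacteria,no
bin1.fa,c3,clean,22000,,7,Bacteria;Firmicutes,no
bin2.fa,c1,clean,40000,,9,Bacteria;Bacteroidetes,no
"""


def bp_by_status(handle, genomefile):
    """Total base pairs per clean/dirty label for a single genome."""
    totals = defaultdict(int)
    for row in csv.DictReader(handle):
        # Skip contigs that belong to other genomes.
        if row["genomefile"] != genomefile:
            continue
        totals[row["clean_or_dirty"]] += int(row["contig_length_bp"])
    return dict(totals)


single_sample = bp_by_status(io.StringIO(TABLE), "bin1.fa")
print(single_sample)
```

The resulting dictionary is the kind of input a stacked bar or pie chart of clean vs. dirty sequence would need.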
Suggested visualization summarizing all samples
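For a MultiQC-style view across samples, one option would be to rank genomes by the fraction of their sequence that sits in dirty contigs. The Python sketch below assumes rows of (genomefile, clean_or_dirty, contig_length_bp) pulled from the per-contig output described above. The `dirty_fraction_per_genome` helper and the toy rows are hypothetical.

```python
from collections import defaultdict

# Toy rows: (genomefile, clean_or_dirty, contig_length_bp); values invented.
rows = [
    ("bin1.fa", "clean", 90000),
    ("bin1.fa", "dirty", 10000),
    ("bin2.fa", "clean", 40000),
    ("bin3.fa", "clean", 30000),
    ("bin3.fa", "dirty", 30000),
]


def dirty_fraction_per_genome(rows):
    """Return (genomefile, dirty fraction) pairs, most contaminated first."""
    total = defaultdict(int)
    dirty = defaultdict(int)
    for genomefile, status, length_bp in rows:
        total[genomefile] += length_bp
        if status == "dirty":
            dirty[genomefile] += length_bp
    # Genomes with no dirty contigs get a fraction of 0.0.
    fractions = {g: dirty[g] / total[g] for g in total}
    return sorted(fractions.items(), key=lambda kv: kv[1], reverse=True)


ranking = dirty_fraction_per_genome(rows)
for genomefile, fraction in ranking:
    print(f"{genomefile}\t{fraction:.2f}")
```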
I think there are more useful visualizations that I could come up with using the outputs outlined above. I can add to this issue as I think of more!