chanzuckerberg / shasta

[MOVED] Moved to paoloshasta/shasta. De novo assembly from Oxford Nanopore reads

.csv visualization #285

Closed RasmusBuntzen closed 2 years ago

RasmusBuntzen commented 2 years ago

Hi, I'm trying to tweak some of the configuration settings for my Shasta run. To do this, I want to look at some of the plots from my less successful runs, especially LowHashBucketHistogram.csv. However, I haven't been able to get any meaningful plots out of the csv files. Any general help on how to visualize the csv files would be greatly appreciated.

Thanks in advance, Rasmus

paoloczi commented 2 years ago

To understand bucket population in an assembly I use LowHashBucketHistogram.csv to do a scatter plot of FeatureCount (vertical axis) versus BucketSize (horizontal axis). This needs to be a scatter plot without lines. Then I adjust the horizontal scale, usually to a maximum of 50 or 100 depending on coverage, and I also do some manual adjustments of the vertical scale. The result is something like this:

[image: scatter plot of FeatureCount versus BucketSize]

This was done in LibreOffice Calc, but you could just as well do it in Excel or using Gnuplot.
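If you prefer a script over a spreadsheet, the same plot can be sketched in Python. This is only a minimal sketch, not part of Shasta: it assumes the csv file has a header row containing columns named BucketSize and FeatureCount with integer values, and it assumes matplotlib is installed. The function names and the output file name are made up for illustration.

```python
import csv


def load_histogram(path, max_bucket_size=100):
    """Read (BucketSize, FeatureCount) pairs from the csv file.

    Assumes a header row with columns named BucketSize and FeatureCount.
    Rows with BucketSize above max_bucket_size are dropped, which plays
    the role of limiting the horizontal scale to 50 or 100.
    """
    points = []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            bucket_size = int(row["BucketSize"])
            if bucket_size <= max_bucket_size:
                points.append((bucket_size, int(row["FeatureCount"])))
    return points


def plot_histogram(points, out_path="LowHashBucketHistogram.png"):
    """Write a scatter plot (markers only, no connecting lines) to out_path."""
    # Imported here so that load_histogram works without matplotlib.
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    sizes = [size for size, _ in points]
    counts = [count for _, count in points]
    fig, ax = plt.subplots()
    ax.scatter(sizes, counts, s=8)
    ax.set_xlabel("BucketSize")
    ax.set_ylabel("FeatureCount")
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
```

Calling `plot_histogram(load_histogram("LowHashBucketHistogram.csv", max_bucket_size=100))` should write the plot to a PNG file. The peak near zero can be much taller than the main peak, so you may still want to adjust the vertical scale by hand, for example by adding a call to `ax.set_ylim` before saving.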

The peak near zero shows low-population buckets due to errors. The main peak (around 40 in this case) shows the "healthy" buckets. Their population is determined by coverage, with some loss due to errors. The buckets with extremely high population, excluded from the plot, are caused by repeats.

You want to make sure that --MinHash.minBucketSize and --MinHash.maxBucketSize bracket the body of the main peak. The figure above suggests something like --MinHash.minBucketSize 20 --MinHash.maxBucketSize 60.
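If you want a rough starting point before looking at the plot, the bracketing can be approximated with a simple heuristic. To be clear, this is not something Shasta does and it is no substitute for inspecting the plot: the function name, the `error_cutoff` parameter (bucket sizes below it are treated as the error peak) and the `fraction` parameter (how far down the sides of the main peak to go) are all assumptions made up for this sketch.

```python
def suggest_bucket_range(points, error_cutoff=5, fraction=0.25):
    """Suggest (min_bucket_size, max_bucket_size) around the main peak.

    points: list of (bucket_size, feature_count), sorted by bucket_size.
    error_cutoff: bucket sizes below this are ignored as the error peak.
    fraction: the range extends while feature_count stays at or above
              fraction * (feature_count at the main peak).
    """
    candidates = [p for p in points if p[0] >= error_cutoff]
    if not candidates:
        raise ValueError("no buckets at or above error_cutoff")

    # The main peak is the candidate with the largest feature count.
    peak = max(range(len(candidates)), key=lambda i: candidates[i][1])
    threshold = fraction * candidates[peak][1]

    # Walk left and right from the peak while counts stay above threshold.
    lo = peak
    while lo > 0 and candidates[lo - 1][1] >= threshold:
        lo -= 1
    hi = peak
    while hi < len(candidates) - 1 and candidates[hi + 1][1] >= threshold:
        hi += 1
    return candidates[lo][0], candidates[hi][0]
```

The two returned values are candidates for --MinHash.minBucketSize and --MinHash.maxBucketSize. If the histogram is noisy or the main peak is not clearly separated from the error peak, the result can be misleading, so check it against the plot before using it.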

Also, always feel free to post Shasta-related questions here, particularly if you are unable to optimize assembly results to your satisfaction.

paoloczi commented 2 years ago

I am closing this due to lack of additional discussion.