chanzuckerberg / shasta

[MOVED] Moved to paoloshasta/shasta. De novo assembly from Oxford Nanopore reads

A question about ONT data from Guppy v1.8.1 #173

Closed. dxkeac closed this issue 4 years ago.

dxkeac commented 4 years ago

Hi, I have older ONT data base called with Guppy v1.8.1, and I didn't find assembly parameters for this Guppy version in the configuration files. I tried two approaches. The first follows https://github.com/chanzuckerberg/shasta/blob/master/conf/Nanopore-Dec2019.conf:

shasta-Linux-0.5.1 \
    --input ${ONT} \
    --MinHash.minBucketSize=5 \
    --MinHash.maxBucketSize=30 \
    --MinHash.minFrequency=5 \
    --Align.minAlignedFraction=0.4 \
    --Assembly.consensusCaller=Modal

This worked and assembled a human genome.

The other command is:

shasta-Linux-0.5.1 \
    --input ${ONT} \
    --Reads.minReadLength=5000 \
    --Assembly.consensusCaller=Modal

This second run failed with the following error:

2020-Jul-28 21:10:43.295387 Disjoint set computation begins.
2020-Jul-28 21:13:18.145472 Disjoint set computation completed.
2020-Jul-28 21:13:18.149990 Finding the disjoint set that each oriented marker was assigned to.
/opt/gridview//pbs/dispatcher/mom_priv/jobs/1720830.admin1.SC: line 28: 28142 Killed $dir/shasta-Linux-0.5.1 --input ${LCL5} --Reads.minReadLength=5000 --Assembly.consensusCaller=Modal

How should I adjust the parameters to assemble this ONT data (~100X coverage; mean read length ~11 kb; median read length ~6.7 kb) from Guppy v1.8.1?

Thank you!

paoloczi commented 4 years ago

Given that we don't have a Bayesian model for Guppy 1.8.1, using the Modal caller as you did is a reasonable choice. You might get better sequence quality by using the Bayesian model for Guppy 2.3.1, that is, by replacing --Assembly.consensusCaller Modal with --Assembly.consensusCaller Bayesian:guppy-2.3.1-a.

The second assembly ran out of memory, probably because of the extra coverage due to using a shorter read length cutoff. Unfortunately we don't have control over the unsatisfactory message that gets written in that situation.

Since your reads are shorter than reads generated by current nanopore technology, it does make sense for you to attempt to increase coverage by reducing the read length cutoff. However, you will need a machine with more memory. Is this for a human assembly? What was the total amount of coverage kept in this assembly? It should be reported in the log output (stdout) as "Total number of raw bases". How much memory does your machine have? A Shasta assembly typically requires 5 to 8 bytes per input base (counting only reads kept in the assembly), so for a human assembly at 100X you would need a machine with about 2 TB of memory.
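
To put rough numbers on that estimate, here is a minimal back-of-the-envelope sketch. The 5 to 8 bytes per input base figure comes from the paragraph above; the genome size and coverage values are assumptions you should replace with your own:

# Rough Shasta memory estimate (sketch only, not an exact requirement).
GENOME_SIZE=3100000000       # ~3.1 Gb human genome (assumption)
COVERAGE=100                 # approximate coverage kept in the assembly
BYTES_PER_BASE=8             # upper end of the 5 to 8 bytes per input base range
echo "$(( GENOME_SIZE * COVERAGE * BYTES_PER_BASE / 1000000000000 )) TB (approximate)"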

If you reduce the read length cutoff, you may also consider correspondingly reducing the minimum number of aligned markers, controlled by --Align.minAlignedMarkerCount. Otherwise the shorter reads will not have much chance to participate in many useful alignments. With typical options such as the ones you are using, the default of 100 aligned markers corresponds to roughly 3 kb of alignment length, and with a read length cutoff of 5 kb you may want to reduce --Align.minAlignedMarkerCount to 50.
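
As a rough illustration of that proportionality, a sketch using only the 100 markers per ~3 kb figure above; the 1.5 kb target alignment length is simply an assumption chosen to match the suggested value of 50:

# Sketch: scale --Align.minAlignedMarkerCount with the expected alignment length.
DEFAULT_MARKERS=100
DEFAULT_ALIGNMENT_BP=3000     # ~3 kb of alignment for 100 markers (from above)
TARGET_ALIGNMENT_BP=1500      # assumed target for a 5 kb read length cutoff
echo "suggested --Align.minAlignedMarkerCount: $(( DEFAULT_MARKERS * TARGET_ALIGNMENT_BP / DEFAULT_ALIGNMENT_BP ))"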

For a large assembly you may also want to consider the higher-performance options --memoryMode filesystem --memoryBacking 2M; see the Shasta documentation for more information. However, those options require root access.
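
Purely as an illustration, an invocation using those options might look like the sketch below. The two memory flags are the ones named above; the use of sudo and the remaining parameters are assumptions about your setup:

# Sketch only: large-page memory options added to a Shasta run (needs root).
sudo ./shasta-Linux-0.5.1 \
    --input ${ONT} \
    --memoryMode filesystem \
    --memoryBacking 2M \
    --Assembly.consensusCaller Bayesian:guppy-2.3.1-a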

dxkeac commented 4 years ago

Thank you very much for your help. It is indeed a human genome. Here is some of the read and assembly log information from the first run I described yesterday:

2020-Jul-28 10:52:36.526341 Done loading reads from 1 files.
Read loading took 2356.7s.
Discarded read statistics for all input files:
    Discarded 0 reads containing invalid bases for a total 0 valid bases.
    Discarded 20817159 short reads for a total 66254438039 bases.
    Discarded 2 reads containing repeat counts 256 or more for a total 69939 bases.
Read statistics for reads that will be used in this assembly:
    Total number of reads is 13101469.
    Total number of raw bases is 311359782601.
    Average read length is 23765.3 bases.
    N50 for read length is 27855 bases.
......
2020-Jul-28 16:02:31.550656 Assembled a total 2799479438 bases for 5898 assembly graph edges of which 2949 where assembled.
The assembly graph has 6070 vertices and 5898 edges of which 2949 were assembled.
Total length of assembled sequence is 2799479438
N50 for assembly segments is 16351206

The information about the discarded reads and the total number of bases is as follows:

"Reads discarded on input, total": "Reads": 20817161, "Bases": 66254507978
"Fraction of reads discarded on input over total present in input files": "Reads": 0.6137, "Bases": 0.1755

Although more than half of the reads (61.37%) were discarded, they account for only about 17.5% of the bases. I don't know whether --Reads.minReadLength=10000 should be changed to --Reads.minReadLength=5000.

I don't have root permission. Are the following parameters appropriate?

shasta-Linux-0.5.1 \
    --input ${ONT} \
    --Reads.minReadLength=5000 \
    --MinHash.minBucketSize=5 \
    --MinHash.maxBucketSize=30 \
    --MinHash.minFrequency=5 \
    --Align.minAlignedFraction=0.4 \
    --Assembly.consensusCaller=Bayesian:guppy-2.3.1-a \
    --Align.minAlignedMarkerCount=50

paoloczi commented 4 years ago

Based on the information you posted, that first assembly used 311 Gb of coverage, which means it was actually done at over 100X coverage despite having discarded 66 Gb of reads shorter than 10 Kb. As a result, your assembly results are reasonably good, having assembled 2799 Mb with an N50 of 16.4 Mb. This is not as good as what we got in our paper, but not too far from that either.

It might or might not be possible to improve on this result, considering that you are using older reads, possibly of lower quality than the ones we used in our paper.

Given that you already have high coverage, larger than the typical 60X to 80X for which we have optimized Shasta, I would keep --Reads.minReadLength and --Align.minAlignedMarkerCount unchanged and not decrease them. In fact, it may be appropriate to increase --Reads.minReadLength a little to bring coverage down from 100X to be closer to that Shasta "sweet spot" at 60X to 80X. To decide the appropriate value, you can look at Binned-ReadLengthHistogram.csv (column labeled CumulativeBases).
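
For example, one quick way to eyeball that file is sketched below. The file name and the CumulativeBases column come from the point above, but the exact column layout, including a "Bin" column, is an assumption, so adjust the lookup if your header differs:

# Print each read length bin next to CumulativeBases to see where cumulative
# coverage falls into the 60X-80X range (roughly 180-250 Gb for a human genome).
awk -F, 'NR==1 { for (i = 1; i <= NF; i++) col[$i] = i; next }
         { print $col["Bin"], $col["CumulativeBases"] }' Binned-ReadLengthHistogram.csv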

To make other parameter changes, it is important to know whether this assembly was "fragmented" (most assembled segments begin/end at a dead end, that is, they are not connected to another segment in the assembly graph) or "messy" (most assembled segments begin/end at a connection with another assembled segment). You can find out by looking at the Shasta GFA output with Bandage. If the assembly is "messy", tightening alignment criteria (for example by increasing --Align.minAlignedMarkerCount and --Align.minAlignedFraction) might help. But if the assembly is "fragmented", the opposite is true. From the information you attached I suspect your assembly is "messy", but this would have to be verified.

If the assembly is "messy", using option --Assembly.detangleMethod 1 might also help. This turns on a basic version of detangling of the assembly graph.

To figure out if --Assembly.consensusCaller Bayesian:guppy-2.3.1-a gives better results than --Assembly.consensusCaller Modal you will have to do two runs differing by only that parameter and do some analysis of assembled sequence quality, for example by mapping to the reference genome or some other genome appropriate for your data. You could do this test on a subset of the genome, for example by separating reads for a single chromosome (by mapping to a portion of the reference genome) and using only those reads for a test assembly. We usually find that the Bayesian caller assembles much better sequence than the Modal caller, but in your case this might not be true because you are using a Bayesian model optimized for a different version of Guppy. It is possible, of course, that the Bayesian model for Guppy 2.3.1 also works reasonably well for your Guppy 1.8.1 reads.
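
One possible way to extract such a per-chromosome read subset and run the two test assemblies is sketched below. It assumes minimap2 and samtools are available, that the chosen reference chromosome is in chr20.fa, and that your reads are in reads.fastq, none of which comes from this thread; --assemblyDirectory is only used to keep the two runs in separate directories:

# Keep only reads that map to the chosen chromosome (0x904 drops unmapped,
# secondary and supplementary records), then assemble the subset twice,
# changing only the consensus caller between the two runs.
minimap2 -ax map-ont chr20.fa reads.fastq | samtools fastq -F 0x904 - > chr20_reads.fastq
shasta-Linux-0.5.1 --input chr20_reads.fastq --Assembly.consensusCaller Modal --assemblyDirectory RunModal
shasta-Linux-0.5.1 --input chr20_reads.fastq --Assembly.consensusCaller Bayesian:guppy-2.3.1-a --assemblyDirectory RunBayesian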

I also suggest rerunning the base calling for these reads using a current Guppy version, if possible. Recent versions of Guppy generate reads of much better quality.

I may be able to give additional input if you also post AssemblySummary.html and LowHashBucketHistogram.csv for your successful assembly.

paoloczi commented 4 years ago

I am closing this due to lack of discussion. Feel free to reopen it or to create a new issue if additional questions emerge.