chanzuckerberg / shasta

[MOVED] Moved to paoloshasta/shasta. De novo assembly from Oxford Nanopore reads

Shasta on 30 X data #289

Closed ziphra closed 2 years ago

ziphra commented 2 years ago

Hello, Thank you for Shasta.

I have been trying to assemble a human genome from Nanopore ultra-long reads with a mean coverage of 30X.

According to #7, I tweaked the parameters for my low coverage data.

I first ran Shasta in its default mode: stdout1.log AssemblySummary1.txt

shasta \
    --input /home/euphrasie/Documents/HG002_PAG07506/HG002_PAG07506.fastq \
    --config Nanopore-UL-Jan2022 \
    --threads 16 \
    --memoryMode filesystem \
    --memoryBacking disk \
    --Assembly.mode 0 \
    --assemblyDirectory /home/euphrasie/Documents/HG002_PAG07506/ShastaRun

In another run I scaled the following parameters proportionally to my coverage (halving their default values): MinHash.maxBucketSize, MarkerGraph.minCoverage, MarkerGraph.maxCoverage, and MarkerGraph.edgeMarkerSkipThreshold, but the Assembly.fasta file was empty.
stdout2.log AssemblySummary2.txt

shasta \
    --input /home/euphrasie/Documents/HG002_PAG07506/HG002_PAG07506.fastq \
    --config Nanopore-UL-Jan2022 \
    --threads 16 \
    --memoryMode filesystem \
    --memoryBacking disk \
    --Assembly.mode 0 \
    --MinHash.maxBucketSize 5 \
    --MarkerGraph.minCoverage 5 \
    --MarkerGraph.maxCoverage 50 \
    --MarkerGraph.edgeMarkerSkipThreshold 50 \
    --assemblyDirectory /home/euphrasie/Documents/HG002_PAG07506/ShastaRun2
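The proportional scaling described above can be sketched as follows. This is illustrative only: the default values shown are placeholders, not the actual Nanopore-UL-Jan2022 settings, and the rounding choice is a judgment call.

```python
# Sketch: scale coverage-dependent parameters by the ratio of actual to
# nominal coverage. The default values below are placeholders, NOT the
# real Nanopore-UL-Jan2022 settings.

def scale_parameters(defaults, my_coverage, nominal_coverage):
    """Scale each parameter proportionally to the coverage ratio."""
    ratio = my_coverage / nominal_coverage
    return {name: max(1, round(value * ratio)) for name, value in defaults.items()}

# Example: coverage at half the nominal value halves each parameter.
print(scale_parameters(
    {"MarkerGraph.minCoverage": 10, "MarkerGraph.maxCoverage": 100},
    my_coverage=30, nominal_coverage=60))
```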

I did another run changing only MarkerGraph.minCoverage and MarkerGraph.maxCoverage. I did not see this run through, as it was taking too long compared to Shasta's default mode (more than 60 hours) and I needed to free some computing resources to continue my work.

I ran it a second time with the same parameters after stopping the first attempt, but I had deleted that first run (to free space), so the log is incomplete. When I stopped the first attempt after ~60 hours, a couple of "Alignment candidates after lowhash" lines had been added at the end of the log, but the second time I did not wait until that point... stdout3.log

shasta \
    --input /home/euphrasie/Documents/HG002_PAG07506/HG002_PAG07506.fastq \
    --config Nanopore-UL-Jan2022 \
    --threads 16 \
    --memoryMode filesystem \
    --memoryBacking disk \
    --Assembly.mode 0 \
    --MarkerGraph.minCoverage 5 \
    --MarkerGraph.maxCoverage 50 \
    --assemblyDirectory /media/euphrasie/DATA/ShastaRun3bis

Was it normal for this last run to take that long? And why did the second run not produce any assembly?

Many thanks in advance,

ziphra

paoloczi commented 2 years ago

For your first assembly, AssemblySummary1.txt shows that you assembled 2828 Mb of sequence with an N50 of 38 Mb. This is an excellent result at this low coverage, and it will be hard to improve on it, particularly considering that the assembly used about 61 Gb of coverage, corresponding to only about 20x. Another 22 Gb of coverage present in the input was discarded, mostly because it consisted of reads shorter than the 50 Kb read length cutoff used in the Nanopore-UL-Jan2022 assembly configuration. You could consider reducing that read length cutoff a bit to get more coverage, but that would also reduce the read N50, and the overall effect is impossible to predict.
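The trade-off between the read length cutoff and retained coverage can be explored with a quick sketch (a hypothetical helper, not part of Shasta; the genome size and read lengths below are illustrative):

```python
# Sketch: estimate coverage retained after applying a minimum read length
# cutoff. All numbers here are illustrative.

HUMAN_GENOME_SIZE = 3.1e9  # approximate, in bases

def retained_coverage(read_lengths, min_read_length, genome_size=HUMAN_GENOME_SIZE):
    """Coverage contributed by reads at or above the length cutoff."""
    kept_bases = sum(length for length in read_lengths if length >= min_read_length)
    return kept_bases / genome_size

# Toy read set: compare the UL cutoff (50 kb) with a lower one.
reads = [30_000, 45_000, 60_000, 80_000, 120_000]
for cutoff in (50_000, 30_000):
    print(f"cutoff {cutoff}: {retained_coverage(reads, cutoff):.2e}x")
```

Lowering the cutoff raises the retained coverage but pulls shorter reads into the assembly, reducing the read N50, which is the trade-off described above.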

Can I ask what you were trying to improve in the other assemblies?

In your second assembly, decreasing --MinHash.maxBucketSize to 5 resulted in no alignment candidates being found by the MinHash algorithm. This is the cause for the empty assembly. To experiment with --MinHash.minBucketSize and --MinHash.maxBucketSize it is often useful to look at LowHashBucketHistogram.csv (see #285 for more details).

For the third assembly, what you are seeing is probably a symptom of insufficient memory. By using --memoryBacking disk you are letting Shasta store assembly data in memory-mapped files on disk, which can slow you down a lot, particularly if your machine does not use a fast SSD as storage. For diagnosing performance issues, it may also be useful to look at performance.log, which contains more detailed messages with time stamps.

From the output of your first, successful assembly, I see that peak virtual memory usage was 405 GB. If you are running on a machine with less memory than that, you can experience substantial slowdowns. On the other hand, if you have a machine with 512 GB or more, I suggest one of the following:

Finally, I want to mention that Shasta 0.10.0, which was released yesterday, includes a new configuration Nanopore-Human-SingleFlowcell-May2022 specialized for human assemblies at low coverage. However that configuration is not for Ultra-Long (UL) reads.

In summary, it seems to me that you can be satisfied with your first assembly, considering the low amount of available coverage.

ziphra commented 2 years ago

Hi,

Thanks for the prompt, detailed response.

Can I ask what you were trying to improve in the other assemblies?

Concerning what I was trying to improve in my other assemblies, I guess I wanted to make sure I was getting the most out of my data. But I am glad to hear that the assembly produced excellent results according to you. I wasn't sure what to expect as it was my first attempt at long read de novo assembly.

it is often useful to look at LowHashBucketHistogram.csv

The FeatureCount versus BucketSize plot shows a main peak around 9. I figure that I should set --MinHash.minBucketSize and --MinHash.maxBucketSize to 5 and 15 for my data.

[plot: LowHashBucketHistogram]
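A rough way to pick these bounds from LowHashBucketHistogram.csv programmatically (a sketch; the column names BucketSize and FeatureCount are assumed from the plot axes, so check them against the file's actual header):

```python
# Sketch: find the peak of the FeatureCount-vs-BucketSize histogram and
# center a bucket size interval on it. Column names are assumptions.
import csv

def peak_bucket_size(rows):
    """rows: iterable of (bucket_size, feature_count) pairs."""
    return max(rows, key=lambda row: row[1])[0]

def suggested_bounds(rows, spread=5):
    """Center an interval on the peak, e.g. a peak at 9 gives (4, 14)."""
    peak = peak_bucket_size(rows)
    return max(1, peak - spread), peak + spread

def load_histogram(path):
    """Load (bucket_size, feature_count) pairs from the CSV."""
    with open(path) as f:
        return [(int(row["BucketSize"]), int(row["FeatureCount"]))
                for row in csv.DictReader(f)]
```

With a peak around 9, suggested_bounds gives a window close to the 5-15 one discussed here.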

By using --memoryBacking disk you are letting Shasta store assembly data on memory mapped files on disk, which can slow you down a lot, particularly if your machine does not use fast SSD as storage.

Unfortunately, I only have access to a machine with 64 GB of RAM and another with 128 GB; that's why I used --memoryBacking disk. I am also just realising that my first assembly ran on an SSD (while using --memoryBacking disk), whereas the third one ran on a hard disk. That would explain the long running time.

Happy to hear that Shasta 0.10.0 was just released. I will test it out and compare it with my first assembly.

Many thanks for all the insights and tips.

ziphra

paoloczi commented 2 years ago

You are right, based on the plot you posted, --MinHash.minBucketSize 5 and --MinHash.maxBucketSize 15 is a good choice. However, in your second assembly these options were set as follows:

• You did not explicitly set --MinHash.minBucketSize on the command line, which means that it stayed at 10, the value set via --config Nanopore-UL-Jan2022 (use shasta --command listConfiguration --config Nanopore-UL-Jan2022 to see that).
• You used --MinHash.maxBucketSize 5.

This means that the bucket size interval in use was empty (a minimum value of 10 and a maximum value of 5), which explains why no alignment candidates were found. I will add a check for an empty bucket size interval and stop the assembly if that situation is detected.
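Such a guard could look like the following sketch (illustrative Python, not Shasta's actual C++ implementation):

```python
# Sketch: reject an empty bucket size interval up front, before the
# MinHash/LowHash iterations run. Illustrative, not Shasta source code.

def check_bucket_size_interval(min_bucket_size, max_bucket_size):
    """Raise if no bucket size can satisfy min <= size <= max."""
    if min_bucket_size > max_bucket_size:
        raise ValueError(
            f"Empty bucket size interval: --MinHash.minBucketSize "
            f"{min_bucket_size} > --MinHash.maxBucketSize {max_bucket_size}; "
            f"no alignment candidates can be generated.")

check_bucket_size_interval(5, 15)   # valid interval, passes silently
# check_bucket_size_interval(10, 5) would raise ValueError
```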

Given the small machine sizes you are using, the memory options you used, --memoryMode filesystem --memoryBacking disk, are the only viable choice. Using the default options would run faster if you had enough memory, but with insufficient memory the assembly would stop with an error message. In fact, I am surprised that you were able to run a human assembly in only 7 hours on such a small machine (these days, 128 GB of memory is available even on a laptop). If your department's budget permits, you could consider using one of the cloud computing providers. Machines with hundreds of GB of memory are available for a few dollars per hour, and with the appropriate memory options your assembly would run in about two hours on one of them.

Depending on your application, you may want to also consider using Shasta phased diploid assembly, which assembles the two haplotypes separately in a good fraction of the genome (but not all of it). If you try this, I suggest doing it using Shasta 0.10.0, which has significant improvements in this area.

ziphra commented 2 years ago

Thanks for the insights.

It is not fun to be limited by the machine I am using, but at least it does the job - for now.

So I ran another assembly, setting the bucket sizes correctly and the other coverage-dependent parameters as follows:

shasta \
    --input /home/euphrasie/Documents/HG002_PAG07506/HG002_PAG07506.fastq \
    --config Nanopore-UL-Jan2022 \
    --threads 16 \
    --memoryMode filesystem \
    --memoryBacking disk \
    --Assembly.mode 0 \
    --MinHash.minBucketSize 5 \
    --MinHash.maxBucketSize 10 \
    --MarkerGraph.minCoverage 5 \
    --MarkerGraph.maxCoverage 50 \
    --MarkerGraph.edgeMarkerSkipThreshold 50 \
    --assemblyDirectory /home/euphrasie/Documents/HG002_PAG07506/ShastaRunLowX 

It assembled 2860 Mb with a 49 Mb N50, which I believe represents a better assembly than the first one. AssemblySummary.txt

However, I was still using Shasta 0.9.0, and will now continue with Shasta 0.10.0 and am curious to try Shasta phased diploid assembly.

paoloczi commented 2 years ago

Wow - a 49 Mb N50 at 20X coverage (which is what the assembly sees after discarding the shorter reads) is something I have never seen in a Shasta assembly. This is probably thanks to the Ultra-Long reads.

For haploid assemblies, the only improvement in Shasta 0.10.0 is a new Bayesian caller for repeat counts, which will give better sequence accuracy but will not significantly affect large-scale assembly metrics such as the amount of assembled sequence.

If you want to try phased assembly, Shasta 0.10.0 includes two assembly configurations that approximate your situation but don't quite match it exactly:

I suggest trying the first one first, then the second one. It may also be necessary to tweak assembly parameters to combine elements of the two configurations. If you end up trying this, I can help with the process. If you like, feel free to contact me privately at the e-mail address indicated in our 2020 paper, cited on the Shasta GitHub page (README file).

See here for more information on Shasta phased assembly and the output it creates.

ziphra commented 2 years ago

Probably thanks to the Ultra-Long reads.

Also, the data I have been using is an Oxford Nanopore open dataset, which might help too?

I will definitely reach out to you as I am moving forward in assembly testing.

Thank you

ziphra commented 2 years ago

Hi,

you said previously that,

• You did not explicitly set --MinHash.minBucketSize in the command line, which means that it stayed at 10, the value set via --config Nanopore-UL-Jan2022 (use shasta --command listConfiguration --config Nanopore-UL-Jan2022 to see that).
• You used --MinHash.maxBucketSize 5. This means that the bucket size interval in use was empty (a minimum value of 10 and a maximum value of 5), which explains why no alignment candidates were found. I will add a check for an empty bucket size interval and stop the assembly if that situation is detected.

However, from --config Nanopore-UL-Jan2022 it seems that --MinHash.minBucketSize is set to 0 and --MinHash.maxBucketSize to 10. So I believe that my bucket size interval was in fact 0 to 5, but since low-population buckets are mainly due to errors (according to #285), that could also explain the empty assembly.

paoloczi commented 2 years ago

You may be correct, but I don't see that. The output of shasta --command listConfiguration --config Nanopore-UL-Jan2022 includes the following:

[MinHash]
minBucketSize = 10
maxBucketSize = 50

If you see otherwise, please give me more details as it could be an indication of a problem or bug, and I want to make sure to resolve it.

ziphra commented 2 years ago

Sorry, you are right. I was seeing otherwise because I typed the wrong command and it gave me the default parameters.

paoloczi commented 2 years ago

I am closing this due to lack of discussion. Feel free to create a new issue if new discussion topics emerge.