chanzuckerberg/shasta

--Reads.minReadLength #250

Closed. Saeideh-Ashouri closed this issue 3 years ago.

Saeideh-Ashouri commented 3 years ago

Hi. Thank you for developing such a nice program.

Regarding the minimum read length (10,000 by default): I was wondering whether it causes many reads in my data to be lost, since for many of my samples (Nanopore data) the average read length is around 6000 bp. If so, should I deactivate it or assign a smaller value to it?

Thank you in advance,

paoloczi commented 3 years ago

Thank you for the positive comments.

Regarding the read length cutoff, and more generally all other assembly parameters and options: people use Shasta in a variety of situations, and it is generally impossible to provide assembly parameters that are optimal in most or all cases. Specifically for the read length cutoff: you must be working with older Nanopore reads, because reads generated by current protocols are much longer, with an N50 around 30 Kb to 50 Kb, and under those conditions a 10 Kb cutoff only results in losing a few percent of total coverage.

More generally, assembly conditions can vary widely depending on the sequencing technology and base caller used, the distribution of read lengths, the amount of coverage, and genome characteristics such as heterozygosity and repeat content. Because of this variability, you will find this message in the log output of every Shasta assembly:

Default values of assembly parameters are not recommended for any specific application and mostly reflect approximate compatibility with previous releases. See the shasta/conf or shasta-install/conf directory for sample configuration files containing assembly parameters for specific applications.

In your case it would certainly be a good idea to reduce the read length cutoff. However, additional changes to assembly parameters may also be required to obtain an optimal assembly. I will be happy to help with the process. If your assembly is not satisfactory, please attach AssemblySummary.html here and I can make suggestions.

I also want to encourage you, and anybody who has developed sets of assembly parameters that are known to work well under specific assembly conditions, to package your assembly parameters into a Shasta configuration file and provide it in a GitHub Pull Request (PR), or simply attach it to an issue, so it can be added to shasta/conf and made available to others. See here for information about Shasta configuration files.

Saeideh-Ashouri commented 3 years ago

Dear @paoloczi

Thank you so much for your prompt reply. It seems that the mean read length differs between samples in my data. It would be great if you could kindly help me create a configuration file for a sample with a mean read length of 9,778 bp (based on the report produced by NanoPlot). What minimum read length would you recommend, and what other parameters should I change besides the minimum read length? I also have another question: I replaced the Guppy version in the configuration file with the version we used for our data, Guppy 3.2.8, but I got an error like "this version is not Bayesian". I would like to know how I should set the version of the base caller I used.

Looking forward to hearing from you and thank you in advance, Saeideh

paoloczi commented 3 years ago

You are right that there can be variability in the read length distribution, depending on the sequencing protocol used. The mean read length is not very informative, as it can be skewed by a large number of very short reads. A more informative metric is the read N50. I don't know whether NanoPlot reports it, but there are bioinformatics tools that can compute the N50 of reads in a FASTA or FASTQ file (seqkit is one that I know of).
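For example, a minimal way to get it with seqkit (assuming the reads are in reads.fastq.gz; the -a option adds extended statistics, including N50, to the report):

seqkit stats -a reads.fastq.gz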

Regarding the Bayesian model: we don't have Bayesian models for every Guppy version. If your reads were created by Guppy 3.2.8, I suggest using the Bayesian model for Guppy 3.0.5.

Given the above, I suggest that you run an assembly using the following options (replacing as necessary the name/location of the configuration file and the file containing the reads):

shasta --config Nanopore-Sep2020.conf --input YourFastqOrFastaFile --Reads.minReadLength 6000 --Assembly.consensusCaller Bayesian:guppy-3.0.5-a

When the assembly completes, please post here the following two files:

- AssemblySummary.html
- Binned-ReadLengthHistogram.csv

You will have to create a zip or tar file containing these two files, because GitHub does not support uploading files with arbitrary extensions.
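For example, either of these commands, run in the assembly directory, will do (standard zip/tar usage, nothing Shasta-specific):

zip shasta-results.zip AssemblySummary.html Binned-ReadLengthHistogram.csv
tar -czf shasta-results.tar.gz AssemblySummary.html Binned-ReadLengthHistogram.csv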

With that information, plus the approximate expected genome size, I may be able to give additional suggestions for improving assembly results.

Saeideh-Ashouri commented 3 years ago

Dear @paoloczi Thank you so much for your reply.

Based on seqkit, the N50 of this sample is 16,940 bp, and based on NanoPlot it is 20,218 bp. The total number of bases for this sample (based on NanoPlot) is 84,966,305,912.

I ran Shasta using the config parameters you mentioned, and I have attached the AssemblySummary.html and Binned-ReadLengthHistogram.csv files here.

Thank you in advance for your time and consideration.

shasta-results.zip

paoloczi commented 3 years ago

The read N50 of 17-20 Kb is very reasonable to work with. You had a total of 39 Gb of reads, of which 7 Gb were discarded because they were shorter than 6 Kb, so the assembly used 32 Gb of coverage. This could be increased a bit by decreasing the read length cutoff further, but not by much, and lowering the cutoff too far can cause other problems.

The assembly has 2.1 Gb of sequence, but the N50 of the assembled sequence is only 112 Kb. This is a symptom of insufficient coverage. Do you have an estimate of your genome size? If you assembled 2.1 Gb at low coverage, your genome size might be around 3 Gb (you did not say whether this is a human genome). Assuming that, the 32 Gb of available coverage is only 11x, which is insufficient for an assembly with good contiguity (N50).

Please confirm the estimate of genome size. Assuming my reasoning above is approximately correct, I don't think you will be able to get a reasonable de novo assembly from these reads. You will have to do more sequencing to get more reads. If these reads were from a single flowcell, I suggest adding a second flowcell, or perhaps even two if possible.

However, depending on your application you might not care about assembly contiguity. If this is the case, the 2.1 Gb of assembled sequence may be usable for some applications, perhaps after a round of assembly polishing using one of the many available polishers.

rlorigro commented 3 years ago

For future reference, a colleague and I also created this simple repository that computes detailed length information: https://github.com/rlorigro/great_lengths

It relies on htslib for the indexing step, which is generally very fast.

Saeideh-Ashouri commented 3 years ago

Dear Paolo (@paoloczi )

Thank you so much for interpreting the results; it was quite helpful. The data is a human genome, the coverage is 15x, and as far as I know it is from a single flow cell. Now I have two questions: 1) It seems that almost all our samples have similar characteristics (N50, average read length, and coverage). Given that, what configuration do you recommend I use for the assembly? The configuration files say that the coverage of the data should be between 40x and 80x. 2) Considering that it might not be possible for us to add a second or third flow cell, do you think these results can be used for calling structural variants (after assembly polishing, of course)?

P.S.: We also have short-read (Illumina) data for these samples.

And another question I forgot to ask before: why did you ask me to use Nanopore-Sep2020.conf and not Nanopore-Dec2019.conf?

Thank you in advance, Saeideh

Saeideh-Ashouri commented 3 years ago

> For future reference, a colleague and I also created this simple repository that computes detailed length information: https://github.com/rlorigro/great_lengths
>
> It relies on htslib for the indexing step, which is generally very fast.

Dear @rlorigro

Thank you so much for your comment. I will definitely try it.

Best, Saeideh

paoloczi commented 3 years ago

Actually, for a human genome, the coverage of your reads is lower than your estimate of 15x. According to the Shasta assembly summary, the assembly used 32 Gb of reads and discarded 7 Gb of reads shorter than 6 Kb, so the total amount of sequence in your reads was 39 Gb. For a 3.1 Gb genome size (including centromeres), this is 13x coverage. And the assembly only used the 32 Gb of reads longer than 6 Kb, so it was working at 10x. We have never been able to obtain assemblies with acceptable contiguity at such low coverage.
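Spelling out the arithmetic:

39 Gb / 3.1 Gb ≈ 13x (total coverage, all reads)
32 Gb / 3.1 Gb ≈ 10x (usable coverage, reads longer than 6 Kb)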

On your numbered questions:

  1. I am a bit surprised that your sequencing protocol only yielded 32 Gb per flowcell in reads longer than 6 Kb (even less if we used the standard 10 Kb cutoff). With the protocols used by the sequencing group at UC Santa Cruz, they get around 20x per flowcell in reads longer than 10 Kb. With the yield you are getting, for a good quality assembly you would need 4 to 6 flowcells, but you may be able to get a reasonable assembly with 2 or 3.

  2. It is possible that you will be able to extract at least some homozygous structural variants from the assembly you have, because the assembly N50 of 113 Kb is much longer than the typical long insertion/deletion you will be interested in. But things will be harder for heterozygous variants, because with the options you used Shasta creates a haploid assembly after discarding the weaker side of each bubble. So if you want to detect heterozygous structural variants, you would have to tell Shasta to do less aggressive bubble removal. You could achieve that via --MarkerGraph.simplifyMaxLength 3, for example (see the example command after this list). See here for more information on bubble removal in Shasta. But even in that case you would need some non-trivial processing of the GFA output, and you would probably only be able to extract a portion of the heterozygous structural variants. We found that, even at higher coverage, we can typically only recover about 65% of heterozygous structural variants. If you want to try something along these lines I can give you more information, but you could perhaps start with homozygous structural variants only, which simply require mapping the current assembly to a reference - probably after polishing, like you said. I am working on Shasta improvements that, if successful, will make it possible to obtain separate assemblies for each haplotype.
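For example, a command along these lines (a sketch only: it adds the less aggressive bubble removal to the options already discussed and assumes the same configuration and input files as before):

shasta --config Nanopore-Sep2020.conf --input YourFastqOrFastaFile --Reads.minReadLength 6000 --Assembly.consensusCaller Bayesian:guppy-3.0.5-a --MarkerGraph.simplifyMaxLength 3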

Nanopore-Sep2020.conf uses a new adaptive method to create the read graph, --ReadGraph.creationMethod 2, developed by @rlorigro, which is generally less sensitive to assembly parameters and as a result tends to give good assembly results in a variety of situations. It also uses other Shasta functionality that was not yet available when Nanopore-Dec2019.conf was created. So for your situation it is best to use Nanopore-Sep2020.conf, but overriding the consensus caller with --Assembly.consensusCaller Bayesian:guppy-3.0.5-a (and here I should mention that @rlorigro is also the author of all of the Shasta Bayesian models). I can also mention that newer versions of Guppy have achieved tremendous improvements in read quality. If you redo base calling of your reads using a current Guppy version (3.6.0 or newer), the quality of your reads will improve significantly. And if you get new data, make sure a current Guppy version is used for it. Running a Shasta assembly with mixed reads is tricky, so in that case you would want to redo the base calling for your current reads anyway.
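For reference, re-basecalling typically looks something like this (a hypothetical sketch, not taken from this thread; the paths and the flow cell and kit identifiers are placeholders you would replace with your own):

guppy_basecaller --input_path fast5/ --save_path rebasecalled/ --flowcell FLO-PRO002 --kit SQK-LSK109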

Good luck and feel free to ask more questions as your project progresses.

rlorigro commented 3 years ago

UCSC generally uses PromethION flow cells, which have greater throughput than MinION flow cells, so that could explain the difference if @Saeideh-Ashouri is using a MinION.

Saeideh-Ashouri commented 3 years ago

Dear Paolo and @rlorigro

Thank you so much for your kind help. Our data was produced on a PromethION. Currently I am redoing the base calling using the latest Guppy version you recommended.

Thank you again, Best, Saeideh

Saeideh-Ashouri commented 3 years ago

Dear Paolo

After some investigation, I realized that we had two flow cells for each sample (sorry for the lack of information in my previous comments; I am quite new to this project). So I concatenated all the FASTQ files and ran Shasta on them. The current data was base called using Guppy 3.2.8 (I have asked our collaborators to redo the base calling with a newer version of Guppy but haven't heard back from them yet). This time I set the minimum read length to 2000 bp, 3000 bp, 4000 bp, and 5000 bp, and I used Nanopore-Sep2020.conf. I attach the results here as a zip file. Would you kindly take a look at them and let me know what you think? The results look much better this time, but I would like to know your opinion about using them for calling SVs.
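For reference, a sweep like this can be scripted along the following lines (a sketch: it assumes the concatenated reads are in all_reads.fastq, keeps the consensus caller override recommended earlier, and writes each run to its own directory via --assemblyDirectory, which must not already exist):

for len in 2000 3000 4000 5000; do
    shasta --config Nanopore-Sep2020.conf --input all_reads.fastq \
        --Assembly.consensusCaller Bayesian:guppy-3.0.5-a \
        --Reads.minReadLength $len \
        --assemblyDirectory run-$len
done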

Thank you so much for your time and consideration, Saeideh

Shasta-compare-different-read-lengths.zip

paoloczi commented 3 years ago

Here is, for reference, a summary of some assembly metrics for your 4 assemblies:

[image: summary table of assembly metrics for the four assemblies]

Coverage, at 25-27x, is now sufficient to get a reasonable assembly, but still quite low. These assembly metrics are typical of what we are usually able to achieve at such low coverage, so I don't think you will be able to improve results significantly by optimizing assembly options.

These assemblies are probably usable for SV detection, and based on previous experience I expect they will do quite well for homozygous structural variants. However, for heterozygous structural variants you will be limited by the fact that the assembly is haploid: at each heterozygous locus, Shasta assembles only one of the two haplotypes - presumably the one with higher coverage. If the haplotype that gets assembled is the one containing the het SV, you will see the SV; otherwise, you will not. This puts an upper limit of 50% on your sensitivity for het SVs.

You could improve on this a bit by reducing the amount of bubble removal (item 2 of my previous message). But even with that, your sensitivity for het SVs will not be great, and in addition you will need to do some non-trivial processing of the GFA assembly graph.

I am working on Shasta improvements that should help with this, but this work is not yet at a usable stage.

paoloczi commented 3 years ago

I am closing this due to lack of further discussion, but feel free to reopen it or create a new issue if new discussion topics emerge.