chanzuckerberg / shasta

[MOVED] Moved to paoloshasta/shasta. De novo assembly from Oxford Nanopore reads

Advice on optimising assembly for a region containing segmental duplication #272

Closed smf96 closed 2 years ago

smf96 commented 2 years ago

Hello,

I would really appreciate some advice on which parameters I could tweak to improve my assembly. I have ONT data from a cas9 enrichment library prep where my target is a 200kb region in the human genome. This region contains a large segmental duplication with 2x 90kb paralogous blocks. So far I have aligned the reads, called SNVs, phased them, and extracted the reads that map to the locus per haplotype. I want to de novo assemble this region per haplotype because, due to mapping/assembly issues with short reads at this highly homologous locus, it's not clear how far we can trust the current reference, and some individuals harbour CNVs too.

I would really like to use Shasta for my assemblies as it's very impressive how fast it runs, but I can't manage to get the same output (1 contig per haplotype) as when I use Canu. The following is example output from just one haplotype's reads.

I have used the 'Nanopore-Oct2021' config, which produced 2 contigs (69.5kb and 81.8kb), and here is the feedback:

[screenshot: assembly summary output]

I also used the 'Nanopore-Oct2021' config but changed the min read length to 6kb. That generated 2 contigs of 48.9kb and 81.4kb. Here's the feedback for that:

[screenshot: assembly summary output]

I then used the config made by GenerateConfig.py but had to change the min read length to 4.5kb, as otherwise I didn't have enough coverage. That produced two contigs of 66.1kb and 81.9kb. Here's the feedback from that:

[screenshot: assembly summary output]

Here are the contents of 'generatedShasta.conf':

[Reads]
minReadLength = 10000
desiredCoverage = 12000000
noCache = True

[Kmers]
k = 10

[MinHash]
minHashIterationCount = 10
minBucketSize = 5
maxBucketSize = 30
minFrequency = 5

[Align]
alignMethod = 3
downsamplingFactor = 0.05
matchScore = 6
maxSkip = 100
maxDrift = 100
maxTrim = 100
minAlignedFraction = 0.1
minAlignedMarkerCount = 10
sameChannelReadAlignment.suppressDeltaThreshold = 30

[ReadGraph]
creationMethod = 2

[MarkerGraph]
simplifyMaxLength = 10,100,1000,10000,100000
refineThreshold = 6
crossEdgeCoverageThreshold = 3
minCoverage = 0

[Assembly]
consensusCaller = Bayesian:guppy-3.6.0-a
detangleMethod = 2
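
For context, I'm passing this config file to Shasta with --config, roughly like this (a sketch; the input file and output directory names are placeholders):

#sketch only - paths are placeholders
shasta --input reads.fq --config generatedShasta.conf --assemblyDirectory run_generatedConfig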

Any guidance would be much appreciated!!

Thank you

paoloczi commented 2 years ago

Do you know what version of the Guppy base caller was used to create these reads?

Please also attach AssemblySummary.html from the assembly done with the Nanopore-Oct2021 config.

Is it possible that there is a bit of coverage drop near the edge of the regions of interest?

smf96 commented 2 years ago

I'm not sure exactly which version of Guppy was used; the reads are from 2 separate cas9 runs from March and November this year, basecalled in real time by MinKNOW with the HAC model selected.

Here is the AssemblySummary.html:

[screenshots: AssemblySummary.html]

Here is a screenshot from IGV; it's about 30x at its lowest points: [screenshot: IGV coverage plot]

paoloczi commented 2 years ago

I am not familiar with this base caller. I will research it and ask around tomorrow.

Even though the coverage plot is reasonable, the pileup has strange features, with several sets of reads at exactly the same starting and ending positions. Do you know the reason for that? If the reads are really distributed in that way, that could be a problem for the assembler. Any possibility that the data set contains large numbers of duplicated or approximately duplicated reads? Output file ReadGraphComponents.csv could contain some useful information in that regard, if you could post that too.

Also, do you know if the two copies are reverse complemented relative to each other? If they are, adding --ReadGraph.strandSeparationMethod 2 to the command line might help.
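
For example, something along these lines (a sketch; the input file name is a placeholder):

shasta --input reads.fq --config Nanopore-Oct2021 --ReadGraph.strandSeparationMethod 2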

If you would be willing to share the reads privately, I would be happy to experiment a bit. You can find my e-mail address in the Shafin 2020 paper listed in the top level page (README) of the Shasta repository on GitHub. If you share the reads, I will not share them with anybody else, I will only use them for this investigation, and I will delete all copies when done.

smf96 commented 2 years ago

I have just checked the reports generated from the sequencing runs, and MinKNOW was using Guppy v4.2.3 in March and v5.0.16 in November.

The peculiar pileup is due to the cas9 targeting design from the first run. The 200kb locus was targeted using a tiling approach with 2 pools. Each pool contains 5 x 30-40kb regions of interest (ROIs) that overlap with the other pool's ROIs. Each ROI is excised using 4 probes, 2 targeting the forward strand and 2 targeting the reverse. This is recommended by Nanopore for maximum coverage because of the redundancy it provides against inefficient cutting. The pileup of shorter reads occurs because both outer target sites are cut with high efficiency. For the second run I changed my cas9 design to target much longer ROIs, with only 1 cut per ROI.

Here is the output from ReadGraphComponents.csv: [screenshot: ReadGraphComponents.csv]

That would be brilliant, I have sent you a link for the reads through my university's drop-off service. Thank you very much, I really appreciate it.

paoloczi commented 2 years ago

I had not noticed from your initial message that this was a targeted library - I was under the impression that the reads had been extracted from a full genome run via mapping. Entirely my fault, as your message was very clear about that.

Shasta is not well suited to work with targeted libraries as it generally assumes approximately uniform coverage. This is probably the reason that it was not able to assemble your region in one piece, at least with standard settings.

I will experiment a bit with the reads you made available to see if I can come up with custom assembly parameters that work.

paoloczi commented 2 years ago

With the standard config, the assembly is mixing two highly similar copies of a portion of your region of interest. I was able to obtain a single contig with manual selection of alignment criteria and other assembly parameters, using the following command line options:

--input *.fq --config Nanopore-Oct2021 --Kmers.k 10 --Reads.minReadLength 15000 --MinHash.minHashIterationCount 100 --Align.minAlignedMarkerCount 800 --Align.minAlignedFraction 0.7 --Align.maxSkip 30 --Align.maxTrim 35 --Align.maxDrift 20  --ReadGraph.creationMethod 0 --MarkerGraph.minCoverage 20

(This is a long line - make sure to copy it in its entirety - use the scroll bar to scroll all the way to the right if necessary).

With these options, Shasta 0.8.0 assembles a single gfa segment (contig) of 183207 bases that covers approximately chr1:161500-161683 Kb in hg38 coordinates, which is your region of interest. This assembly ran in about 30 seconds on my laptop.

smf96 commented 2 years ago

Wow that is fantastic thank you so much

While I'm super impressed with how rapidly it runs, unfortunately I don't seem to be getting the same results as you, I tried these commands:

[screenshot: commands used]

and the hap1 reads are generating 2 contigs of 81.8kb and 69.3kb, whilst hap2 reads are producing 2 contigs of 67.4kb and 98.7kb. I can't see how I've changed the command you gave - I even copied and pasted it! Do you have any idea how that could happen?

paoloczi commented 2 years ago

Oops - I did not realize you wanted to assemble the two haplotypes separately. My result was for a single assembly using all of the reads. Tomorrow I will do some more experiments using one haplotype at a time. It may also be possible to use Shasta phased assembly, which is a new feature in release 0.8.0.

I will let you know what I find.

smf96 commented 2 years ago

Oh cool the phased assembly workflow sounds great too

We were originally aiming to assemble each haplotype separately, as we were assuming the two copies of the segmental duplication would cause some confusion anyway, so asking it to delineate four copies may be too much. Plus, in some individuals there are also CNVs, which I guess will add even more possibilities for mixing up. This sample (CRA183) is suspected to have no additional CNV (on top of the segmental duplication), so it is hopefully the simplest one to test Shasta out with.

Thanks so much for your time on this!

paoloczi commented 2 years ago

With some additional tweaking of assembly parameters I was able to obtain a single contig assembly for each haplotype.

For hap1 I used

shasta-Linux-0.8.0 --input *_hap1_*.fq --assemblyDirectory hap1 --config Nanopore-Oct2021 --Kmers.k 10 --MinHash.minHashIterationCount 100 --Align.minAlignedMarkerCount 1000 --Align.minAlignedFraction 0.7 --Align.maxSkip 30 --Align.maxTrim 35 --Align.maxDrift 20  --ReadGraph.creationMethod 0 --ReadGraph.strandSeparationMethod 2 --ReadGraph.maxChimericReadDistance 0 --MarkerGraph.minCoverage 8

This gave a single contig, 183 Kb in length, covering chr1:161500-161683 Kb (hg38 coordinates).

For hap2 I used

shasta-Linux-0.8.0 --input *_hap2_*.fq --assemblyDirectory hap2 --config Nanopore-Oct2021 --Kmers.k 10 --MinHash.minHashIterationCount 100 --Align.minAlignedMarkerCount 800 --Align.minAlignedFraction 0.7 --Align.maxSkip 30 --Align.maxTrim 35 --Align.maxDrift 20  --ReadGraph.creationMethod 0 --ReadGraph.strandSeparationMethod 2 --ReadGraph.maxChimericReadDistance 0 --MarkerGraph.minCoverage 8

This gave a single contig, 191 Kb in length, covering chr1:161496-161687 Kb (hg38 coordinates).

The slight difference in the required options for the two haplotypes is probably related to the fact that different base callers were used to generate the reads for the two haplotypes. Also note that the above options use the Shasta Bayesian consensus caller for Guppy 5.0.7. For the haplotype done using Guppy 4.2.3 it would be best, for optimal assembly accuracy, to add the following to the above options:

--Assembly.consensusCaller Bayesian:guppy-3.6.0-a

I did not do that because you did not specify which of the two haplotypes was done with Guppy 4.

We don't have a Bayesian consensus caller for Guppy 4, and the above choice is the best approximation.

I also tried Shasta phased assembly. You are right that separating four copies is hard, and Shasta phased assembly uses a diploid assumption. Using Shasta 0.8.0 I was only able to obtain a partially phased assembly. But with a current Shasta development version I was able to obtain an assembly that not only resolves the various gene copies correctly, but is also fully phased. This functionality will be available in the next Shasta release, presumably Shasta 0.9.0. That assembly looks like this:

[screenshot: phased assembly graph]

The 163 Kb green segments are the two phased haplotypes, which cover approximately chr1:161505-161669 Kb (hg38 coordinates). The red segments are regions that are assembled haploid near the edge of the test region, where coverage is low. The two haplotypes have almost the same length, which indicates a lack of structural variants between the two.

This demonstrates that Shasta phased assembly would be able to separate haplotypes even if you had not done separate sequencing runs.

I am attaching the phased assembly in case you want to look at it. The contigs you care about are the two with names that begin with PR.

Assembly-Phased.fasta.gz

smf96 commented 2 years ago

Wow how exciting that is all really fantastic! Look forward to your next release

With your commands in v0.8.0 I have been able to also generate 1 contig per haplotype - really brilliant thank you

I may not have explained myself properly though - the different haplotypes were not generated in separate runs at different times. Instead, both haplotypes were targeted but with different cas9 designs: the first cas9 design, sequenced in March, had >30 separate cuts to excise 10 ROIs (30-40kb) with 4 cuts each, whereas the second run in November had 6 ROIs (70-80kb) with 1 cut each. What else might explain the need for different commands, do you think? Your documentation explains that --Align.minAlignedMarkerCount can be used to discard alignments that are too short to be considered reliable; could it just be that they have different read length distributions?

Also relating to this, I would be very interested in understanding a bit more about how you were able to optimise the parameters. I am now trying it on another sample, but this individual is suspected to have a CNV, in particular a copy number loss. One haplotype's reads have generated 1 contig of 120kb (which is super exciting for us as we've never been able to produce this or understand the breakpoints before!) but the other haplotype is still making 2 contigs of around 80kb each; I've tried with --Align.minAlignedMarkerCount set to both 1000 and 800. Any advice would be much appreciated

Thanks again!

paoloczi commented 2 years ago

Glad this helped and thank you for clarifying your process.

My interpretation of why the two haplotypes required different parameters was the different base callers used, as I think you said at some point that one was done with Guppy 4 and one with Guppy 5. But it could also just be a fluke - certainly a possibility given the unusual read distributions, as shown in your IGV plot above.

Regarding optimizing assembly parameters, that is certainly a sore point. For human genomes, which is what I work on most frequently, the standard configurations work quite well. But that is not necessarily the case in general, as each genome is different, and read characteristics also vary widely. That means that manual optimization of assembly parameters is often necessary to get optimal results, a process that requires some knowledge of Shasta computational methods.

Keep in mind the Nanopore-Oct2021 config uses --ReadGraph.creationMethod 2, which internally does some optimization of alignment parameters, and requires running with highly tolerant alignment criteria, which are then internally refined. This means that if you want to do manual optimization of assembly parameters you do need to also set --ReadGraph.creationMethod 0, and then also set reasonable values for all alignment criteria. One process I often use for that is to first do a run with Nanopore-Oct2021. The output of that will include a message summarizing the alignment criteria that were automatically selected. The message will look something like this:

        alignedFraction:        0.305
        markerCount:            155
        maxDrift:               15.5
        maxSkip:                50.5
        maxTrim:                69.5

You can then use these values as starting points for subsequent runs with --ReadGraph.creationMethod 0.
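
For example, the values above could be plugged back in roughly like this (a sketch; round the numbers as you see fit and adjust them to your data):

shasta --input reads.fq --config Nanopore-Oct2021 --ReadGraph.creationMethod 0 --Align.minAlignedFraction 0.3 --Align.minAlignedMarkerCount 150 --Align.maxDrift 15 --Align.maxSkip 50 --Align.maxTrim 70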

Another thing that is often necessary for optimal assemblies is to increase the number of MinHash iterations, as I have done for your assemblies (--MinHash.minHashIterationCount 100). In this way, the assembly has a better chance of finding most of the correct alignments between reads. Unfortunately, --ReadGraph.creationMethod 2 does not work well when the number of MinHash iterations is increased.

For your assemblies I also reduced the marker length (--Kmers.k) from 14 to 10. I did that because, while exploring the assembly, I noticed that with k=14 the alignments had very low quality, indicating that your reads have lower quality than we normally see with Guppy 5. And of course, changing k requires reoptimizing the alignment criteria, which all operate in marker space.

A tool that is often useful in this process is Bandage. Using Bandage, I was able to see that in the initial assembly with 2 contigs there was a loop caused by a "bridge", so I knew that I had to tighten alignment criteria. Of course if you tighten alignment criteria too much you end up with a fragmented assembly, so there is a trade-off there.
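
For reference, Bandage can also be driven from the command line to render the assembly graph to an image - a sketch, assuming Bandage is installed and you run it from the assembly directory:

Bandage image Assembly.gfa assembly_graph.png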

This gives you a bit of an idea of the process I used to optimize parameters for your assemblies. However, I understand that this may be too much to ask of a casual user. It does help that assemblies run quickly, especially in your case where the region you are working with is so small, so iterating should not be too painful. If after a few experiments along the above lines you are not successful with your new data, feel free to share them too and I will be happy to try it myself.

smf96 commented 2 years ago

Thank you so much for such a thorough explanation!

I have given it a go but unfortunately I'm not really getting anywhere...


#First tried the standard config:
../../shasta-Linux-0.8.0 --input ../CRA133_2cas9_canu/CRA133_combine2cas9_hap2_FCGRreads.fq --config Nanopore-Oct2021 --assemblyDirectory Shasta_CRA133_2cas9_hap2_STANDARDconfig

#This gave us the automatically selected criteria: alignedFraction 0.445, markerCount 175, maxDrift 13.5, maxSkip 37.5, maxTrim 53.5
#Generated 2 contigs of 44kb and 82kb 

#Now want to try that criteria out:
../../shasta-Linux-0.8.0 --input ../CRA133_2cas9_canu/CRA133_combine2cas9_hap2_FCGRreads.fq --config Nanopore-Oct2021 --assemblyDirectory Shasta_CRA133_2cas9_hap2_STANDARDconfig_plusCRITERIA --Align.minAlignedFraction 0.4 --Align.minAlignedMarkerCount 150 --Align.maxDrift 10 --Align.maxSkip 37 --Align.maxTrim 35 --ReadGraph.creationMethod 0
#Generated 2 contigs of 43kb and 82kb

#Now want to increase the number of MinHash iterations:
../../shasta-Linux-0.8.0 --input ../CRA133_2cas9_canu/CRA133_combine2cas9_hap2_FCGRreads.fq --config Nanopore-Oct2021 --assemblyDirectory Shasta_CRA133_2cas9_hap2_STANDARDconfig_plusCRITERIA_MinHash100 --Align.minAlignedFraction 0.4 --Align.minAlignedMarkerCount 150 --Align.maxDrift 10 --Align.maxSkip 37 --Align.maxTrim 35 --ReadGraph.creationMethod 0 --MinHash.minHashIterationCount 100
#Generated 2 contigs of 66kb and 82kb 

#Next want to try reducing marker length, but this will change the criteria so need to go back and check what that's gone to:
../../shasta-Linux-0.8.0 --input ../CRA133_2cas9_canu/CRA133_combine2cas9_hap2_FCGRreads.fq --config Nanopore-Oct2021 --assemblyDirectory Shasta_CRA133_2cas9_hap2_STANDARDconfig_kmer10 --Kmers.k 10  
#This gave us the new criteria of: alignedFraction 0.535, markerCount 265, maxDrift 10.5, maxSkip 27.5, maxTrim 41.5
#Generated 2 contigs of 59kb and 82kb

#Try new criteria out:
../../shasta-Linux-0.8.0 --input ../CRA133_2cas9_canu/CRA133_combine2cas9_hap2_FCGRreads.fq --config Nanopore-Oct2021 --assemblyDirectory Shasta_CRA133_2cas9_hap2_STANDARDconfig_kmer10_plusCRITERIA --Align.minAlignedFraction 0.5 --Align.minAlignedMarkerCount 260 --Align.maxDrift 10 --Align.maxSkip 27 --Align.maxTrim 41 --ReadGraph.creationMethod 0
#Generated contigs of 66kb and 82kb

#Try new criteria out with increasing number of MinHash iterations:
../../shasta-Linux-0.8.0 --input ../CRA133_2cas9_canu/CRA133_combine2cas9_hap2_FCGRreads.fq --config Nanopore-Oct2021 --assemblyDirectory Shasta_CRA133_2cas9_hap2_STANDARDconfig_kmer10_plusCRITERIA_MinHash100 --Align.minAlignedFraction 0.5 --Align.minAlignedMarkerCount 260 --Align.maxDrift 10 --Align.maxSkip 27 --Align.maxTrim 41 --ReadGraph.creationMethod 0 --MinHash.minHashIterationCount 100
#Generated 2 contigs of 66kb and 82kb

I hope I have followed along alright, but as you said it's quite a lot to take in as a casual user

If it's okay with you, I have sent you a link to the data for this individual. I have included the hap1.fastq and hap2.fastq, which I'm hoping it will be possible to assemble separately into 1 contig each. I was also interested in seeing if you could try the v0.9.0 phased assembly workflow out on the unphased.fastq, as I would be really eager to know whether we are introducing some sort of bias during the phasing stage.

paoloczi commented 2 years ago

Of course, when I receive the data I will be happy to give it a try. The experiments you ran are entirely reasonable and along the lines I suggested. I also have ways to dig into details of the assembly that can provide more specific suggestions. And can you confirm what base caller was used for this new data set?

I can certainly try phased assembly, but can you clarify what unphased.fastq is? Is it simply a concatenation of hap1.fastq and hap2.fastq, or something else?

smf96 commented 2 years ago

Amazing thank you!

So this new dataset is again the merging of sequencing data from two different cas9 designs (1 with lots of shorter targets, the other with fewer longer targets) so each haplotype will contain a mixture of reads that have been basecalled with v4.2.3 (first cas9 design performed in March) and v5.0.15 (second cas9 design performed recently).

The unphased.fastq is actually named CRA133_2cas9_FCGRreads.fq in what I've sent. It is not a concatenation of the hap1.fastq and hap2.fastq.

The unphased.fastq is simply the output after I have aligned all my reads to the hg38 ref then extracted the reads that align (fully or in part) to my 200kb locus of interest.

The hap1.fastq and hap2.fastq files are the output after I have aligned all the reads to the hg38 ref, used nanopolish to call SNVs, used WhatsHap to phase the VCF and tag the reads within the alignment .bam with the haplotype they belong to, split the .bam into hap1.bam and hap2.bam, and then extracted the reads from there.
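
Roughly, the commands look something like this (a sketch from memory with placeholder file names; the nanopolish SNV calling and WhatsHap phasing steps are omitted, and the exact flags are worth double-checking against each tool's docs):

#align to hg38 and sort
minimap2 -ax map-ont hg38.fa all_reads.fq | samtools sort -o aligned.bam
samtools index aligned.bam

#tag each read in the .bam with its haplotype using the phased VCF, keeping a per-read haplotype list
whatshap haplotag --reference hg38.fa -o tagged.bam --output-haplotag-list haplotags.tsv phased.vcf.gz aligned.bam

#split into hap1.bam and hap2.bam, then extract the reads
whatshap split --output-h1 hap1.bam --output-h2 hap2.bam tagged.bam haplotags.tsv
samtools fastq hap1.bam > hap1.fastq
samtools fastq hap2.bam > hap2.fastq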

Does that make sense?

paoloczi commented 2 years ago

Thank you. In the meantime I received the new files and I plan to experiment with them tomorrow.

smf96 commented 2 years ago

Perfect, thank you so much!

paoloczi commented 2 years ago

Here is a summary of my experiments on your second data set.

Hap1

Using the following options

--config Nanopore-Oct2021 --Kmers.k 10

I was able to get a single contig assembled. That single contig covers a 180 Kb range of chr1, chr1:161503674-161683381. However, the assembled contig is only 97889 bases, which suggests a deletion of approximately 82 Kb. From manual mapping it appears the deletion would result in removal of the entire FCGR2C and FCGR3B genes from this haplotype. This is a bit hard to believe, but on the other hand it looks like there are many long known CNVs covering this region.

Hap2

For hap2 I was not able to find options that result in a single contig being assembled. All of the assemblies had a loop. This may boil down to a lack of a sufficient number of reads long enough to span the entire high-similarity area, which results in the loop.

However, mapping the two assembled segments to hg38 suggests that most of your test region is covered, and no deletion is present.

Combined dataset

For the combined dataset, I was able to get a partially phased assembly using my current development version. The assembly looks like this:

[screenshot: phased assembly graph]

Here, the green segments are the two haplotypes of each section assembled diploid, and the red segments are the regions assembled haploid.

According to this assembly, the two haplotypes are similar to each other. Mapping the various pieces to hg38 suggests that the two haplotypes cover most of the test region and no deletion is present. This disagrees with the finding from the hap1 assembly. One possible interpretation is that your phasing process was confused by the presence of similar sequence, but this is just a hypothesis.

So overall a pretty inconclusive picture, I am afraid.

smf96 commented 2 years ago

Hello Paolo,

Thank you SO much for doing this and sorry for the late reply, lots going on leading up to christmas and took a while to get my head round it all!



So the 97.9kb contig produced from hap1 that results in the deletion of FCGR2C and FCGR3B is actually much more like what we were expecting! There are known large CNVs in this locus around that region. Is the reason I got a different contig generated because the parameters I used were optimised for the previous sample? And have you arrived at the optimised parameters for this sample's hap1 because you went through the aforementioned process? Like, did you go through the steps you previously explained and manage to get 1 contig after the fourth stage? E.g. in the code I inserted a couple of messages ago, you had tried the first 3 options? So you tried the standard config, then tried the automatically selected criteria, then increased the number of MinHash iterations, but none of those worked, so then you tried reducing the marker length to 10 and that worked, so you didn't need to go any further?



That is a shame about hap2. Do you have any suggestions about how to improve this - for example, would filtering the reads going into the assembly make any sort of improvement? Or is it basically that I will need to get back into the lab and generate longer reads for this sample?



Very interesting that the combined dataset contradicts the individual assembly of hap1. Would you mind explaining a bit more about what you mean by ‘the phasing process was confused by the presence of similar sequence’? I'm still playing around with filtering steps. For example, all reads (regardless of basecall quality) are mapped, all mapped reads are used to call SNVs, and all SNVs are used to phase. Perhaps introducing some thresholds in these stages would help?

paoloczi commented 2 years ago

Yes, regarding hap1 that was approximately my process, if I remember correctly. One important ingredient is that I used the Shasta functionality to explore assembly results and noticed that the quality of the alignments was significantly worse than we normally see with Guppy 5 reads. To compensate for that, I decided to reduce the k-mer length. With shorter markers, there will be fewer markers affected by errors, and therefore better alignments.

Are you sure the "super accuracy" option was used when basecalling these reads with Guppy 5? Without that, you don't get the significant quality improvements that came with Guppy 5. That option results in Guppy running quite a bit slower, but believe me, it is totally worth it given the dramatic improvement in sequence quality.

Regarding the phasing, that was just a hypothesis on my part and is not necessarily correct. My thinking was that the phasing process relies on the assumption that there are two copies of almost identical sequence, and that the differences are due to heterozygosity - differences between haplotypes. But here there is a long region where there are actually four very similar copies, not two. And some differences between the two segmental duplication copies could have effects similar to differences between haplotypes, mixing up the phasing process.

Given all of this, my suggestion for you would be, if possible, to repeat the sequencing using a protocol that gives you longer reads. Your reads had about 35 Kb N50 after discarding the reads shorter than 10 Kb, and these days my colleagues at UCSC routinely get 50 Kb N50 for the standard protocol and 70-100 Kb N50 when using Ultra-Long (UL) protocols. Also, I recommend using the "super accuracy" option when running Guppy 5 if you have not done that already.
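
If it helps, a quick way to check N50 after a length cutoff is something like this (a sketch, assuming seqkit is available; the file name is a placeholder):

#keep reads of at least 10 Kb, then report N50 and other length statistics
seqkit seq -m 10000 reads.fq | seqkit stats -a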

That is a very interesting region you are looking at. Good luck with your work.

paoloczi commented 2 years ago

One more thought. If there was really a long deletion, it should be visible as a coverage drop, because the deleted region has only coverage from one haplotype. You could check on that using your existing data.

Also, if the deletion exists, there should be reads that "jump" across the deletion. You could look for those reads. The phased assembly I ran did not find a sufficient number of them to decide that a haplotype with the deletion existed, but you may have better luck looking for such reads with IGV or other tools.
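
Something along these lines might work for both checks (a sketch; the coordinates and file names are placeholders):

#per-base coverage across the region - a heterozygous deletion should show as roughly halved depth
samtools depth -a -r chr1:161500000-161690000 aligned.bam > region_depth.tsv

#reads whose alignment record carries a supplementary alignment (SA tag) - candidates for reads that jump across the deletion
samtools view aligned.bam chr1:161500000-161690000 | grep 'SA:Z:' | cut -f1 | sort -u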

smf96 commented 2 years ago

Hello Paolo, hope you’ve enjoyed a nice Christmas break! 



Thank you very much for all your help with this, it is indeed an interesting region... albeit frustrating at times. There is a bit of a drop in coverage like you suggested (see below), but it isn't that clear because of the different cas9 designs used, plus we know the shorter reads often multi-map due to the region's homology. It would be fantastic if there were reads that mapped across the deletion; I will have a look in IGV.
[screenshot: coverage plot]

I have also just checked, and no, it wasn't the SUP model after all but the HAC model that was selected. I will try re-basecalling with the newest available Guppy and see if this improves the assemblies that weren't able to produce 1 contig per haplotype.



Wow, that's pretty impressive that your colleagues routinely get 50kb N50, and I would love to replicate that - do you know of a published paper or protocol I can access? Is this with the ligation sequencing kit? I have been wanting to try read-until rather than cas9 for targeting my region but wasn't sure if my read lengths/throughput with a WGS kit would be good enough.

Cheers!

paoloczi commented 2 years ago

Using the "super accuracy" model could make a difference, but it would probably require more tweaking of assembly parameters. The much higher quality of the reads will permit using tighter alignment criteria, which in turn will make it easier for the assembly to avoid jumps between the two regions. The resulting assemblies should be more likely to be linear. But this will require some additional experimentation with assembly parameters. When you have the new reads, if you give me access to them I can redo some assembly experiments to try and help clarify the situation.

I know little about sequencing protocols, but I will ask one of my colleagues to contribute here.

smf96 commented 2 years ago

Hello Paolo,

That would be really super thank you

So I have finally finished rebasecalling all the reads for CRA133 with the Guppy v6.0.1 super-accurate model. I have sent you a link to the data and would really appreciate your insight about whether it can make a big difference to the assembly of our region, and whether we should include a filtering step.

I have sent:

1) All the reads that map to the region (to try the phased assembly workflow with the development version of Shasta)
2) All haplotype 1 reads
3) All haplotype 2 reads
4) Reads that map to the region that have a quality score of >10
5) Haplotype 1 reads that have a quality score of >10
6) Haplotype 2 reads that have a quality score of >10

paoloczi commented 2 years ago

Thank you - I will experiment with this soon and will let you know what I find.

smf96 commented 2 years ago

Fantastic thank you!

paoloczi commented 2 years ago

It looks like there must have been an error in the creation of some or all of the fastq files. Some or all of the reads appear twice. For example:

grep -n 8022ee27-0a19-4617-8b6e-73141fd29979 CRA133_gupV6_combine2cas9_allReads_hap1_FCGRreads.fq 
1:@8022ee27-0a19-4617-8b6e-73141fd29979
5:@8022ee27-0a19-4617-8b6e-73141fd29979

Shasta does not deal well with this situation as it is not designed to deal with duplicated reads. If you provide properly deduplicated versions of the fastq files I can continue this investigation.
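
For what it's worth, deduplicating by read name could be as simple as something like this (a sketch, assuming seqkit is available; the file name is a placeholder):

#drop records whose read name has already been seen
seqkit rmdup --by-name reads.fq > reads_dedup.fq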

smf96 commented 2 years ago

Oh dear oopsie I'm sorry

Just had a quick look and it's only the unfiltered (all qscores) fastq files that have duplicated read names - I must have concatenated the reads together twice somewhere.

I will go back and correct those, but I think the passQ10 fastq files are all okay?

smf96 commented 2 years ago

Hello Paolo,

Hope you're well and sorry for the delay

I have resent the fastq without duplicates, look forward to hearing what you're able to do with them!

paoloczi commented 2 years ago

Thank you - I will experiment with these data soon.

paoloczi commented 2 years ago

With your latest files, I was able to get a single contig assembled for both hap1 and hap2 as follows:

--config Nanopore-Oct2021 --Reads.minReadLength 25000 --Kmers.k 12 --MinHash.minHashIterationCount 100 --Align.minAlignedMarkerCount 800 --Align.minAlignedFraction 0.7 --Align.maxSkip 30 --Align.maxTrim 30 --Align.maxDrift 20 --ReadGraph.creationMethod 0 --ReadGraph.maxAlignmentCount 12

For hap1, I got a single contig, 179672 bases long, that maps with 99.64% identity to hg38 chr1:161503679-161683370, which is 179691 bases of reference. Therefore the assembled length is almost identical to the reference length, and hap1 is probably free of structural variants relative to hg38.

For hap2, I got a single contig, 157276 bases long, that maps to chr1:161473123-161712306, but with a deletion at chr1:161594880-161649507. So this is a reference range of 239183 bases with a 54627 base deletion. The reference length minus the deletion is 184556 bases, but the assembled length is only 157276 bases, so the deletion is actually another ~27 Kb longer, or about 82 Kb in total (this is because some portions of the assembled segment map to two places in the reference).

Anyway, I will let you sort out the details of the deletion, but these two assemblies say loud and clear that you have a large heterozygous deletion.

If you have problems reproducing these results with Shasta 0.9.0, let me know and I can post my assemblies or share them privately.

An 82 Kb deletion is also consistent with my partial results (see above) on your earlier dataset. In that case the deletion was on hap1.

I will also experiment with Shasta phased diploid assembly on the combined dataset and see if I can confirm the heterozygous SV.

paoloczi commented 2 years ago

Phased assembly is not able to resolve this segmental duplication while also separating the two haplotypes. This is not surprising, as Shasta mode 2 (phased) assembly makes a hard assumption that two copies of the sequence are present. In a segmental duplication, there are two haplotypes for each copy of the segdup, so the total number of copies can be much higher.

Resolving these harder regions while also phasing the haplotypes will be the subject of future work on Shasta.

rusalkaguy commented 2 years ago

@paoloczi Just to add to this thread - we have generated similar data with a similar protocol, and had also hoped that Shasta would be able to assemble our seg dup in order to study structural variation, as large insertions/duplications are quite common in these seg dups. If there is any future work to support Shasta successfully assembling this type of diploid seg dup, especially with targeted ONT sequencing, we'd love to test it, and can offer a lot of test data.

paoloczi commented 2 years ago

There is some work in progress to improve Shasta's ability to resolve segmental duplications and other hard regions. But it is still at an early stage. As things develop, I will keep in mind your offer to participate in the testing, and thank you for that.

Keep in mind, however, that Shasta is not designed to do well with targeted sequencing, because of the high variability in coverage that usually occurs in the target regions.