bcgsc / LongStitch

Correct and scaffold assemblies using long reads
GNU General Public License v3.0

Output files interpretation #79

Closed gforg34 closed 5 months ago

gforg34 commented 6 months ago

Hi once again @lcoombe ,

LongStitch ran successfully for me, but I found the output files difficult to interpret. To give you more details: my reference genome is a tetraploid plant (14 chromosomes) with many unplaced scaffolds (~148k) and uncharacterized regions (NNNs). I also recently generated long-read sequencing data for this species, but from a different strain/variety.

This led me to LongStitch, since I thought it does both things: it uses the gap-filling option of ntLink to fill gaps with the long-read data, and it uses scaffolding with Tigmint and ARCS to place the unplaced scaffolds in the genome, if I understand correctly.

However, the scaffolding was not what I expected: the chromosomes in the output were split into multiple scaffolds and renamed. I started with around 148k scaffolds and now have around 250k. Although some chromosomes were split into scaffolds, strangely enough 3 of them were either stitched back together or never split. I was also expecting the scaffolds to be stitched together at the end, but I don't know whether that is part of the LongStitch process, or even of LINKS.

Example below from the FINAL OUTPUT file (grep of headers): scaffold1, 2, and 3 are three complete chromosomes (see size), and the rest are split into scaffolds:

>scaffold1,777835607,f227059Z777835607
>scaffold2,747227478,f217998Z747227478
>scaffold3,724204431,f208933Z724204431
>scaffold4,3781542,f139618Z3344213k10a0m3504_f194285z376997k10a0m3140_f194289z60332
>scaffold5,3032859,f80929Z3032859
>scaffold6,3052659,f181276z90774k4a0m7738_f162352z149662k6a0m5387_f2063Z2812223
>scaffold7,2921516,f2230Z2733579k7a0m5387_f139634z187937
...
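
For readers trying to interpret headers like the ones above, here is a minimal Python sketch. It assumes the LINKS-style scaffold header convention (scaffold name, total size, then a `_`-separated chain in which `f`/`r` is the orientation, `z`/`Z` precedes the piece length, `k` the number of links, `a` the link ratio, and `m` the estimated gap to the next piece). That interpretation is an assumption and is not stated in this thread; `parse_header` is a hypothetical helper name.

```python
import re

# One chain token, e.g. "f139618Z3344213k10a0m3504" or "f194289z60332".
# The k/a/m part is absent on the last piece of a chain.
PIECE = re.compile(
    r"(?P<orient>[fr])(?P<id>\d+)[zZ](?P<length>\d+)"
    r"(?:k(?P<links>\d+)a(?P<ratio>[\d.]+)m(?P<gap>-?\d+))?"
)


def parse_header(header):
    """Split a scaffold header into (name, size, list of piece dicts)."""
    name, size, chain = header.lstrip(">").split(",", 2)
    pieces = []
    for token in chain.split("_"):
        match = PIECE.fullmatch(token)
        if match is None:
            raise ValueError(f"unrecognized chain token: {token!r}")
        fields = match.groupdict()
        pieces.append({
            "orient": fields["orient"],
            "id": fields["id"],
            "length": int(fields["length"]),
            # Assumed meaning: estimated gap (bp) between this piece and the next.
            "gap_after": int(fields["gap"]) if fields["gap"] is not None else None,
        })
    return name, int(size), pieces


header = ">scaffold4,3781542,f139618Z3344213k10a0m3504_f194285z376997k10a0m3140_f194289z60332"
name, size, pieces = parse_header(header)
print(name, size, [(p["id"], p["length"]) for p in pieces])
```

In the headers shown above, the reported size equals the sum of the piece lengths (for scaffold4, 3344213 + 376997 + 60332 = 3781542), so under this reading a single-piece header such as scaffold1 would be one input sequence carried through without any joins.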

Let me know what you think about this situation and what you would recommend. I was thinking of adjusting the LINKS or ARCS parameters to stitch the scaffolds back into chromosomes, since those were the last steps before the final output.

Note 1: I also generated some statistics for >scaffold1,777835607,f227059Z777835607, which corresponds to a whole chromosome. Compared to the original chromosome, I got exactly the same results in terms of N counts, size, etc.

Note 2: I am thinking of changing the k-mer and window size, since that is recommended for optimization. The results above were generated using k_ntlink=32 and w=100. What do you think about increasing these?

lcoombe commented 6 months ago

Hi @gforg34,

Just want to ask a couple of follow-up questions so I make sure that I understand your data and objective correctly!

What is the contiguity of your input assembly? Are the assembly sequences organized into chromosomes and unlocalized scaffolds? Or are they just scaffolds that you want to contiguate to work towards chromosome-level?

Also, just making sure I understand - your baseline assembly is a different strain than the long reads that you are using with longstitch?

Thanks! Lauren

gforg34 commented 6 months ago

Hi @lcoombe ,

Yes, of course, I can share some stats with you. These are my genome assembly stats:

sum = 10677097747, n = 148296, ave = 71998.56, largest = 865950040  
N50 = 747227478, n = 7  
N60 = 724204431, n = 9  
N70 = 715386202, n = 10  
N80 = 684047826, n = 12  
N90 = 633698003, n = 13  
N100 = 384, n = 148296  
N_count = 353604870  
Gaps = 344589
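
For reference, the N50-style values in these stats can be reproduced from a list of sequence lengths. The sketch below assumes the stats tool uses the standard definition (the length of the sequence at which the running total, taken from longest to shortest, first reaches the given fraction of the assembly size); `nx` is a hypothetical helper name.

```python
def nx(lengths, fraction=0.5):
    """Return (NX length, number of sequences needed to reach it)."""
    ordered = sorted(lengths, reverse=True)
    target = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy example: total = 30, so N50 is reached at the second-longest sequence.
toy = [10, 8, 6, 4, 2]
print(nx(toy, 0.5))
print(nx(toy, 0.9))
```

Under this definition, "N50 = 747227478, n = 7" would mean that the 7 longest sequences together hold at least half of the 10.68 Gbp total.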

Yes, the assembly is organized into chromosomes and unlocalized scaffolds, so imagine it like this:

>chromosome1
ATGC..
>chromosome2
ATGG..
...
>chromosome14 
AAA..
>unlocalized_scaffold1
TTT...
>unlocalized_scaffold2
...
>unlocalized_scaffold148k

And finally, yes, my assembly is from a different strain than the long reads that I am using, but from the same region, so we expect them to be very similar.

I also changed the k-mer size (k-mer=32) and the window size (w=250), and the output was still exactly the same, so I don't think this changes anything in terms of contiguity. The last three chromosomes also remained intact again.

lcoombe commented 6 months ago

Hi @gforg34,

Thanks so much for clarifying!

So I guess your situation isn't really the expected one for LongStitch input. Generally, LongStitch inputs will be a draft assembly with reads, and we utilize the reads to correct and scaffold the assembly. The first step in the pipeline is Tigmint-long, which leverages the reads to cut the input assembly at putative misassemblies. Then, ntLink (and arks-long, optionally) will scaffold those pieces using the same long reads. And, like you mention, ntLink can also fill gaps.

So, you are seeing some chromosomes broken into smaller pieces wherever Tigmint is unable to find long-read support over regions of those sequences, which results in Tigmint breaking the sequences there. In these cases, I'd expect that it is simply due to large gaps or repeats in the chromosomes.

The limitation here for LongStitch is that it will only attempt to scaffold pieces together; it will not try to insert sequence into already existing gaps. The ntLink gap-filling only fills gaps that are generated by ntLink scaffolding. So, as-is, if you do not break the chromosomes at all, it will not be able to place the unlocalized scaffolds into the gaps. If you wanted to try to place the unlocalized scaffolds, you might have to do something like break the chromosomes at the gaps, retaining information in the headers that lets you know the order they were in. Then, you could try running LongStitch, and see if some unlocalized sequences are scaffolded into the sequences. At the end, you could 'recover' chromosomes using the existing knowledge about the order of the constituent pieces.

I would say, though - I have never tried that before, and it may or may not work. It is also not always advised to 'mix' genomic data from different strains into a single assembly - since then, the assembly becomes a bit of a mosaic. That being said, it really depends on your particular research question, so you're the best judge of that!
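
The break-at-gaps idea described above could be sketched roughly as follows. This is an untested illustration and not part of LongStitch: the header layout (`name|part=..|start=..|gap_after=..`), the `min_gap` threshold, and the function names are all hypothetical, and downstream tools may rename sequences, so the mapping from new names back to these headers would need to be tracked separately.

```python
import re


def split_at_gaps(name, seq, min_gap=10):
    """Split seq at runs of >= min_gap Ns, recording order and coordinates."""
    pieces = []
    start = 0
    index = 0
    for gap in re.finditer(r"[Nn]{%d,}" % min_gap, seq):
        if gap.start() > start:
            gap_len = gap.end() - gap.start()
            header = f"{name}|part={index}|start={start}|gap_after={gap_len}"
            pieces.append((header, seq[start:gap.start()]))
            index += 1
        start = gap.end()
    if start < len(seq):
        pieces.append((f"{name}|part={index}|start={start}|gap_after=0", seq[start:]))
    return pieces


def rejoin(pieces):
    """Rebuild the original sequence from the pieces and their headers."""
    out = []
    pos = 0
    for header, piece in pieces:
        fields = dict(field.split("=") for field in header.split("|")[1:])
        start = int(fields["start"])
        gap_after = int(fields["gap_after"])
        out.append("N" * (start - pos))  # only non-zero for a leading gap
        out.append(piece)
        out.append("N" * gap_after)
        pos = start + len(piece) + gap_after
    return "".join(out)


seq = "ACGT" + "N" * 12 + "GGCC" + "N" * 3 + "TT"
pieces = split_at_gaps("chromosome1", seq)
for header, piece in pieces:
    print(header, piece)
```

Gaps shorter than `min_gap` stay inside a piece, and `rejoin` only restores the original layout; merging in any newly placed unlocalized scaffolds would be a separate step.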

I hope that makes sense - let me know if you have any questions! Lauren

gforg34 commented 6 months ago

Hi @lcoombe ,

Thanks for your reply and suggestions on this. However, I have some further questions.

1) First of all, yes, my reference genome is a draft one, since it contains a lot of uncharacterized regions (NNNs) and is quite repetitive. If I understand correctly, Tigmint splits the chromosomes at the putative misassemblies/uncharacterized regions/gaps, and then ntLink scaffolds those pieces, both using the long-read data. Does the splitting happen because the long-read data cannot cover these uncharacterized areas well enough? In other words, the long reads are not sufficient to cover these areas completely, and that is why the stitching does not occur later on? Or did I miss something here?

2) Secondly, regarding your suggestion for the unlocalized scaffolds: you mean my best shot in this case is to break the chromosomes while keeping the order information in the headers, and then run LongStitch, right? But is that first part also part of the LongStitch pipeline? Is this more or less what Tigmint and ntLink do, or do you mean something else? Or did I misunderstand? Let me know what you think in this case. Thank you once more for your time dealing with this.

lcoombe commented 6 months ago

Hi @gforg34,

  1. Yes, Tigmint will split the assembly at regions that are not well-supported by the input long reads. When we don't have good support of a particular assembly region from the long reads, we take that as evidence that this region may be misassembled. When using a chromosome-scale assembly with many large gaps, a caveat there is that it may also break at very long gaps, simply because for long gaps, there may not be reads that are long enough to span that gap region (thus providing support). So, if Tigmint breaks at these very long gap regions, but there are no long reads that can span that region, then ntLink may not have the long-read information to scaffold that. It depends on what was used to generate the chromosome-level assembly - for example, using Hi-C data gives longer-range information than long reads can provide.

  2. The important distinction is that Tigmint will not necessarily break the assembly at the gaps. If there is good long-read support over a gap, then it will leave it intact. So, if you are hoping to fill in those gaps, you wouldn't be able to solely rely on Tigmint breaking the sequences there - it may break at some gaps, but this is more a side-effect of the gap length vs. it being designed to target those regions.
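
To make the notion of "long-read support over a gap" concrete, here is a small Python illustration. It is not Tigmint's actual implementation; it only assumes that a read supports a region when its alignment extends at least `flank` bases past both ends of that region. The `flank` value, the tuple layout, and the function name are hypothetical.

```python
def reads_spanning(alignments, gap_start, gap_end, flank=1000):
    """Count alignments (start, end) that cover the gap plus a flank on each side."""
    return sum(
        1
        for start, end in alignments
        if start <= gap_start - flank and end >= gap_end + flank
    )


# Hypothetical alignments on one chromosome, as (start, end) coordinates.
alignments = [(0, 5_000), (2_000, 120_000), (500, 150_000)]

# A 100 kbp gap from 10,000 to 110,000: only the two very long reads span it.
print(reads_spanning(alignments, 10_000, 110_000))
# A 1 kbp gap from 2,500 to 3,500: short reads can span it as well.
print(reads_spanning(alignments, 2_500, 3_500))
```

A region with no (or too few) spanning reads is the kind of place where a cut would be expected under this simplified view, which is why very long gaps can be cut even when the assembly there is correct.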

I hope that makes sense - let me know if you have any follow-up questions! Lauren

gforg34 commented 6 months ago

Hi @lcoombe ,

Thanks again for your reply and feedback on this. So in my case, with 9x coverage of HI-C long-read data and a large (10 Gbp), repetitive genome, we can say that there is not enough support to cover big gaps of over 100 kbp or so, right? Do you know what an optimal situation for LongStitch would be in similar cases (large genome, low-coverage HI-C), or more generally the best setup that you have tried and that gave a major improvement to the genome? This would give me an idea for potential future applications. Thanks.

lcoombe commented 6 months ago

Hi @gforg34,

Whether there would be support or not would really depend on the length distribution of the reads. I would say it would be unlikely unless you are using ultra-long reads, and even then, you'd need to have good mappings of the reads on either end of the gap, plus multiple very long reads that would support that join.

We have tried the longstitch approach on 20 Gbp genomes, so we do know that it can work well on larger genomes, although in that case we used higher coverage (>30X) data. That was for the more typical approach - that is, correcting and contiguating a draft assembly to achieve a higher contiguity assembly. Generally, that would be the application that we'd recommend longstitch for!
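
As a rough back-of-the-envelope check of how read length affects this, the expected number of reads spanning a given gap can be estimated from the read-length distribution. The sketch below assumes reads are placed uniformly at random along the genome and ignores mappability and repeats, so it is optimistic; the function name and the `flank` parameter are hypothetical.

```python
def expected_spanning_reads(read_lengths, genome_size, gap_length, flank=1000):
    """Expected number of reads covering one specific gap plus flanks.

    A read of length L can start at roughly (L - needed) positions and still
    cover a window of `needed` bases, out of about genome_size start positions.
    """
    needed = gap_length + 2 * flank
    usable = sum(max(0, length - needed) for length in read_lengths)
    return usable / genome_size


# Hypothetical toy numbers: a 1 Mbp genome and a 100 kbp gap.
short_reads = [20_000] * 450      # 9x coverage with 20 kbp reads
long_reads = [150_000] * 60       # 9x coverage with 150 kbp reads

print(expected_spanning_reads(short_reads, 1_000_000, 100_000))
print(expected_spanning_reads(long_reads, 1_000_000, 100_000))
```

At the same coverage, reads shorter than the gap contribute nothing, which matches the point that only very long reads could support a join across a gap of that size.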

gforg34 commented 5 months ago

Dear @lcoombe,

Sorry for the late message. Thanks for your reply and support; I will take this advice into account for future use of LongStitch, since apparently it is not the way to go for now. Thanks again! I will close this issue.

lcoombe commented 5 months ago

You're welcome, @gforg34! That's good feedback to know for us that a more native long-read gap-filler would be helpful for you - I'll think about how we could achieve that in a future release!