bcgsc / LongStitch

Correct and scaffold assemblies using long reads
GNU General Public License v3.0

Longstitch with PacBio #64

Closed francicco closed 1 year ago

francicco commented 1 year ago

Hi,

I'm testing LongStitch with PacBio data. This is how I run the analysis:

longstitch tigmint-ntLink-arks draft=DraftGenome reads=m64147e_230220_093703.hifi_reads G=200000000 w=150 k_ntLink=24 longmap=hifi

At some point ARCS says: =>Reading Chromium FASTQ file(s)... Is this normal, or do I have to specify other options?

It also seems to find barcodes:

=>Preprocessing: Gathering barcode multiplicity information...Wed Mar  8 13:33:24 2023
Saw 2113237  distinct barcodes.

which is a bit odd...

Any help? Thanks a lot F

lcoombe commented 1 year ago

Hi @francicco,

The arks-long (and tigmint-long) steps in the LongStitch pipeline work by preprocessing the long reads to generate 'pseudo-linked reads'. You'll see this step called long-to-linked-pe - it allows us to utilize our Tigmint and ARCS tools which were originally developed to use linked reads to also use long reads. The pseudo-linked reads are then mapped to the draft assembly and the ARKS step proceeds as normal. Hence, you'll see those log messages.
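To make the idea of pseudo-linked reads concrete, here is a toy sketch. It is not the actual long-to-linked-pe implementation: the function name, the 500 bp fragment length and the naming scheme are assumptions for illustration only. It shows that if every fragment cut from one long read carries that read's identifier as its barcode, the number of distinct barcodes follows the number of long reads, which is presumably why ARCS reports millions of barcodes for a long-read dataset.

```python
from typing import Dict, Iterator, Tuple


def pseudo_linked_fragments(
    read_id: str, seq: str, frag_len: int = 500
) -> Iterator[Tuple[str, str, str]]:
    """Yield (fragment_name, barcode, fragment_sequence) for one long read.

    Assumption: all fragments from the same long read share one barcode,
    mimicking how linked reads from one molecule share a barcode.
    Trailing sequence shorter than frag_len is dropped in this sketch.
    """
    for i, start in enumerate(range(0, len(seq) - frag_len + 1, frag_len)):
        yield f"{read_id}_f{i}", read_id, seq[start:start + frag_len]


# Two made-up long reads (1200 bp and 1000 bp).
reads: Dict[str, str] = {"readA": "ACGT" * 300, "readB": "TTGCA" * 200}

fragments = [
    frag
    for read_id, seq in reads.items()
    for frag in pseudo_linked_fragments(read_id, seq)
]
distinct_barcodes = {barcode for _, barcode, _ in fragments}
print(len(fragments), "fragments,", len(distinct_barcodes), "distinct barcodes")
```

In this sketch the barcode count equals the number of input reads, regardless of how many fragments each read produces.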

Let me know if you have any other questions - thank you for your interest in LongStitch! Lauren

francicco commented 1 year ago

Hi @lcoombe,

Thank you very much. That makes sense.

A follow-up question. With the command I sent you, the final assembly seems to have better contiguity, but the BUSCO score drops significantly: it goes from 96.8% complete single-copy / 2.6% missing to 82.5% and 17%, respectively. Clearly something is not working as it should. Any idea why?

Thanks a lot Francesco

lcoombe commented 1 year ago

Hi @francicco,

It would be helpful to look at the statistics at each stage of LongStitch (e.g. post-Tigmint, post-ntLink, post-ARKS). That could help pinpoint where the BUSCO drop is happening. Sometimes Tigmint can over-cut the assembly, for example.
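One way to compare the stages is to compute the same basic statistics on the FASTA file written after each step. The following is a minimal sketch in plain Python, not a LongStitch utility; the function names are hypothetical, and the intermediate file names depend on your draft name and parameters, so none are assumed here. A sharp rise in sequence count with a drop in N50 right after the Tigmint step would be consistent with over-cutting.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA records from an iterable of lines into {name: sequence}."""
    seqs: Dict[str, str] = {}
    name = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if name is not None:
                seqs[name] = "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line:
            chunks.append(line)
    if name is not None:
        seqs[name] = "".join(chunks)
    return seqs


def n50(lengths: List[int]) -> int:
    """Length L such that sequences of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def assembly_stats(seqs: Dict[str, str]) -> Dict[str, int]:
    """Sequence count, total length, N50 and number of N bases (gaps)."""
    lengths = [len(s) for s in seqs.values()]
    return {
        "sequences": len(lengths),
        "total_length": sum(lengths),
        "n50": n50(lengths),
        "n_bases": sum(s.upper().count("N") for s in seqs.values()),
    }


# Tiny in-memory example; for real data, pass an open file handle instead.
example = [">a", "ACGTNN", "ACGT", ">b", "AC", ">c", "ACGT"]
stats = assembly_stats(parse_fasta(example))
print(stats)
```

For a real run, call `assembly_stats(parse_fasta(handle))` on each stage's FASTA and compare the resulting dictionaries side by side, ideally next to a BUSCO run on the same files.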

francicco commented 1 year ago

Yes, that was my plan, my feeling was exactly that.

What would be the command to just do the scaffolding? Also, is there any gap-filling procedure implemented in LongStitch already?

Thanks a lot F

benyoung93 commented 1 year ago

Good morning :)

I had a couple of queries linked to this thread, so I thought I would post them here rather than open a new issue. Disclaimer: I am also new to genome assembly, so I apologise for any stupid/noob queries.

@lcoombe Firstly, this tool looks fantastic. I have a nice clean HiFi de novo assembly for my non-model organism (N50 ~30 million, L50 8, length ~500 million, which is what is expected). I am at the scaffolding stage, but resources for my organism are limited (I'm currently seeing if I can swing some Hi-C sequencing). Could I theoretically use your tool to take my assembly and then input the original HiFi reads to try to do the scaffolding? I have been googling this and am not sure whether this is a big no-no or an acceptable approach.

@francicco How was your input draft assembly assembled? Is it the one assembled from HiFi reads or an older version (kind of linking this back to the query above)? I do have a really, really, really bad old reference genome for the species I work on, so I may try to see if I can use my HiFi reads to improve that one.

Thank you for the help in advance, I appreciate any and all comments :).

Ben

francicco commented 1 year ago

Hi @benyoung93,

Yes, HiFi data assembled with hifiasm. F

lcoombe commented 1 year ago

Hi @francicco

What would be the command to just do the scaffolding?

To only run the scaffolding (without Tigmint misassembly correction), you can just replace your target (currently tigmint-ntLink-arks) with ntLink-arks

Also, is there any gap-filling procedure implemented in LongStitch already?

The ntLink scaffolding step can perform gap-filling - you can turn this on with gap_fill=True. Just make sure that you are using LongStitch v1.0.3+ and ntLink v1.2.0+. A couple of notes about the gap-filling feature - this will only attempt to fill gaps that ntLink itself creates through scaffolding. Also, it will currently fill the gaps with raw read sequence, so you may consider downstream polishing, although it's less of a concern with your accurate HiFi reads.
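The two suggestions above (switching the target to ntLink-arks and adding gap_fill=True) can be sketched as a small command builder. This is only an illustration: `longstitch_command` is a hypothetical helper, the parameter values are copied from the command earlier in this thread, and the snippet prints the command line rather than running the pipeline.

```python
import shlex
from typing import Dict, List


def longstitch_command(target: str, params: Dict[str, str]) -> List[str]:
    """Build a LongStitch invocation as an argument list (hypothetical helper)."""
    return ["longstitch", target] + [f"{key}={value}" for key, value in params.items()]


# Parameters copied from the original command posted in this thread.
params = {
    "draft": "DraftGenome",
    "reads": "m64147e_230220_093703.hifi_reads",
    "G": "200000000",
    "w": "150",
    "k_ntLink": "24",
    "longmap": "hifi",
}

# Scaffolding only (no Tigmint correction), with ntLink gap-filling enabled.
scaffold_only = longstitch_command("ntLink-arks", {**params, "gap_fill": "True"})
print(shlex.join(scaffold_only))
```

Keeping the parameters in one dictionary makes it easy to rerun with the original `tigmint-ntLink-arks` target and compare the two outputs under otherwise identical settings.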

lcoombe commented 1 year ago

Hi @benyoung93,

Thank you for your kind words!

Could i theoretically use your tool to take my assembly, and then input the original hifi reads to try and do the scaffolding?

Yes, this is indeed one of the most common uses of LongStitch - to improve upon a long read assembly with the same long reads! We have an example in our LongStitch paper, where we assemble human long reads using Shasta and improve upon that Shasta assembly with the same long reads.

Thank you for your interest in LongStitch - and feel free to open new issues if you have any questions! I'm always happy to help, and enjoy hearing from our users. Lauren

benyoung93 commented 1 year ago

I am currently reading the paper lol. Super interesting stuff.

I was actually just trying to use ntLink, and then I realised that it is part of the LongStitch pipeline, so I am going to have a look at the difference in outputs between ntLink alone, LongStitch with ntLink, and LongStitch with ntLink + ARKS.

I was having some issues with the old conda installation, but I see from another issue that it is a Python version problem, so hopefully that will fix everything.

Thank you so much for the quick reply, I really appreciate it :).

Ben

lcoombe commented 1 year ago

@benyoung93 - Awesome!! Yes, don't hesitate if you have any questions about the different options for using our tools or if you have lingering installation issues!

francicco commented 1 year ago

Hi @lcoombe,

I tried without tigmint, the results are very good actually:

Parsing DraftGenome.k32.w100.ntLink-arks.longstitch-scaffolds.fa


N50: 9,174,343
N90: 2,216,306
Number of contigs  : 68
Longest contig     : 16713kb
Genome Length (GL) : 130321118
GL without N/X     : 128812966

I wonder if I should skip the Tigmint part.

What would you do? Cheers F

lcoombe commented 1 year ago

Ok great!

It really depends on the situation. If you are finding that running Tigmint is in the end detrimental to the assembly (i.e. it's cutting too much), then I think it's just fine to keep it out of the equation. The downside is that you will not cut at any putative misassemblies, but in this case, that may be OK given that you have a high % complete BUSCO in your baseline assembly.

You could also specify more stringent parameters (e.g. span=2) for the Tigmint step, which should reduce the number of cuts, but I know that playing around with these things takes time, which you might want to avoid.

francicco commented 1 year ago

Thanks a lot! I'll give it a few tries! Cheers F
