bcgsc / LINKS

⛓ Long Interval Nucleotide K-mer Scaffolder
GNU General Public License v3.0

AGP file creation: ARCS gaps mixed with Pilon gaps #41

Closed (shameem356 closed this issue 4 years ago)

shameem356 commented 4 years ago

Hello @sjackman @warrenlr,

First of all, I want to say thank you for the Tigmint+LINKS pipeline. Using it, we made an amazing de novo plant genome assembly. Now we are planning to submit the genome to NCBI, but we are having a problem with the AGP file. We tried to create the AGP file using the abyss-fatoagp program, and it could not produce a correct AGP file. The reason is that we used Pilon in gap-insert mode (10 bp) to polish the base assembly (PacBio) before the ARCS pipeline. The base assembly was then run through Tigmint and ARCS with four sets of 10X data.

The final output has 10 bp runs of 'N' from ARCS as well as from Pilon. This messed up the abyss-fatoagp output. Could you please advise how to solve this issue? Here is the process we followed, in detail:

1. Created the base assembly with PacBio reads.
2. Ran Quiver and Pilon to correct and polish the genome. Pilon was run in gap-insert mode and inserted 10 bp gaps.
3. Generated four sets of 10X data: (a) leaf_plug, (b) leaf, (c) shoot, (d) shoot_plug.
4. Ran Tigmint with the four sets of 10X data recursively (leaf_plug -> leaf -> shoot -> shoot_plug).
5. Ran ARCS with the four sets of 10X data recursively (leaf_plug -> leaf -> shoot -> shoot_plug).

Now I want to create an AGP file for the assembly.

lcoombe commented 4 years ago

Hi @shameem356,

Is your issue that you want to annotate the gaps from ARCS/Pilon differently for submission? I use abyss-fatoagp for NCBI submissions of assemblies with multiple assembly steps and haven't had any issues.

Can you explain more how the AGP file is 'messed up'?

Lauren

shameem356 commented 4 years ago

Hello Lauren,

Sure, I can explain how the AGP file is 'messed up'.

The base assembly (PacBio) contains gaps that were inserted by Pilon, and these gaps are 10 bp long. The Pilon-inserted gaps do not join contigs. The default gap size in ARCS is also 10 bp (these gaps join contigs). I want to make an AGP file based only on the gaps inserted by ARCS, ignoring the Pilon gaps. In my case, abyss-fatoagp builds the AGP file from all the gaps in the genome (10 bp runs of N), which come from both ARCS and Pilon. The output AGP file reports 63k contigs. The funny thing is that the base assembly had only 22k contigs before ARCS, so ideally the AGP file should contain information for fewer than 22k contigs.
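The inflated contig count described above can be reproduced on a toy sequence. The following is a minimal sketch, assuming that an AGP generator treats every run of N as a gap regardless of which tool inserted it; the function names are hypothetical and this is not the abyss-fatoagp implementation.

```python
import re
from collections import Counter


def gap_length_histogram(seq):
    """Count runs of N in a sequence, keyed by run length."""
    return Counter(len(m.group()) for m in re.finditer(r"[Nn]+", seq))


def count_components(seq):
    """Number of non-gap pieces obtained by splitting at every run of N."""
    return sum(1 for piece in re.split(r"[Nn]+", seq) if piece)


# Toy data: two polished contigs, each carrying one 10 bp Pilon gap,
# joined by one 10 bp ARCS gap.
contig_a = "ACGT" + "N" * 10 + "GGCC"
contig_b = "TTAA" + "N" * 10 + "CCGG"
scaffold = contig_a + "N" * 10 + contig_b

print(gap_length_histogram(scaffold))  # all three gaps have the same length
print(count_components(scaffold))      # pieces a gap-based splitter would report
```

Because all three gaps are 10 bp, a length-based filter cannot separate them, and the two original contigs are reported as four components.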

shameem356 commented 4 years ago

I want to annotate only the gaps from ARCS.

lcoombe commented 4 years ago

What version of ARCS did you use? As of v1.0.6, the default gap size was 100bp.

shameem356 commented 4 years ago

I used ARCS 1.0.3.

lcoombe commented 4 years ago

Ah ok. Well, I haven't used it myself, but have you tried the scaffoldsTOAGP2.pl script in this repo? It takes the .scaffolds file from the LINKS step of the ARCS pipeline as input.

shameem356 commented 4 years ago

Well, I tried scaffoldsTOAGP2.pl, but it could not solve the problem completely. As I mentioned before, we have four sets of 10X data and ran ARCS recursively, one set after another. In each ARCS iteration, contigs were joined to form scaffolds, so the final AGP file should contain information for all the contigs that were joined to form each scaffold. Ideally, all four ARCS runs have to be traced back to find how each scaffold was formed. I was able to map all the contigs to each scaffold, but could not order them properly.
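The trace-back through successive scaffolding rounds can be sketched as a recursive expansion. This is a minimal sketch under stated assumptions: each round's layout has already been parsed into a dictionary mapping a scaffold name to its ordered `(component, orientation)` pairs, sequence names are unique across rounds, and a sequence that was not joined in a round keeps its name. The data structure and the names `r1_s1`, `r2_s1`, `c1` and so on are hypothetical and do not reflect the actual `.scaffolds` file format.

```python
def flip(orient):
    """Return the opposite orientation."""
    return "-" if orient == "+" else "+"


def expand(name, orient, rounds):
    """Expand a sequence into its ordered, oriented original contigs.

    rounds: list of dicts, oldest round first. Each dict maps a scaffold
    name to an ordered list of (component_name, orientation) pairs, where
    the components are sequence names from the previous round.
    """
    if not rounds:
        return [(name, orient)]
    earlier = rounds[:-1]
    layout = rounds[-1].get(name)
    if layout is None:
        # Assumption: not joined in this round, so the name is unchanged.
        return expand(name, orient, earlier)
    if orient == "-":
        # A reversed scaffold lists its components in reverse order,
        # each with its orientation flipped.
        layout = [(comp, flip(o)) for comp, o in reversed(layout)]
    result = []
    for comp, o in layout:
        result.extend(expand(comp, o, earlier))
    return result


# Toy example with two rounds.
round1 = {"r1_s1": [("c1", "+"), ("c2", "-")]}
round2 = {"r2_s1": [("c3", "+"), ("r1_s1", "-")]}
print(expand("r2_s1", "+", [round1, round2]))
```

The key step for ordering is the reversal: when a scaffold from an earlier round is placed in the minus orientation, both the order of its components and each component's strand must be inverted before expanding further.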

lcoombe commented 4 years ago

Ah yes I see. Unfortunately I don't have a great suggestion for how to distinguish the pilon gaps and the ARCS gaps since they are of identical size.

I suppose you could try extracting the gap + flank sequences from the post-Pilon assembly, and use those sequences to scan through your assembly. If the gap+flank matches that set, change the gap to a distinct size. You could edit abyss-fatoagp to only recognize gaps that are not that size?

As an alternative (I know you probably wouldn't want to do this), you could re-do the ARCS runs using a different gap size that is distinct from the Pilon gaps.

Sorry I don't have any easier solutions! It is difficult to go backwards to distinguish when different gaps were created.
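The gap + flank idea above can be sketched as follows. This is a rough illustration, not a tested tool: it assumes sequences are already loaded as plain strings, that flanks of 20 bp are long enough to be unique, and that 11 bp is an acceptable "distinct" size for the relabelled gaps. The function names and both parameter values are hypothetical.

```python
import re

GAP = re.compile(r"[Nn]+")
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of an uppercase ACGTN string."""
    return seq.translate(COMPLEMENT)[::-1]


def gap_flank_keys(seq, flank=20):
    """Yield (start, end, (left_flank, right_flank)) for each run of N."""
    for m in GAP.finditer(seq):
        left = seq[max(0, m.start() - flank):m.start()].upper()
        right = seq[m.end():m.end() + flank].upper()
        yield m.start(), m.end(), (left, right)


def collect_pilon_keys(post_pilon_seqs, flank=20):
    """Collect flank pairs around every gap in the post-Pilon assembly."""
    keys = set()
    for seq in post_pilon_seqs:
        for _, _, (left, right) in gap_flank_keys(seq, flank):
            keys.add((left, right))
            # A contig may sit reverse-complemented inside a scaffold.
            keys.add((revcomp(right), revcomp(left)))
    return keys


def relabel_gaps(scaffold, pilon_keys, new_len=11, flank=20):
    """Resize gaps whose flanks match a post-Pilon gap; keep the others."""
    out, prev = [], 0
    for start, end, key in gap_flank_keys(scaffold, flank):
        out.append(scaffold[prev:start])
        out.append("N" * new_len if key in pilon_keys else scaffold[start:end])
        prev = end
    out.append(scaffold[prev:])
    return "".join(out)


# Toy example: one contig with a Pilon gap, joined to another by an ARCS gap.
contig = "ACGTACGTAC" + "N" * 10 + "GGATCCGGAT"
keys = collect_pilon_keys([contig], flank=5)
scaffold = contig + "N" * 10 + "TTTTTCCCCC"
relabelled = relabel_gaps(scaffold, keys, new_len=11, flank=5)
print([len(m.group()) for m in GAP.finditer(relabelled)])
```

Gaps that sit within one flank length of a contig end, or next to a position where Tigmint cut the contig, will not match and would need separate handling.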

sjackman commented 4 years ago

Note that NCBI GenBank now accepts assemblies with gaps. See Gapped Format for Genome Submissions: https://www.ncbi.nlm.nih.gov/genbank/wgs_gapped/