bcgsc / abyss

:microscope: Assemble large genomes using short reads
http://www.bcgsc.ca/platform/bioinfo/software/abyss

Using a reference genome #357

Closed desmodus1984 closed 3 years ago

desmodus1984 commented 3 years ago

Hi,

I would like to know how to configure ABySS to use a reference genome. I am having some trouble with MaSuRCA, so I wanted to try ABySS to see if the BUSCO score is higher. I have ~15X corrected ONT reads and ~65X 100 bp PE reads.

Thanks;

lcoombe commented 3 years ago

Hi @desmodus1984 ,

Just to make sure I understand - you would like to assemble your reads, then use a reference genome to further scaffold the assembly? ABySS is a de novo assembler, so it doesn't have any reference-guided assembly mode. What I would suggest is assembling your short reads with ABySS, then using ntJoin to further scaffold your assembly using the reference genome (ABySS -> ntJoin). If you want to use the ONT reads for scaffolding, you could use LINKS right after the ABySS step (ABySS -> LINKS -> ntJoin). We are working on a new pipeline in our group for correcting and scaffolding assemblies with long reads, so stay tuned for that.
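To make the suggested order concrete, here is a minimal Python sketch that only builds the three command lines (ABySS -> LINKS -> ntJoin) and prints them. Everything in it is an assumption for illustration: the file names, the `B=50G` Bloom filter budget, and the ntJoin `k`/`w` values are placeholders, and the flags are as recalled from each tool's README, so please check them against your installed versions before running anything.

```python
import shlex


def pipeline_commands(name, k, short_reads, long_reads_fof, reference):
    """Build (not run) the commands for ABySS -> LINKS -> ntJoin.

    All arguments are hypothetical placeholders; the flags are recalled
    from each tool's README and should be verified before use.
    """
    abyss_out = f"{name}-scaffolds.fa"          # assumed ABySS output name
    links_base = f"{name}_links"
    links_out = f"{links_base}.scaffolds.fa"    # assumed LINKS output name
    return [
        # 1. De novo short-read assembly (B= selects Bloom filter mode).
        ["abyss-pe", f"name={name}", f"k={k}", "B=50G",
         f"in={' '.join(short_reads)}"],
        # 2. Scaffold the ABySS assembly with the long (ONT) reads.
        ["LINKS", "-f", abyss_out, "-s", long_reads_fof, "-b", links_base],
        # 3. Reference-guided scaffolding of the LINKS output.
        ["ntJoin", "assemble", f"target={links_out}", "target_weight=1",
         f"references={reference}", "reference_weights=2", "k=32", "w=500"],
    ]


if __name__ == "__main__":
    cmds = pipeline_commands(
        "bat", 63, ["reads_1.fq", "reads_2.fq"], "ont_reads.fof", "reference.fa"
    )
    for cmd in cmds:
        print(shlex.join(cmd))
```

Each step consumes the previous step's output, which is why the order matters: the long reads scaffold the short-read assembly first, and the reference genome is only used in the final step.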

Hope that helps - thank you for your interest in ABySS! Lauren

desmodus1984 commented 3 years ago

Dear Lauren! THANK YOU VERY MUCH FOR POINTING TO LINKS! I have tried other software to scaffold the contigs using ONT reads, with not very good BUSCO results. Before going further, I wanted to point to a part of the ABySS site, which prompted my question; it mentions this:

abyss-map: map reads to a reference sequence

I have two "references". I am assembling a bat genome, and there is one "chromosome-scale" assembly, which has some extra scaffolds (92) compared to the karyotype (2n = 44), from a specimen of the same genus but a different species; and there is a fragmented "chromosome-scale assembly" (n = ~5000) from a female, while I am sequencing a male. Which one would you use as the reference?

Finally, I have tried to use wtdbg2, and after talking with my adviser and reading through the Issues section, there are a gazillion parameters to super-fine-tune an assembly. So, I would be extremely grateful if you could help me fine-tune my assembly and suggest how to optimize the run.

I am assembling a bat genome. I used KmerGenie to find the best k-mer size. I have short reads of 100 bp, and the optimal k-mer size from KmerGenie was 63; I ran MaSuRCA and its optimal k-mer size was 67. Which one would you trust the most?

There is a paper about bat genomes (https://www.nature.com/articles/s41586-020-2486-3#Abs1): the likely genome size is 2 Gb, while KmerGenie estimated 2.5 Gb and MaSuRCA found 2.3 Gb, and bat genomes likely have "low transposable element content". Also, as I have mentioned, I have long reads, which I have corrected with Ratatosk. Which ONT reads do you recommend I use, corrected or uncorrected? I ran ABySS with k=63: the total contig length was 2.4 Gb, the smallest contig had a length of 63, the mean was 336.3, and the max was 35,804. ABySS has several modes: de Bruijn graph, Bloom filter de Bruijn graph, and paired de Bruijn graph. I would like to know which one will generate the best results.

Sorry for bothering and overwhelming you with my analysis,

Thank you very much;

Juan Pablo

lcoombe commented 3 years ago

Hi Juan Pablo,

abyss-map: I think you're mixing up terminology here - I can see that the different meanings of the term 'reference' might be confusing. abyss-map is an aligner, which aligns a 'query' (in this case, PE or MPET reads) to a 'reference' (in the case of ABySS, the unitig or contig stage of the de novo assembly). It seems like you are confusing that with a 'reference genome'? This aligner is used in the contig and scaffold stages of ABySS to help the program find paths through the de Bruijn graph, so it doesn't use a 'reference genome' at any time, only the reads you specify and the intermediate short-read assembly generated by ABySS.

If you do want to use ntJoin as I suggested above, which is a reference-guided scaffolder, I would suggest using the assembly that is most closely related, and structurally similar to the genome you are assembling. It's hard for me to say because I don't know much about bat genomes, but the structural similarity is the most important consideration to make.

Unfortunately, I don't have a huge amount of experience in tinkering with the wtdbg2 assembler's parameters - you'd probably be best to ask the developers of that tool for guidance there.

For assembling your short reads with ABySS, I'd suggest doing a k-mer sweep (i.e. running ABySS with multiple k values around the results found by KmerGenie), and using the Bloom filter de Bruijn graph mode.
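A k-mer sweep along these lines could be sketched in Python as below. This is an illustration under stated assumptions, not a prescribed recipe: the step size, the number of k values, the `B=50G` Bloom filter budget, and the file names are all placeholders, and comparing runs by N50 is just one common contiguity metric (BUSCO completeness, mentioned earlier in the thread, is another). The `name=`, `k=`, `B=`, and `in=` parameters are as recalled from the ABySS README.

```python
import shlex


def sweep_commands(center_k, reads, step=4, n_each_side=2, bloom_mem="50G"):
    """Build one abyss-pe command per k value around center_k.

    Setting B= selects the Bloom filter de Bruijn graph mode; the memory
    budget and the sweep spacing here are illustrative assumptions.
    """
    cmds = []
    for i in range(-n_each_side, n_each_side + 1):
        k = center_k + i * step
        cmds.append([
            "abyss-pe",
            f"name=bat_k{k}",   # per-k name keeps output files separate
            f"k={k}",
            f"B={bloom_mem}",
            f"in={' '.join(reads)}",
        ])
    return cmds


def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    cover at least half of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


if __name__ == "__main__":
    for cmd in sweep_commands(63, ["reads_1.fq", "reads_2.fq"]):
        print(shlex.join(cmd))
```

With the defaults above and a center of 63, the sweep covers k = 55, 59, 63, 67, and 71, which brackets both the KmerGenie (63) and MaSuRCA (67) estimates while staying below the 100 bp read length.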

If you have questions about running LINKS or ntJoin, don't hesitate to ask questions in those repositories! Just note that we might not be as responsive as normal until January given that it's getting into the holiday season.

Hope that helps, Lauren
