scaffolding with paired end short reads

mesamuels commented 2 years ago

What is happening at the scaffolding step, when the only input are two paired-end fastq Illumina files? According to the publication on Abyss v2, the information about paired ends of the same clusters is used in going from unitigs to contigs, and going to scaffolds requires additional data such as mate pair or chromosome mapping. I don't have such information, but the scaffold files (xyz-8.fa,dot,path) are slightly different than the contig files (xyz-6.fa,dot,path).

According to FastP and FastQC, most of my individual paired end reads overlap at the 3' ends, as the library fragment average insert was 218 bp and I did 2x150 reads. But 23% of the read pairs did not overlap. Is that the extra information at the scaffolding stage, as from the publication it seems like that was already used at the contig stage?

Just to note, the assemblies seem good, in that many of the unique genes/gene segments I'm testing occur in exactly two assembled contigs/scaffolds, with differences as expected for the two subgenomes. I tried using the RNA transcript information (.fna file) from quinoa to make 'long-scaffolds' and that did significantly increase the lengths. However I don't know if I can trust it, it would seem that exons might be co-scaffolded that came from the two different subgenomes, if there is no actual overlap sequence between them (or no unique sequence). Am I understanding correctly how the RNA information is used?

Thanks! Mark

vlad0x00 commented 2 years ago

Is that the extra information at the scaffolding stage, as from the publication it seems like that was already used at the contig stage?

Yes, that's the extra information. At the contig stage, read pairs are aligned to the output of the previous stage -- unitigs. If two overlapping unitigs have sufficient number of pairs aligned to both of them, they can be joined together into a contig. At the scaffolding stage, we don't necessarily need the contigs to overlap as we can insert a run of N letters instead between them based on the estimated distance.

However I don't know if I can trust it, it would seem that exons might be co-scaffolded that came from the two different subgenomes, if there is no actual overlap sequence between them (or no unique sequence). Am I understanding correctly how the RNA information is used?

I will look into this.

vlad0x00 commented 2 years ago

However I don't know if I can trust it, it would seem that exons might be co-scaffolded that came from the two different subgenomes, if there is no actual overlap sequence between them (or no unique sequence).

You're right that this is a concern. We've noticed increased misassemblies after scaffolding with RNA sequences and so in the past we used a misassembly correction tool afterwards (https://github.com/bcgsc/tigmint). This tool however requires either linked or long reads.

mesamuels commented 2 years ago

Thanks Vlad, for confirming my thinking about the scaffolding. It is a useful step, as FastP noted that in 23% of the individual read-pairs, there was no sequence overlap so the inserts were bigger than 300bp. It wasn't clear if that extra information was included already in the contig step, but since not, the scaffolding is useful and the results make sense.

FYI, I have checked a number of genes now for my assembly, using the annotations of the quinoa genome as a reference since my species is very close. As desired, for good pseudo-unique quinoa genes, there are hits on two contigs in my Abyss assembly. This is expected as the species is also tetraploid formed by hybridization of two similar diploid species. So every gene should have two copies. In an assembly that combined the two subgenomes incorrectly, there would only be one contig hit, indicating a problem. In an assembly that was too fragmented, there would be many hits. That said, some genes appear to be split across multiple contigs, probably because there are repeat sequences in some introns that Abyss cannot assemble across. But individual exons themselves appear well assembled.

This is about as good as I would expect to get, given the lack of any larger mate-pairs or long reads. The nice thing is, if I get more funding I can just do some PacBio sequencing on the same DNA sample and combine that information in for an improved assembly. So this was all extremely useful as a learning experience for me.

Just BTW, I was able to get Platanus running, a different assembler. It appears to give about the same results as Abyss, at least with this data set. Exactly how well the two assemblies compare for accuracy is hard to assess given the lack of long reads. But it's very nice to have confirmation, the two assemblies validate each other!

Best wishes, Mark

bcgsc / abyss

scaffolding with paired end short reads #449