Open dcopetti opened 7 years ago
Hi Dario, "Which makes sense - the dropped haplotigs have very low coverage. But why were they assembled, then? if they are just spurious reads, how did they even make it through the assembly (run with min_cov=2 or 4)?"
It's possible that these haplotigs represent artifacts or misassemblies, possibly driven by repeats, but it's not something we have looked into in detail at this point. For the example you gave, the 5 removed haplotigs for 000198F are shorter than the 3 retained ones. It would be interesting to see how all of the haplotigs align to the primary. You could try my bash script, which uses samtools and mummer to generate alignments of all haplotigs to their primary contig and then plot them in Assemblytics. My hunch is that some of the shorter haplotigs are nested within 000198F_001.
Regarding min_cov, this can refer to the preassembly process (falcon_sense_option), where it determines the depth of raw read coverage on seed reads required to call consensus versus split the seed read. But "min_cov" is also used in layout filtering, where it refers to the number of overlaps between preads. So neither of these parameters really captures raw read depth on contigs. It's also worth noting that the raw read coverage you are looking at comes from the polishing process, which is totally distinct from the process used by FALCON-Unzip, but it is still useful for assessing raw support in the assembly.
"how about instead trim the regions with lowercase bases? In this last contig, the ~25 kb of well covered sequence may be a real (allelic) contig. Or, did you try seeing if these low coverage regions are "resurrected" to better quality after aligning Illumina reads and running Pilon? (I will try that now)"
I would be concerned about trimming LC bases in the middle of contigs. Also, it is normal for raw read coverage to drop at the ends of contigs, so if you remove LC sequence at contig ends, I would worry that you would lose a lot of sequence. If you polish with Pilon, be sure to use a "random best" mapping strategy to avoid multiply mapped reads. I would be interested in your results.
If you are concerned about overly aggressive filtering, you could map transcripts to the contigs and only remove those that both have low coverage AND no genes.
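As a rough illustration of that "low coverage AND no genes" rule, here is a minimal Python sketch. It assumes you already have a per-contig lowercase fraction (used here as a stand-in for low coverage) and a set of contig IDs with at least one mapped transcript. The function name, the contig names, the values, and the 0.5 cutoff are all hypothetical and are not taken from the filtering script.

```python
def contigs_to_remove(lowercase_fraction, contigs_with_genes, max_lowercase=0.5):
    """Return contig IDs that are both low coverage and gene-free.

    lowercase_fraction: dict of contig ID -> fraction of lowercase bases
                        (assumed proxy for low raw read coverage).
    contigs_with_genes: set of contig IDs with at least one mapped transcript.
    max_lowercase:      hypothetical cutoff; contigs above it count as low coverage.
    """
    return sorted(
        name
        for name, frac in lowercase_fraction.items()
        if frac > max_lowercase and name not in contigs_with_genes
    )


# Made-up values for illustration only.
lowercase_fraction = {"ctg_A": 0.9, "ctg_B": 0.8, "ctg_C": 0.1}
contigs_with_genes = {"ctg_B"}

to_remove = contigs_to_remove(lowercase_fraction, contigs_with_genes)
print(to_remove)
```

In this toy input, ctg_B is kept despite its high lowercase fraction because a transcript maps to it, and ctg_C is kept because it is well covered.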
Sarah
Thanks Sarah, I ran contig 198 and got a mixed result: as you thought, haplotigs 2, 3 and 6 are contained in 1. But 4 and 8, which will be discarded because of the lowercase/coverage issue, are on a different region of the primary contig. To try to save them, we could edit the lowercase fraction to keep, but that would save only haplotig 8; haplotig 4 has coverage around 4-5 along its whole length. So unless we manually inspect all the contigs, I guess there is no straightforward way to cleanly separate real haplotigs from artifacts.
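The "contained in" check can be sketched in a few lines of Python, assuming each haplotig has already been reduced to a single (start, end) alignment span on the primary contig, for example taken from the mummer alignments. The names and coordinates below are made up for illustration and are not the real values for contig 198.

```python
def nested_in(intervals, container):
    """Return names of haplotigs whose span lies fully inside `container`'s span.

    intervals: dict of haplotig name -> (start, end) on the primary contig.
    """
    c_start, c_end = intervals[container]
    return sorted(
        name
        for name, (start, end) in intervals.items()
        if name != container and c_start <= start and end <= c_end
    )


# Hypothetical coordinates on the primary contig, for illustration only.
intervals = {
    "hap_1": (1000, 90000),
    "hap_2": (5000, 20000),
    "hap_3": (30000, 45000),
    "hap_4": (120000, 150000),
}

nested = nested_in(intervals, "hap_1")
print(nested)
```

A haplotig with several separate alignment blocks would need more care than this single-span simplification.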
I agree on not trimming lowercase regions inside a contig, and after Pilon I see that the fraction of lowercase bases in all contigs is very low (mostly below 1%, sometimes up to 9%). But I don't know how Pilon assigns ATGC vs atgc.
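For reference, a minimal sketch of how the per-contig lowercase fraction can be counted from a FASTA file in plain Python. It only counts characters and makes no assumption about how Pilon or any other tool decides on case; the function name and the tiny in-memory FASTA are hypothetical.

```python
import io


def lowercase_fractions(handle):
    """Map each FASTA record ID to its fraction of lowercase bases."""
    counts = {}  # record ID -> [lowercase bases, total bases]
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            counts[current] = [0, 0]
        elif current is not None:
            counts[current][0] += sum(1 for base in line if base.islower())
            counts[current][1] += len(line)
    return {
        name: (low / total if total else 0.0)
        for name, (low, total) in counts.items()
    }


# Tiny made-up FASTA, for illustration only.
example = io.StringIO(">ctg_A\nACGTacgt\nACGT\n>ctg_B\nacgtac\n")
fractions = lowercase_fractions(example)
for name, frac in fractions.items():
    print(name, round(frac, 3))
```

For a real assembly, the same function can be called on an open file handle instead of the `io.StringIO` object.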
Lastly, keeping a contig just because it has a gene does not seem like a good criterion to me: it may create "fake paralogs" with deep biological implications. Unless that gene is unique in the assembly, probably.
If biology was perfect, it would be so boring....
Hello, I am using this script on an Unzipped assembly (1.3 Gb in total) with default settings (50% of bases to be out). The resulting fasta has far fewer sequences (5449 vs 7905) but only 65 Mb less sequence (1295 Mb now). I would be fine with losing so little data, but I see that in some cases the sequences to be dropped are the majority of the haplotigs of a primary contig and have very low coverage. This is an example:
Haplotigs I would keep:
Haplotigs to remove:
Which makes sense - the dropped haplotigs have very low coverage. But why were they assembled, then? if they are just spurious reads, how did they even make it through the assembly (run with min_cov=2 or 4)?
At the same time, I would remove 171 primary contigs, which usually have a high ctg ID number and, again, low coverage:
or better this:
how about instead trim the regions with lowercase bases? In this last contig, the ~25 kb of well covered sequence may be a real (allelic) contig. Or, did you try seeing if these low coverage regions are "resurrected" to better quality after aligning Illumina reads and running Pilon? (I will try that now) Thanks, Dario