PacificBiosciences / apps-scripts

Miscellaneous scripts for applications of PacBio systems
Other
25 stars 11 forks source link

losing lots of haplotigs #2

Open dcopetti opened 7 years ago

dcopetti commented 7 years ago

Hello, I am using this script on an Unzipped assembly (1.3 Gb in total) with default settings (50% of bases to be out). The resulting fasta has many less sequences (5449 vs 7905) and only 65 Mb less (1295 Mb now). I would be fine with this so few data less, but I see that in some cases the sequences to be dropped are the majority of haplotigs of a primary contig and have very low coverage. This is an example:

000198F .       region  1       100000  0.00    +       .       cov=2,30,45;cov2=30.526,7.517;gaps=0,0
000198F .       region  100001  200000  0.00    +       .       cov=7,40,58;cov2=35.509,15.567;gaps=0,0
000198F .       region  200001  300000  0.00    +       .       cov=23,39,63;cov2=39.233,7.091;gaps=0,0
000198F .       region  300001  400000  0.00    +       .       cov=19,34,51;cov2=34.301,7.204;gaps=0,0
000198F .       region  400001  500000  0.00    +       .       cov=22,32,50;cov2=33.162,5.386;gaps=0,0
000198F .       region  500001  600000  0.00    +       .       cov=20,30,47;cov2=30.353,5.867;gaps=0,0
000198F .       region  600001  700000  0.00    +       .       cov=27,45,62;cov2=44.970,6.577;gaps=0,0
000198F .       region  700001  800000  0.00    +       .       cov=23,36,60;cov2=36.940,6.550;gaps=0,0
000198F .       region  800001  900000  0.00    +       .       cov=29,52,72;cov2=50.448,9.843;gaps=0,0
000198F .       region  900001  1000000 0.00    +       .       cov=15,30,57;cov2=31.219,8.639;gaps=0,0
000198F .       region  1000001 1100000 0.00    +       .       cov=24,40,56;cov2=38.645,7.794;gaps=0,0
000198F .       region  1100001 1200000 0.00    +       .       cov=9,39,60;cov2=36.081,14.970;gaps=0,0
000198F .       region  1200001 1305962 0.00    +       .       cov=0,71,114;cov2=65.461,29.555;gaps=1,1415

Haplotigs I would keep:

000198F_001     .       region  1       100000  0.00    +       .       cov=0,37,60;cov2=37.897,11.139;gaps=1,218
000198F_001     .       region  100001  200000  0.00    +       .       cov=25,45,58;cov2=43.812,6.803;gaps=0,0
000198F_001     .       region  200001  377557  0.00    +       .       cov=1,35,52;cov2=32.771,11.745;gaps=0,0
000198F_005     .       region  1       10000   0.00    +       .       cov=0,5,13;cov2=5.601,4.808;gaps=1,1747
000198F_005     .       region  10001   20000   0.00    +       .       cov=12,18,23;cov2=17.811,3.420;gaps=0,0
000198F_005     .       region  20001   30000   0.00    +       .       cov=10,23,27;cov2=20.386,5.018;gaps=0,0
000198F_005     .       region  30001   40000   0.00    +       .       cov=1,2,10;cov2=3.544,2.454;gaps=0,0
000198F_005     .       region  40001   54133   0.00    +       .       cov=0,4,7;cov2=4.076,1.795;gaps=1,736
000198F_007     .       region  1       10000   0.00    +       .       cov=0,16,25;cov2=15.050,6.948;gaps=1,3
000198F_007     .       region  10001   20000   0.00    +       .       cov=24,38,44;cov2=36.142,5.420;gaps=0,0
000198F_007     .       region  20001   30000   0.00    +       .       cov=26,36,40;cov2=33.973,4.735;gaps=0,0
000198F_007     .       region  30001   44674   0.00    +       .       cov=1,12,31;cov2=15.004,9.265;gaps=0,0

Haplotigs to remove:

000198F_002     .       region  1       5000    0.00    +       .       cov=0,5,6;cov2=4.346,1.405;gaps=1,245
000198F_002     .       region  5001    10000   0.00    +       .       cov=4,7,9;cov2=6.964,1.279;gaps=0,0
000198F_002     .       region  10001   15000   0.00    +       .       cov=2,3,4;cov2=2.851,0.626;gaps=0,0
000198F_002     .       region  15001   23832   0.00    +       .       cov=0,1,2;cov2=1.151,0.501;gaps=1,543
000198F_003     .       region  1       10000   0.00    +       .       cov=0,23,33;cov2=19.876,9.618;gaps=1,89
000198F_003     .       region  10001   20000   0.00    +       .       cov=21,32,39;cov2=31.287,4.271;gaps=0,0
000198F_003     .       region  20001   30000   0.00    +       .       cov=2,4,21;cov2=7.073,5.898;gaps=0,0
000198F_003     .       region  30001   40596   0.00    +       .       cov=1,3,6;cov2=3.109,1.130;gaps=0,0
000198F_004     .       region  1       5000    0.00    +       .       cov=0,1,2;cov2=0.667,0.668;gaps=3,2225
000198F_004     .       region  5001    10000   0.00    +       .       cov=2,4,6;cov2=4.150,1.239;gaps=0,0
000198F_004     .       region  10001   15000   0.00    +       .       cov=5,6,7;cov2=6.286,0.455;gaps=0,0
000198F_004     .       region  15001   22055   0.00    +       .       cov=0,5,7;cov2=4.119,2.193;gaps=1,818
000198F_006     .       region  1       5000    0.00    +       .       cov=0,0,1;cov2=0.459,0.498;gaps=1,2707
000198F_006     .       region  5001    10000   0.00    +       .       cov=1,1,2;cov2=1.060,0.237;gaps=0,0
000198F_006     .       region  10001   15000   0.00    +       .       cov=2,3,5;cov2=2.937,0.773;gaps=0,0
000198F_006     .       region  15001   20148   0.00    +       .       cov=0,2,3;cov2=1.702,1.286;gaps=1,1680
000198F_008     .       region  1       5000    0.00    +       .       cov=0,10,15;cov2=10.068,2.809;gaps=1,72
000198F_008     .       region  5001    10000   0.00    +       .       cov=15,18,21;cov2=18.249,1.234;gaps=0,0
000198F_008     .       region  10001   15000   0.00    +       .       cov=7,12,17;cov2=11.872,2.082;gaps=0,0
000198F_008     .       region  15001   20000   0.00    +       .       cov=2,3,7;cov2=4.031,2.002;gaps=0,0
000198F_008     .       region  20001   29084   0.00    +       .       cov=0,1,4;cov2=0.790,1.003;gaps=3,4217

Which makes sense - the dropped haplotigs have very low coverage. But why were they assembled, then? if they are just spurious reads, how did they even make it through the assembly (run with min_cov=2 or 4)?

At the same time, I would remove 171 primary contigs, that usually have high ctg ID number and low coverage again:

001665F .       region  1       10000   0.00    +       .       cov=0,14,16;cov2=11.141,4.683;gaps=1,232
001665F .       region  10001   20000   0.00    +       .       cov=7,12,15;cov2=11.800,2.034;gaps=0,0
001665F .       region  20001   30000   0.00    +       .       cov=4,6,8;cov2=5.690,1.085;gaps=0,0
001665F .       region  30001   44399   0.00    +       .       cov=1,7,11;cov2=7.110,2.019;gaps=0,0

or better this:

001527F .       region  1       10000   0.00    +       .       cov=1,13,45;cov2=18.162,15.221;gaps=0,0
001527F .       region  10001   20000   0.00    +       .       cov=12,27,34;cov2=25.351,5.830;gaps=0,0
001527F .       region  20001   30000   0.00    +       .       cov=2,8,12;cov2=6.611,3.566;gaps=0,0
001527F .       region  30001   40000   0.00    +       .       cov=1,3,5;cov2=3.303,1.101;gaps=0,0
001527F .       region  40001   53128   0.00    +       .       cov=0,1,4;cov2=1.102,1.374;gaps=2,6331

how about instead trim the regions with lowercase bases? In this last contig, the ~25 kb of well covered sequence may be a real (allelic) contig. Or, did you try seeing if these low coverage regions are "resurrected" to better quality after aligning Illumina reads and running Pilon? (I will try that now) Thanks, Dario

skingan commented 7 years ago

Hi Dario, "Which makes sense - the dropped haplotigs have very low coverage. But why were they assembled, then? if they are just spurious reads, how did they even make it through the assembly (run with min_cov=2 or 4)?"

It's possible that these haplotigs represent artifacts or misassemblies, possibly driven by repeats, but its not something we have looked into in detail at this point. For the example you gave, the 5 removed haplotigs for 000198F are shorter than the retained 3. It would be interesting to see how all of the haplotigs align to the primary. You could try my bash script which uses samtools and mummer to generate alignments of all haplotigs to their primary contig and then plot in assemblytics. My hunch is some of the shorter haplotigs are nested within 000198F_001.

Regarding min_cov, this can refer to the preassembly process (falcon_sense_option) which determines the depth of raw read coverage on seed reads required to call consensus versus split the seed read. But "min_cov" is also used in layout filtering and refers to the number of overlaps between preads. So neither of these parameters really capture raw read depth on contigs. Its also worth noting that the raw read coverage you are looking at is from the polishing process which is totally distinct from the process used by FALCON-Unzip, but is still useful for assessing raw support in the assembly.

"how about instead trim the regions with lowercase bases? In this last contig, the ~25 kb of well covered sequence may be a real (allelic) contig. Or, did you try seeing if these low coverage regions are "resurrected" to better quality after aligning Illumina reads and running Pilon? (I will try that now)"

I would be concerned about trimming LC bases in the middle of contigs. Also, it is normal for raw read coverage to drop at the end of contigs so if removing LC sequence at contig ends, I would worry you would lose a lot of sequence. If you polish with Pilon, be sure to use a "random best" mapping strategy to avoid multiply mapped reads. Would be interested in your results.

If you are concerned about overly aggressive filtering, you could map transcripts to the contigs and only remove those that both have low coverage AND no genes.

Sarah

dcopetti commented 7 years ago

Thanks Sarah, I run contig 198 and I got a mixed result: like you thought, haplotigs 2, 3 and 6 are contained in 1. But 4 and 8 - that will be discarded for the lowercase/coverage issue - are on a different region of the primary contig. Trying to save them, we could edit the lowercase fraction to keep, but that will save only haplotig 8, #4 is all its length around coverage 4-5. So unless we manually inspect all the contigs, I guess there is not a straightforward way to split clearly real haplotigs from artifacts.

I agree on not trimming lowercase regions inside a contig, and after Pilon I see that the fraction of lowercase bases in all contigs is very low (mostly below 1%, sometimes up to 9%). But I don't know how Pilon assigns ATGC vs atgc.

Lastly, keeping a contig just because it has a gene does not seem a good criterion for me, it may create "fake paralogs" with deep biological implications. Unless that gene is unique in the assembly, probably.

If biology was perfect, it would be so boring....

image