dereneaton / ipyrad

Interactive assembly and analysis of RAD-seq data sets
http://ipyrad.readthedocs.io
GNU General Public License v3.0

Clustering improves when vsearch option -leftjust is disabled in ddrad data #193

Closed edgardomortiz closed 7 years ago

edgardomortiz commented 7 years ago

Roughly half of the loci in this dataset were dropped by the duplicates filter. Comparing how the clustering is done in step 3 and in step 6, the only major difference I could find was the use of the vsearch option -leftjust in cluster_within.py.

Here is a little comparison made by disabling that option and reanalyzing this dataset:

-leftjust enabled:

                            total_filters  applied_order  retained_loci
total_prefiltered_loci              56798              0          56798
filtered_by_rm_duplicates           24978          24978          31820
filtered_by_max_indels               1106            385          31435
filtered_by_max_snps                 3092            578          30857
filtered_by_max_shared_het              6              2          30855
filtered_by_min_sample              37889          22568           8287
filtered_by_max_alleles              8009           1552           6735
total_filtered_loci                  6735              0           6735

           sample_coverage         locus_coverage  sum_coverage         var  sum_var   pis  sum_pis
AS1-Aaspe             1618     1                0             0     0   171        0  1814        0
AS1-Bbart              829     2                0             0     1   260      260  1214     1214
AS2-Amatu              599     3                0             0     2   293      846   861     2936
AS2-Bgeni             1546     4             2369          2369     3   347     1887   649     4883
AS2-Btric              495     5             1351          3720     4   376     3391   515     6943
AS2-Fhyps             1556     6              795          4515     5   393     5356   415     9018
AS2-Halie              824     7              557          5072     6   400     7756   318    10926
AS2-Pquad              680     8              364          5436     7   355    10241   240    12606
DIP-Dbarc              779     9              246          5682     8   352    13057   170    13966
DIP-Dcall             1417     10             210          5892     9   325    15982   106    14920
DIP-Dempe              995     11             158          6050     10  292    18902   128    16200
DIP-Deric              521     12             143          6193     11  296    22158    92    17212
DIP-Dglan             1353     13             131          6324     12  292    25662    60    17932
DIP-Dgnid                7     14              78          6402     13  226    28600    49    18569
DIP-Dgood             1612     15              51          6453     14  250    32100    30    18989
DIP-Dgyno              258     16              41          6494     15  226    35490    37    19544
DIP-Dhaen              551     17              53          6547     16  184    38434    11    19720
DIP-Dhart              260     18              37          6584     17  189    41647     9    19873
DIP-Dmeye              859     19              42          6626     18  155    44437    10    20053
DIP-Doxap              357     20              26          6652     19  169    47648     2    20091
DIP-Dpulc             1206     21              26          6678     20  169    51028     4    20171
DIP-Dspin              799     22              15          6693     21  149    54157     0    20171
DIP-Dspj1              520     23              11          6704     22  145    57347     1    20193
DIP-Dspj3              491     24              15          6719     23  125    60222     0    20193
HYB-Dcine              518     25               8          6727     24  101    62646     0    20193
HYB-Dspc2              511     26               6          6733     25  107    65321     0    20193
OUT-Operu             1358     27               1          6734     26   99    67895     0    20193
PIO-Dalve             1470     28               0          6734     27   86    70217     0    20193
PIO-Danti             1531     29               1          6735     28   82    72513     0    20193
PIO-Dapic             1485     30               0          6735     29   71    74572     0    20193
PIO-Dcaya              877     31               0          6735     30   50    76072     0    20193
PIO-Dcine                6     32               0          6735     
PIO-Dcolo             1994     33               0          6735     
PIO-Derio             1619     34               0          6735     
PIO-Dfron             1511     35               0          6735     
PIO-Dglut             3329     36               0          6735     
PIO-Djene              735     37               0          6735     
PIO-Dphyl              665     38               0          6735     
PIO-Drevo              807     39               0          6735     
PIO-Drhom             1563     40               0          6735     
PIO-Drosm              611     41               0          6735     
PIO-Drupe             1977     42               0          6735     
PIO-Dschu              722     43               0          6735     
PIO-Dtenu             1161     44               0          6735

And then I commented out the relevant lines in cluster_within.py to disable -leftjust:

                            total_filters  applied_order  retained_loci
total_prefiltered_loci              47691              0          47691
filtered_by_rm_duplicates            4790           4790          42901
filtered_by_max_indels               1045            741          42160
filtered_by_max_snps                 3056           1779          40381
filtered_by_max_shared_het              0              0          40381
filtered_by_min_sample              29709          27194          13187
filtered_by_max_alleles              8182           3498           9689
total_filtered_loci                  9689              0           9689

           sample_coverage         locus_coverage  sum_coverage         var  sum_var   pis  sum_pis
AS1-Aaspe             2185     1                0             0     0   239        0  2481        0
AS1-Bbart             1027     2                0             0     1   346      346  1676     1676
AS2-Amatu              972     3                0             0     2   370     1086  1256     4188
AS2-Bgeni             2103     4             3154          3154     3   469     2493   961     7071
AS2-Btric              868     5             1858          5012     4   479     4409   745    10051
AS2-Fhyps             1975     6             1091          6103     5   536     7089   581    12956
AS2-Halie             1291     7              796          6899     6   523    10227   487    15878
AS2-Pquad             1201     8              552          7451     7   483    13608   358    18384
DIP-Dbarc             1042     9              381          7832     8   477    17424   287    20680
DIP-Dcall             2161     10             339          8171     9   471    21663   207    22543
DIP-Dempe             1508     11             253          8424     10  422    25883   172    24263
DIP-Deric             1117     12             247          8671     11  406    30349   118    25561
DIP-Dglan             1703     13             192          8863     12  407    35233    97    26725
DIP-Dgnid               42     14             149          9012     13  360    39913    76    27713
DIP-Dgood             2497     15              97          9109     14  339    44659    63    28595
DIP-Dgyno              410     16              71          9180     15  323    49504    42    29225
DIP-Dhaen              705     17              74          9254     16  289    54128    31    29721
DIP-Dhart              429     18              73          9327     17  298    59194    17    30010
DIP-Dmeye             1625     19              55          9382     18  269    64036    11    30208
DIP-Doxap              791     20              54          9436     19  230    68406    11    30417
DIP-Dpulc             1576     21              59          9495     20  219    72786     6    30537
DIP-Dspin             1064     22              37          9532     21  251    78057     2    30579
DIP-Dspj1             1126     23              27          9559     22  229    83095     2    30623
DIP-Dspj3             1078     24              39          9598     23  213    87994     1    30646
HYB-Dcine              694     25              30          9628     24  197    92722     1    30670
HYB-Dspc2             1015     26              11          9639     25  177    97147     0    30670
OUT-Operu             1833     27              16          9655     26  168   101515     0    30670
PIO-Dalve             2548     28              12          9667     27  142   105349     0    30670
PIO-Danti             1843     29               5          9672     28  135   109129     0    30670
PIO-Dapic             2530     30               4          9676     29  121   112638     0    30670
PIO-Dcaya             1968     31               3          9679     30  101   115668     0    30670
PIO-Dcine               16     32               2          9681     
PIO-Dcolo             2990     33               0          9681     
PIO-Derio             2624     34               1          9682     
PIO-Dfron             1839     35               0          9682     
PIO-Dglut             4488     36               1          9683     
PIO-Djene             1373     37               1          9684     
PIO-Dphyl             1664     38               1          9685     
PIO-Drevo             1850     39               0          9685     
PIO-Drhom             1977     40               0          9685     
PIO-Drosm             1566     41               2          9687     
PIO-Drupe             2364     42               2          9689     
PIO-Dschu             1824     43               0          9689     
PIO-Dtenu             1521     44               0          9689

The cluster alignments now accept initial gaps, which are common both because the new quality filtering trims bases on both ends and because of allelic variation. I also suspect that -leftjust was creating many singletons or even separate loci; at least, the number of singletons in step 3 is now lower.
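To illustrate why trimmed reads run into this, here is a toy sketch of the condition that -leftjust tests for (vsearch rejects a match when the pairwise alignment begins with gaps). The sequences and the helper function are invented for the example, and the alignment is written by hand rather than computed by vsearch.

```python
def begins_with_gap(aligned_query, aligned_target):
    """Return True if either row of a pairwise alignment starts with a gap."""
    return aligned_query.startswith("-") or aligned_target.startswith("-")


# Hypothetical seed sequence and a read from the same locus that lost
# two bases at its left end during quality trimming.
seed = "TGCAGGATTACAGGCTAA"
trimmed_read = seed[2:]

# Hand-written alignment: the trimmed read needs two leading gaps
# to line up with the seed.
aligned_read = "--" + trimmed_read
assert len(aligned_read) == len(seed)

print(begins_with_gap(aligned_read, seed))  # trimmed read vs. seed
print(begins_with_gap(seed, seed))          # untrimmed read vs. seed
```

Under a left-justification requirement the first pair would be rejected even though the read is otherwise identical to the seed, so the read would have to start its own cluster.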

dereneaton commented 7 years ago

Oh, nice! Yeah, you can see I left a little note next to the -leftjust argument in the code saying that I was unsure whether we should keep using it now that we have enabled left-side edge trimming. This is also the reason we disable it for gbs data, which gets reverse complemented, so you expect more left-side gaps. Looks like we should just do away with it! Nice find.

-Deren
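For readers who want to try the same comparison, the datatype-dependent use of -leftjust described above could look roughly like the sketch below. This is not the actual code in cluster_within.py: the function name, its parameters, the default threshold, and the file names are all assumptions, and only a handful of vsearch options are shown.

```python
def build_vsearch_cmd(infile, userout, threshold=0.85, datatype="ddrad",
                      use_leftjust=True):
    """Build a vsearch clustering command as a list of arguments.

    Hypothetical helper for illustration only; ipyrad's real command
    passes more options than are shown here.
    """
    cmd = [
        "vsearch",
        "-cluster_smallmem", infile,   # cluster reads in input order
        "-id", str(threshold),         # minimum pairwise identity
        "-userout", userout,           # tab-separated match table
        "-userfields", "query+target+id+gaps+qstrand+qcov",
        "-usersort",                   # accept input not sorted by length
    ]
    # Skip left-justification for gbs data, and whenever it is switched off.
    if use_leftjust and datatype != "gbs":
        cmd.append("-leftjust")
    return cmd


with_flag = build_vsearch_cmd("derep.fa", "matches.tsv")
without_flag = build_vsearch_cmd("derep.fa", "matches.tsv", use_leftjust=False)
print(" ".join(with_flag))
print(" ".join(without_flag))
```

Building the command as a list makes it easy to pass to subprocess and to toggle a single flag between two otherwise identical runs.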


edgardomortiz commented 7 years ago

Cool! Yes, when I read the note I decided to try disabling the option and making the comparison.