Closed: edgardomortiz closed this issue 7 years ago
Oh, nice! Yeah, you can see I left a little note next to the -leftjust argument in the code to indicate that I was unsure whether we should keep using it now that we have enabled left-side edge trimming. This is also the reason we disable it for gbs data which reverse complements so that you expect more left side gaps. Looks like we should just do away with it! Nice find.
-Deren
On 10/07/2016 10:29 AM, Edgardo M. Ortiz wrote:
Roughly half the loci were dropped in this dataset because of the duplicates filter, so when comparing how the clustering was made in step 3 and in step 6 the only major difference I could observe was the use of the option |-leftjust| in |cluster_within.py|.
Here is a little comparison made by disabling that option and reanalyzing this dataset:
|-leftjust| enabled:
                                total_filters  applied_order  retained_loci
    total_prefiltered_loci              56798              0          56798
    filtered_by_rm_duplicates           24978          24978          31820
    filtered_by_max_indels               1106            385          31435
    filtered_by_max_snps                 3092            578          30857
    filtered_by_max_shared_het              6              2          30855
    filtered_by_min_sample              37889          22568           8287
    filtered_by_max_alleles              8009           1552           6735
    total_filtered_loci                  6735              0           6735

               sample_coverage      locus_coverage  sum_coverage      var  sum_var   pis  sum_pis
    AS1-Aaspe             1618   1               0             0   0  171        0  1814        0
    AS1-Bbart              829   2               0             0   1  260      260  1214     1214
    AS2-Amatu              599   3               0             0   2  293      846   861     2936
    AS2-Bgeni             1546   4            2369          2369   3  347     1887   649     4883
    AS2-Btric              495   5            1351          3720   4  376     3391   515     6943
    AS2-Fhyps             1556   6             795          4515   5  393     5356   415     9018
    AS2-Halie              824   7             557          5072   6  400     7756   318    10926
    AS2-Pquad              680   8             364          5436   7  355    10241   240    12606
    DIP-Dbarc              779   9             246          5682   8  352    13057   170    13966
    DIP-Dcall             1417  10             210          5892   9  325    15982   106    14920
    DIP-Dempe              995  11             158          6050  10  292    18902   128    16200
    DIP-Deric              521  12             143          6193  11  296    22158    92    17212
    DIP-Dglan             1353  13             131          6324  12  292    25662    60    17932
    DIP-Dgnid                7  14              78          6402  13  226    28600    49    18569
    DIP-Dgood             1612  15              51          6453  14  250    32100    30    18989
    DIP-Dgyno              258  16              41          6494  15  226    35490    37    19544
    DIP-Dhaen              551  17              53          6547  16  184    38434    11    19720
    DIP-Dhart              260  18              37          6584  17  189    41647     9    19873
    DIP-Dmeye              859  19              42          6626  18  155    44437    10    20053
    DIP-Doxap              357  20              26          6652  19  169    47648     2    20091
    DIP-Dpulc             1206  21              26          6678  20  169    51028     4    20171
    DIP-Dspin              799  22              15          6693  21  149    54157     0    20171
    DIP-Dspj1              520  23              11          6704  22  145    57347     1    20193
    DIP-Dspj3              491  24              15          6719  23  125    60222     0    20193
    HYB-Dcine              518  25               8          6727  24  101    62646     0    20193
    HYB-Dspc2              511  26               6          6733  25  107    65321     0    20193
    OUT-Operu             1358  27               1          6734  26   99    67895     0    20193
    PIO-Dalve             1470  28               0          6734  27   86    70217     0    20193
    PIO-Danti             1531  29               1          6735  28   82    72513     0    20193
    PIO-Dapic             1485  30               0          6735  29   71    74572     0    20193
    PIO-Dcaya              877  31               0          6735  30   50    76072     0    20193
    PIO-Dcine                6  32               0          6735
    PIO-Dcolo             1994  33               0          6735
    PIO-Derio             1619  34               0          6735
    PIO-Dfron             1511  35               0          6735
    PIO-Dglut             3329  36               0          6735
    PIO-Djene              735  37               0          6735
    PIO-Dphyl              665  38               0          6735
    PIO-Drevo              807  39               0          6735
    PIO-Drhom             1563  40               0          6735
    PIO-Drosm              611  41               0          6735
    PIO-Drupe             1977  42               0          6735
    PIO-Dschu              722  43               0          6735
    PIO-Dtenu             1161  44               0          6735
And then I commented out the relevant lines in |cluster_within.py|, disabling |-leftjust|:
                                total_filters  applied_order  retained_loci
    total_prefiltered_loci              47691              0          47691
    filtered_by_rm_duplicates            4790           4790          42901
    filtered_by_max_indels               1045            741          42160
    filtered_by_max_snps                 3056           1779          40381
    filtered_by_max_shared_het              0              0          40381
    filtered_by_min_sample              29709          27194          13187
    filtered_by_max_alleles              8182           3498           9689
    total_filtered_loci                  9689              0           9689

               sample_coverage      locus_coverage  sum_coverage      var  sum_var   pis  sum_pis
    AS1-Aaspe             2185   1               0             0   0  239        0  2481        0
    AS1-Bbart             1027   2               0             0   1  346      346  1676     1676
    AS2-Amatu              972   3               0             0   2  370     1086  1256     4188
    AS2-Bgeni             2103   4            3154          3154   3  469     2493   961     7071
    AS2-Btric              868   5            1858          5012   4  479     4409   745    10051
    AS2-Fhyps             1975   6            1091          6103   5  536     7089   581    12956
    AS2-Halie             1291   7             796          6899   6  523    10227   487    15878
    AS2-Pquad             1201   8             552          7451   7  483    13608   358    18384
    DIP-Dbarc             1042   9             381          7832   8  477    17424   287    20680
    DIP-Dcall             2161  10             339          8171   9  471    21663   207    22543
    DIP-Dempe             1508  11             253          8424  10  422    25883   172    24263
    DIP-Deric             1117  12             247          8671  11  406    30349   118    25561
    DIP-Dglan             1703  13             192          8863  12  407    35233    97    26725
    DIP-Dgnid               42  14             149          9012  13  360    39913    76    27713
    DIP-Dgood             2497  15              97          9109  14  339    44659    63    28595
    DIP-Dgyno              410  16              71          9180  15  323    49504    42    29225
    DIP-Dhaen              705  17              74          9254  16  289    54128    31    29721
    DIP-Dhart              429  18              73          9327  17  298    59194    17    30010
    DIP-Dmeye             1625  19              55          9382  18  269    64036    11    30208
    DIP-Doxap              791  20              54          9436  19  230    68406    11    30417
    DIP-Dpulc             1576  21              59          9495  20  219    72786     6    30537
    DIP-Dspin             1064  22              37          9532  21  251    78057     2    30579
    DIP-Dspj1             1126  23              27          9559  22  229    83095     2    30623
    DIP-Dspj3             1078  24              39          9598  23  213    87994     1    30646
    HYB-Dcine              694  25              30          9628  24  197    92722     1    30670
    HYB-Dspc2             1015  26              11          9639  25  177    97147     0    30670
    OUT-Operu             1833  27              16          9655  26  168   101515     0    30670
    PIO-Dalve             2548  28              12          9667  27  142   105349     0    30670
    PIO-Danti             1843  29               5          9672  28  135   109129     0    30670
    PIO-Dapic             2530  30               4          9676  29  121   112638     0    30670
    PIO-Dcaya             1968  31               3          9679  30  101   115668     0    30670
    PIO-Dcine               16  32               2          9681
    PIO-Dcolo             2990  33               0          9681
    PIO-Derio             2624  34               1          9682
    PIO-Dfron             1839  35               0          9682
    PIO-Dglut             4488  36               1          9683
    PIO-Djene             1373  37               1          9684
    PIO-Dphyl             1664  38               1          9685
    PIO-Drevo             1850  39               0          9685
    PIO-Drhom             1977  40               0          9685
    PIO-Drosm             1566  41               2          9687
    PIO-Drupe             2364  42               2          9689
    PIO-Dschu             1824  43               0          9689
    PIO-Dtenu             1521  44               0          9689
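To put the two runs side by side, here is a minimal sketch that turns the headline counts into fractions. The numbers are copied from the total_filters column of the two filter tables; the dictionary keys and the `summarize` helper are made up for this illustration and are not part of ipyrad.

```python
# Counts copied from the two filter tables (total_filters column).
# The key names and this helper are illustrative only, not ipyrad output fields.
runs = {
    "leftjust enabled": {"prefiltered": 56798, "rm_duplicates": 24978, "final": 6735},
    "leftjust disabled": {"prefiltered": 47691, "rm_duplicates": 4790, "final": 9689},
}


def summarize(run):
    """Return (fraction flagged as duplicates, fraction of loci retained)."""
    dup_frac = run["rm_duplicates"] / run["prefiltered"]
    kept_frac = run["final"] / run["prefiltered"]
    return dup_frac, kept_frac


for name, run in runs.items():
    dup_frac, kept_frac = summarize(run)
    print(f"{name}: {dup_frac:.1%} flagged as duplicates, {kept_frac:.1%} retained")
```

With these counts, the duplicates filter flags roughly 44% of prefiltered loci when the option is enabled and roughly 10% when it is disabled.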
The cluster alignments now accept initial gaps, which are common both because of the new quality filtering that trims bases on both ends and because of allelic variation. I also imagine that the option |-leftjust| was creating many singletons or even separate loci; at least, I can see that the number of singletons is reduced in step 3.
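As a toy illustration of why a left-justification requirement could split edge-trimmed reads away from their cluster, the sketch below assumes the option simply rejects any hit whose pairwise alignment starts with a terminal gap. The function names and sequences are hypothetical; this is not the code in cluster_within.py or in the clustering tool itself.

```python
def has_left_terminal_gap(seed_row: str, read_row: str, gap: str = "-") -> bool:
    """True if either row of a pairwise alignment begins with a gap character."""
    return seed_row.startswith(gap) or read_row.startswith(gap)


def accept_hit(seed_row: str, read_row: str, leftjust: bool) -> bool:
    """Toy acceptance rule: with leftjust on, reject alignments with a left-end gap.

    Assumption: this models only the left-justification check; a real clusterer
    would also apply an identity threshold and other filters.
    """
    if leftjust and has_left_terminal_gap(seed_row, read_row):
        return False
    return True


# A read that lost two bases to left-side quality trimming aligns with leading gaps.
seed_row = "TGCAGAATTCGGA"
trimmed_read_row = "--CAGAATTCGGA"

print(accept_hit(seed_row, trimmed_read_row, leftjust=True))   # False
print(accept_hit(seed_row, trimmed_read_row, leftjust=False))  # True
```

Under this assumption, the trimmed read is rejected from the seed's cluster when the check is on, so it would end up as a singleton or seed its own cluster, which is consistent with the drop in singletons seen after disabling the option.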
Cool! Yes, when I read the note I decided to try disabling the option and making the comparison.