Open cdoorenweerd opened 3 years ago
Hm, can you email me the .usnps file from the assembly you're talking about? Or post it here if it's not that big. How did you calculate the % variable sites for your usnps file? What version of ipyrad are you running?
As you say, missing data (N) and indels (-) should not be counted as "variable" for the sake of unlinked snp selection, and this is indeed the intended behavior. I looked at a couple empirical assemblies and also at the code to double-check that this is the case and my belief is that N and - should not be counted when differentiating variable vs invariable sites.
Send me your usnps file and tell me how you calculated %variable and what version of ipyrad you're running and we should be able to get it sorted out.
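To make the intended rule concrete, here is a minimal sketch of how a site could be classified as variable while ignoring N and -. This is not ipyrad's actual code; the names `MISSING`, `is_variable`, and `proportion_variable` are made up for illustration, and the sketch treats IUPAC ambiguity codes (R, Y, ...) as ordinary states, which is an assumption.

```python
# Illustrative only: NOT ipyrad's implementation. All names are hypothetical.
MISSING = {"N", "-"}  # missing data and indels are never evidence of variation


def is_variable(column):
    """True if the column holds 2+ distinct states after dropping N and -."""
    states = {base.upper() for base in column} - MISSING
    return len(states) > 1


def proportion_variable(seqs):
    """Fraction of alignment columns that are variable (0.0 if empty)."""
    columns = list(zip(*seqs))
    if not columns:
        return 0.0
    return sum(is_variable(col) for col in columns) / len(columns)


# Toy alignment: four samples, four sites.
seqs = ["ACGT", "ANGT", "A-GC", "ACGN"]
print(is_variable("CN-C"))        # False: only C remains once N and - are dropped
print(is_variable("TTCN"))        # True: T and C are both present
print(proportion_variable(seqs))  # 0.25
```

Under this rule, a column such as C/N/-/C is invariable, so it would not be eligible as an unlinked SNP.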
Appreciate you getting back so quick! Attached is the uSNP alignment created with ipyrad v 0.9.81, installed through the anaconda channel bioconda. I calculated the % variable with AMAS (https://github.com/marekborowiec/AMAS); attached is the summary output from that. Thanks for checking the code. I am now doubting myself as to how I got this dataset, which I ran several months ago. Maybe I removed taxa from the original ipyrad output uSNP file (which I sadly didn't keep), because we had a lot of repetition per species and removed the samples with less coverage. That may have inadvertently removed a lot of intraspecific variation, leaving sites that are uninformative at the inter-species level. I apologize if that turns out to be the case! I will re-run my data from the raw reads. If you cannot repeat my issue, it is probably safe to close it on GitHub.
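The sample-removal scenario described in this reply can be sketched with a toy alignment: if the only sample carrying an alternative allele is dropped, the site becomes invariable in the subset. The sample names and the helper `variable_columns` are hypothetical, and the data are invented purely for illustration.

```python
# Illustrative only: toy data and hypothetical names, not the real alignment.
MISSING = {"N", "-"}


def variable_columns(alignment):
    """Return indices of columns with 2+ distinct called states."""
    columns = zip(*alignment.values())
    return [i for i, col in enumerate(columns) if len(set(col) - MISSING) > 1]


alignment = {
    "sp1_a": "AT",
    "sp1_b": "AT",
    "sp1_c": "CT",  # the only sample carrying the C allele at site 0
    "sp2_a": "AG",
}

# Drop one low-coverage replicate, as one might do after assembly.
kept = {name: seq for name, seq in alignment.items() if name != "sp1_c"}

print(variable_columns(alignment))  # [0, 1]
print(variable_columns(kept))       # [1]
```

Site 0 was a valid SNP when ipyrad would have selected it, but it is invariable in the pruned file, so a pruned uSNP alignment can legitimately contain fewer than 100% variable sites.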
On Fri, 13 Aug 2021 at 16:50, Isaac Overcast wrote:
> As you say, missing data (N) and indels (-) should not be counted as "variable" for the sake of unlinked snp selection [...]
Alignment_name                     GBS_dor_phylogeny_16.usnps_noref.nex
No_of_taxa                         418
Alignment_length                   694
Total_matrix_cells                 290092
Undetermined_characters            124917
Missing_percent                    43.061
No_variable_sites                  349
Proportion_variable_sites          0.503
Parsimony_informative_sites        218
Proportion_parsimony_informative   0.314
AT_content                         0.534
GC_content                         0.466

Character counts:
A 41549   C 36691   G 39016   T 45491
K 246     M 244     R 565     Y 724     S 157     W 492
B 0       V 0       H 0       D 0       X 0
N 124917  O 0       - 0       ? 0
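As a quick sanity check, the derived columns of this summary can be recomputed from the raw counts. The numbers below are copied from the summary above; the variable names are my own, and the formulas are my assumption about how each proportion is defined rather than something taken from the AMAS source.

```python
# Values copied from the AMAS summary; variable names are hypothetical.
taxa, length = 418, 694
undetermined = 124917
variable_sites, informative_sites = 349, 218

called = 41549 + 36691 + 39016 + 45491         # A, C, G, T
ambiguous = 246 + 244 + 565 + 724 + 157 + 492  # K, M, R, Y, S, W
missing = 124917                               # N

cells = taxa * length
print(cells)                                   # 290092
print(called + ambiguous + missing == cells)   # True
print(round(100 * undetermined / cells, 3))    # 43.061
print(round(variable_sites / length, 3))       # 0.503
print(round(informative_sites / length, 3))    # 0.314
```

The character counts add up to the total number of matrix cells, and the reported proportions are consistent with dividing the site counts by the alignment length of 694.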
I don't see the usnps file attached, so I can't really say what's going on. If you did a bunch of post-processing then this could certainly change the state of variable sites to invariable. Also, I'm not certain, but I looked at the AMAS code, and it looks like it throws out ambiguous bases during calculation of variable sites: https://github.com/marekborowiec/AMAS/blob/master/amas/AMAS.py (around line 787). This would also certainly change the outcome when calculating # of variable sites. Perhaps AMAS is making some kind of assumption about the input data that the ipyrad usnps file is not meeting.
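To show why the handling of ambiguous bases matters, here is a small sketch that counts variable sites twice: once treating IUPAC ambiguity codes as states, and once discarding them. It is not the AMAS implementation (I have not confirmed exactly what AMAS does); `count_variable` and its `drop_ambiguous` flag are hypothetical names used only for this comparison.

```python
# Illustrative only: NOT AMAS code. Names and the flag are hypothetical.
MISSING = {"N", "-", "?"}
UNAMBIGUOUS = {"A", "C", "G", "T"}


def count_variable(seqs, drop_ambiguous):
    """Count columns with 2+ distinct states, optionally ignoring IUPAC codes."""
    total = 0
    for column in zip(*seqs):
        states = {base.upper() for base in column} - MISSING
        if drop_ambiguous:
            states &= UNAMBIGUOUS  # keep only A, C, G, T
        if len(states) > 1:
            total += 1
    return total


# Site 0 varies by A/C; site 1 varies only through a heterozygous call (R).
seqs = ["AR", "AG", "AG", "CN"]
print(count_variable(seqs, drop_ambiguous=False))  # 2
print(count_variable(seqs, drop_ambiguous=True))   # 1
```

A site whose only variation comes from heterozygous (ambiguous) calls is variable under the first convention and invariable under the second, so two tools can disagree on the percentage of variable sites for the same file.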
After running ipyrad, my dataset with full loci has an aligned length of 255,532 bp, with 72.6% missing data and 7.8% variable sites. For the unlinked SNPs selected from this set, I would expect 100% of the sites to be variable. Instead, the uSNP alignment of 694 sites has only 28.2% variable sites. The only explanation I can think of is that ipyrad counts Ns (missing data) as candidate SNPs when selecting one unlinked SNP per locus. I think that would be unwanted behavior?