ignore Ns for uSNP selection

cdoorenweerd commented 3 years ago

After running ipyrad, my resulting dataset with full loci has an aligned length of 255,532 bp, 72.6% missing site data and 7.8% variable. For the unlinked SNP selection from this set I would expect 100% of the resulting sites to be variable. But instead the uSNP alignment with 694 sites only has 28.2% variable sites - the only explanation I have for this is that iPyrad must count Ns (missing data) as candidate SNPs when selecting an unlinked SNP per locus? I think that would be unwanted behavior?

isaacovercast commented 3 years ago

Hm, can you email me the .usnps file from the assembly you're talking about? Or post it here if it's not that big. How did you calculate the % variable sites for your usnps file? What version of ipyrad are you running?

isaacovercast commented 3 years ago

As you say, missing data (N) and indels (-) should not be counted as "variable" for the sake of unlinked snp selection, and this is indeed the intended behavior. I looked at a couple empirical assemblies and also at the code to double-check that this is the case and my belief is that N and - should not be counted when differentiating variable vs invariable sites.

Send me your usnps file and tell me how you calculated %variable and what version of ipyrad you're running and we should be able to get it sorted out.

cdoorenweerd commented 3 years ago

Appreciate you getting back so quick! Attached is the uSNP alignment created with ipyrad v 0.9.81 installed through anaconda channel bioconda. I calculated the % variable with AMAS (https://github.com/marekborowiec/AMAS), attached is the summary output from that. Thanks for checking the code - I am now doubting myself as to how I got this dataset, I ran it several months ago. Maybe I removed taxa from the original ipyrad output uSNP file (which I sadly didn't keep) because we had a lot of repetition per species and removed samples with less coverage. Maybe that inadvertently removed a lot of intraspecific variation, leaving uninformative sites at the inter-species level. I apologize if that turns out to be the case! I will re-run my data from the raw reads. If you cannot repeat my issue it is probably safe to close it on github

On Fri, 13 Aug 2021 at 16:50, Isaac Overcast @.***> wrote:

As you say, missing data (N) and indels (-) should not be counted as "variable" for the sake of unlinked snp selection, and this is indeed the intended behavior. I looked at a couple empirical assemblies and also at the code to double-check that this is the case and my belief is that N and

should not be counted when differentiating variable vs invariable sites.

Send me your usnps file and tell me how you calculated %variable and what version of ipyrad you're running and we should be able to get it sorted out.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/dereneaton/ipyrad/issues/454#issuecomment-898803838, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACPWZMQ4VBY7BKM7IPA73DTT4XKZBANCNFSM5CESUYLA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email .

Alignment_name No_of_taxa Alignment_length Total_matrix_cells Undetermined_characters Missing_percent No_variable_sites Proportion_variable_sites Parsimony_informative_sites Proportion_parsimony_informative AT_content GC_content A C G T K M R Y S W B V H D X N O - ? GBS_dor_phylogeny_16.usnps_noref.nex 418 694 290092 124917 43.061 349 0.503 218 0.314 0.534 0.466 41549 36691 39016 45491 246 244 565 724 157 492 0 0 0 0 0 124917 0 0 0

isaacovercast commented 3 years ago

I don't see the usnps file attached, so I can't really say what's going on. If you did a bunch of post-processing then this could certainly change the state of variable sites to invariable. Also, I'm not certain, but I looked at the AMAS code, and it looks like it throws out ambiguous bases during calculation of variable sites: https://github.com/marekborowiec/AMAS/blob/master/amas/AMAS.py (around line 787). This would also certainly change the outcome when calculating # of variable sites. Perhaps AMAS is making some kind of assumption about the input data that the ipyrad usnps file is not meeting.

dereneaton / ipyrad

ignore Ns for uSNP selection #454