bodkan / admixr

An R package for reproducible and automated ADMIXTOOLS analyses
https://bodkan.net/admixr
Other
28 stars 9 forks source link

"d" or "f4" statistics output always stderr = 1 #56

Closed DanielRamos12 closed 4 years ago

DanielRamos12 commented 4 years ago

Dear Martin;

I'm trying to analyse my Rad-seq data using your fantastic Admixr package. The package itself works perfectly. However, when I run the "d" or "f4" statistics functions the output is always the following;

image

The other functions such as f4ratio() or f3() worked fine.

I tried to do it manually (I followed the link below), and I found that some of my individuals may have a few loci with NA answers (If this is the issue, I don't know how to filter the geno/snps file).

http://evomics.org/learning/population-and-speciation-genomics/2018-population-and-speciation-genomics/abba-baba-statistics/

I hope you are saved and okay,

Best wishes,

Daniel

bodkan commented 4 years ago

Dear Daniel,

I'm continuing our email conversation here as Github helps me (and potentially others) to keep track of previous issues. I'm not going to share the data you sent me though, don't worry.

First of all, I can confirm the issue using your data:

## I started R in the same directory where I unpacked your zip file

library(admixr)

snps <- eigenstrat("c_nem_rename_Def")

## can confirm your issue
d(W = "Pyrenean", X="Belgian", Y="European", Z="Hungarian", data=snps)
## # A tibble: 1 x 10
##   W        X       Y        Z             D stderr Zscore  BABA  ABBA nsnps
##   <chr>    <chr>   <chr>    <chr>     <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 Pyrenean Belgian European Hungarian     0      1      0     0     0     0

Note that the count of observed BABA and ABBA counts (as well as informative positions nsnps) reported is 0. I confirmed that this is not an issue with admixr parsing (As I can reproduce this issue even using standard command-line ADMIXTOOLS analysis without admixr being involved). ADMIXTOOLS truly doesn't see any informative variation in your data. There are no BABA or ABBA patterns it can calculate on, hence it reports such a vague uninformative result. (The reason other statistics work as you mention, which I also confirmed, is that there is enough variation in those to do a calculation on, for instance f3).

I went ahead and looked at the genotypes manually (just samples involved in the D calculation you mentioned) and I can confirm that there are individuals with missing data, but that alone does not present a problem for ADMIXTOOLS calculation of D statistics.

I'm afraid I can immediately say what is wrong here because I don't understand the nature of your data (species, etc).

I suggest to start simple and pick a set of nice and easy analyses for which you know what the outcome should be and test if you get a D statistic that confirms it. For my work, this would be something totally obvious and textbook like, such as some simple D statistic detecting the presence of Neanderthal ancestry in a population which we know it carries (just an example).


More importantly, perhaps, I took a look at the snp file you provided and saw something like this:


> read_snp(snps)
## # A tibble: 10,711 x 6
##    id              chrom   gen   pos ref   alt  
##    <chr>           <chr> <dbl> <int> <chr> <chr>
##  1 ctg11806:73     1     0.890     1 C     T    
##  2 tig00045252:116 1     0.259     2 C     G    
##  3 tig00045252:132 1     0.259     3 T     C    
##  4 tig00045252:141 1     0.259     4 T     C    
##  5 ctg22966:157    1     0.669     5 C     T    
##  6 ctg13176:161    1     0.139     6 G     A    
##  7 ctg22966:179    1     0.669     7 A     G    
##  8 ctg22966:192    1     0.669     8 A     G    
##  9 ctg13176:196    1     0.139     9 A     G    
## 10 tig00045252:207 1     0.259    10 G     T    
## # … with 10,701 more rows

I was a bit puzzled to see those strange SNP identifiers and wondered why is all your data on chromosome 1 only. But then, when I check your VCF file, I saw this data is not actually for chromosome 1 but the SNP IDs are contigs? I presume your data is not human, right?

I wonder why did you format the snp file like this? The information about chromosomes (and associated genetic distances) is important for ADMIXTOOLS. I'm afraid if your data is truly composed of so many individual contigs, the calculation will not work as ADMIXTOOLS itself does not support this (see the format specification). I had people reporting issues like this in the past (having scattered assemblies from a yet underscribed species).

So, although it's disappointing, I don't think this is an issue with my package and I'm not really sure how to help here.

bodkan commented 4 years ago

After further discussion over email we both agreed the problem is not related to the admixr package but is rather a challenge of the data itself.

Closing for now and will try to comment back once there are further developments (in ADMIXTOOLS or the issue with the data is solved).