ANGSD / angsd

Program for analysing NGS data.
228 stars 50 forks source link

How do you unflatten a 2dSFS? Is Pop1 nrows or ncols in the matrix? #486

Closed TeresaPegan closed 2 years ago

TeresaPegan commented 2 years ago

Hello, I would like to turn my flattened 2dSFS into an actual matrix. In order to do this, for example in R, we have to know the dimensions of the matrix (rows and columns), and we have to know whether the matrix should be filled in row-wise or column-wise.

The ANGSD documentation is not explicit about how to do this. From the popgen.dk page we can infer that Pop1 (i.e. the first SAF file given in the command to create the SFS) should represent the number of columns in the matrix. http://www.popgen.dk/angsd/index.php/2d_SFS_Estimation

Assume you have a 12 bamfiles for population in the file pop1.list
Assume you have a 14 bamfiles for population in the file pop2.list

The output is then located in a nice flattened matrix format(25x29) in the file: 2dsfs.sfs. Good luck visualising it, some people are using dadi, we have been using heat maps in R.

If we assume that the matrix format is given in the normal way, X value (ncols) by Y value (nrows), then ncols=25. That corresponds to pop1.list, where there were 12 bams, and the SFS dimension from 12 diploids should be 25.

The reason why I ask this is that in one of the scripts provided in this program, 'angsd/R/fstFrom2dSFS.R', line 2 implies the opposite:

 N1<-nrow(est)-1

Generally I think it's reasonable to interpret "N1" to refer to population 1 size. This script would therefore imply that population 1 is represented in the number of rows of the 2dsfs. However, maybe the N1 is just supposed to refer to a dimension and not the population?

1) Could someone confirm which of these interpretations is correct? Is Pop1 supposed to be ncols or nrows? My hunch is that the popgen.dk website is right and the R script is wrong (or that N1 simply has nothing to do with Pop1 in that script), but there are other typos in the popgen.dk documentation so I wanted to be sure....

2) Should the matrix be built by row (byrow=T in R) or by column (byrow=F in R)? In this script from ANGSD (angsd/R/fstFrom2dRHELLER.R), byrow=T. But again, I would like to check, because I am not sure whether these scripts are accurate and it has a big impact on the results.

Thanks, -Teresa

z0on commented 2 years ago

maybe this?.. (in R)

sfs2matrix=function(sfs,n1,n2) { dd=matrix(ncol=2n1+1,nrow=2n2+1,sfs) return(apply(t(dd),2,rev)) }

On Wed, May 11, 2022 at 12:42 PM Teresa @.***> wrote:

Hello, I would like to turn my flattened 2dSFS into an actual matrix. In order to do this, for example in R, we have to know the dimensions of the matrix (rows and columns), and we have to know whether the matrix should be filled in row-wise or column-wise.

The ANGSD documentation is not explicit about how to do this. From the popgen.dk page we can infer that Pop1 (i.e. the first SAF file given in the command to create the SFS) should represent the number of columns in the matrix. http://www.popgen.dk/angsd/index.php/2d_SFS_Estimation

Assume you have a 12 bamfiles for population in the file pop1.list Assume you have a 14 bamfiles for population in the file pop2.list

The output is then located in a nice flattened matrix format(25x29) in the file: 2dsfs.sfs. Good luck visualising it, some people are using dadi, we have been using heat maps in R.

If we assume that the matrix format is given in the normal way, X value (ncols) by Y value (nrows), then ncols=25. That corresponds to pop1.list, where there were 12 bams, and the SFS dimension from 12 diploids should be 25.

The reason why I ask this is that in one of the scripts provided in this program, 'angsd https://github.com/ANGSD/angsd/R https://github.com/ANGSD/angsd/tree/master/R/fstFrom2dSFS.R', line 2 implies the opposite:

N1<-nrow(est)-1

Generally I think it's reasonable to interpret "N1" to refer to population 1 size. This script would therefore imply that population 1 is represented in the number of rows of the 2dsfs. However, maybe the N1 is just supposed to refer to a dimension and not the population?

1.

Could someone confirm which of these interpretations is correct? Is Pop1 supposed to be ncols or nrows? My hunch is that the popgen.dk website is right and the R script is wrong (or that N1 simply has nothing to do with Pop1 in that script), but there are other typos in the popgen.dk documentation so I wanted to be sure.... 2.

Should the matrix be built by row (byrow=T in R) or by column (byrow=F in R)? In this script from ANGSD (angsd https://github.com/ANGSD/angsd/R https://github.com/ANGSD/angsd/tree/master/R/fstFrom2dRHELLER.R), byrow=T. But again, I would like to check, because I am not sure whether these scripts are accurate and it has a big impact on the results.

Thanks, -Teresa

— Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/486, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZUHGBPLRNKTSGAJ6FMMY3VJPWPZANCNFSM5VVWHQMQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

TeresaPegan commented 2 years ago

Hi, thanks for the function! This gives me a different result than the one I had based off of what was written in (angsd/R/fstFrom2dRHELLER.R)).

code from that script:

full<-scan("2dsfs_auto_ancCorrect_indFilters_chapmani_MB.sfs")
fullM<-matrix(full, ncol=35, nrow=17, byrow=T)

(I replaced the ncol and nrow with what I think were the appropriate numbers for my data -- see my code below).

Your function produces a matrix where 2*N1+1 actually becomes the row number, not the column number. Thus, taking the example from the popgen.dk docs, I think your function would produce a 29x25 matrix, not the 25x29 matrix described there.

I am attaching plots I get of the 2dsfs from each method, using the same sfs. What do you think? Other than the potential row/column switching, I think your function produces a more reasonable looking result...?

Thanks!

-Teresa

est <- scan("~/Downloads/ABCN.txt")
pop1len <- 7
pop2len <- 15

estm1 <- matrix(est, ncol=2*pop1len+1,nrow=2*pop2len+1, byrow=T)

estm2 <- sfs2matrix(est, pop1len, pop2len)

image(1:ncol(estm1), 1:nrow(estm1), t(estm1), col = terrain.colors(60), axes = FALSE,
      main="fstFrom2dRHELLER method")
axis(1, 1:ncol(estm1), colnames(estm1))
axis(2, 1:nrow(estm1), rownames(estm1))

image(1:ncol(estm2), 1:nrow(estm2), t(estm2), col = terrain.colors(60), axes = FALSE,
      main="z0on method")
axis(1, 1:ncol(estm2), colnames(estm2))
axis(2, 1:nrow(estm2), rownames(estm2))

image

image

z0on commented 2 years ago

I was actually not sure my function did it right, but now I tested it a bit more and the result looks correct. Thanks for pointing out the x-y switch bug - pasted the corrected code below. Now pop1 (n1) ends up on x axis, as originally intended.

(Looks like in the other function the n1 and n2 were simply switched?)

sfs2matrix=function(sfs,n1,n2) { dd=matrix(ncol=2n1+1,nrow=2n2+1,sfs) return(apply(dd,2,rev)) }

On Wed, May 11, 2022 at 3:27 PM Teresa @.***> wrote:

Hi, thanks for the function! This gives me a different result than the one I had based off of what was written in (angsd https://github.com/ANGSD/angsd/R https://github.com/ANGSD/angsd/tree/master/R/fstFrom2dRHELLER.R)).

code from that script:

full<-scan("2dsfs_auto_ancCorrect_indFilters_chapmani_MB.sfs") fullM<-matrix(full, ncol=35, nrow=17, byrow=T)

(I replaced the ncol and nrow with what I think were the appropriate numbers for my data -- see my code below).

Your function produces a matrix where 2*N1+1 actually becomes the row number, not the column number. Thus, taking the example from the popgen.dk docs http://www.popgen.dk/angsd/index.php/2d_SFS_Estimation, I think your function would produce a 29x25 matrix, not the 25x29 matrix described there.

I am attaching plots I get of the 2dsfs from each method, using the same sfs. What do you think? Other than the potential row/column switching, I think your function produces a more reasonable looking result...?

Thanks!

-Teresa

est <- scan("~/Downloads/ABCN.txt") pop1len <- 7 pop2len <- 15

estm1 <- matrix(est, ncol=2pop1len+1,nrow=2pop2len+1, byrow=T)

estm2 <- sfs2matrix(est, pop1len, pop2len)

image(1:ncol(estm1), 1:nrow(estm1), t(estm1), col = terrain.colors(60), axes = FALSE, main="fstFrom2dRHELLER method") axis(1, 1:ncol(estm1), colnames(estm1)) axis(2, 1:nrow(estm1), rownames(estm1))

image(1:ncol(estm2), 1:nrow(estm2), t(estm2), col = terrain.colors(60), axes = FALSE, main="z0on method") axis(1, 1:ncol(estm2), colnames(estm2)) axis(2, 1:nrow(estm2), rownames(estm2))

[image: image] https://user-images.githubusercontent.com/35701568/167941170-29747852-f14b-4a75-a449-143103658ec2.png

[image: image] https://user-images.githubusercontent.com/35701568/167941195-b9e2b6d1-7f56-4ef5-8d0a-45255c943f1f.png

— Reply to this email directly, view it on GitHub https://github.com/ANGSD/angsd/issues/486#issuecomment-1124264387, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABZUHGCAXNURLJANP6IYGQDVJQJ2NANCNFSM5VVWHQMQ . You are receiving this because you commented.Message ID: @.***>

nspope commented 2 years ago

In the latest version of realSFS there's a subcommand bins, here is the usage string

Usage: realSFS bins [options] deme1.saf.idx [deme2.saf.idx ...]

Options: -fold 0 (0 = unfolded, 1 = folded SFS)

Notes: Returns allele counts per population (cols) for each SFS bin (rows). If folded, bins that are out-of-bounds are marked as 'NaN'

Which could be used to double check your code or to build a vector-to-array mapping for whatever programming language

TeresaPegan commented 2 years ago

Thanks, that does seem really helpful! -Teresa

ANGSD commented 2 years ago

I assume this has been resolved so I am closing this issue.

space-beaver commented 1 year ago

Hi @TeresaPegan @z0on,

I found this (modified) code on issue #338 by nspope, for plotting comparisons and the log scale really improves the visuals. This is for 2D unfolded SFS - I'm going to try and figure out how to use it to plot the folded SFS but haven't got that far yet. Images are with and without log transformed data.

library(ggplot2) library(reshape2) realsfs <- melt(matrix(scan("ARR_CEN_plot.sfs"),9,39)) df <- rbind(data.frame(realsfs, est="realSFS")) df$est <- factor(df$est,levels=c("realSFS"))

ggplot(df) + geom_tile(aes(x=Var1, y=Var2, fill=log(ifelse(value<1e-1,0,value)))) + scale_fill_gradientn("log(sites)", colors=c("red", "yellow", "green", "dodgerblue", "blue", "purple"), na.value="white") + scale_x_continuous(expand=c(0,0)) + scale_y_continuous(expand=c(0,0)) + theme_bw() + xlab("Freq in Pop1") + ylab("Freq in Pop2") + facet_grid(~est) + theme(panel.grid=element_blank())

image

image

teegi1 commented 1 year ago

Hi,

Is there any modified R code to visualize folded 2dSFS? ANGSD documentation is not mentioned how to do this.

Thanks, Thilini

space-beaver commented 9 months ago

Hi,

Is there any modified R code to visualize folded 2dSFS? ANGSD documentation is not mentioned how to do this.

Thanks, Thilini

@teegi1 did you manage to find a method in R for folded 2dsfs?