bcm-uga / pcadapt

Performing highly efficient genome scans for local adaptation with R package pcadapt v4
https://bcm-uga.github.io/pcadapt

Help for using pcadapt with poolseq data #86

Closed ElMujtarVeronica closed 7 months ago

ElMujtarVeronica commented 7 months ago

Dear Florian, I have a PhD student who is analysing ddRADseq data obtained with a Pool-seq sequencing strategy. He used Stacks for de novo assembly, exported the consensus sequence of each locus, and then used PoPoolation2 to call SNPs. We want to use pcadapt to identify outlier loci, but PoPoolation2 did not provide a VCF file that we could use. We previously ran pcadapt to identify outliers using data from individuals, with the corresponding VCF file converted to a .bed file.

Now we only have the sync or rc files of Popoolation2 but we don’t know how to transform these files to generate a file with the pcadapt format for poolseq data.

We are reading the pcadapt tutorial, "Using pcadapt to detect local adaptation" (bcm-uga.github.io), and following the details of point G, "Detecting local adaptation with pooled sequencing data".

We looked at this part to try to understand the expected format of the genotype matrix:

A Pool-seq example is provided in the package, and can be loaded as follows:

pool.data <- system.file("extdata", "pool3pops", package = "pcadapt")
filename <- read.pcadapt(pool.data, type = "pool")

But we could not understand the meaning of the values (e.g., 0.1, 0.67, 0.45, 0.02…) in the filename object.

Could you please help us understand the file format, or point us to a tool that generates the input file from the sync or rc files of PoPoolation2?

Thanks

privefl commented 7 months ago

The variable name filename is really confusing here, sorry. The output from read.pcadapt(pool.data, type = "pool") is actually an R matrix, not a file path.

We assume that the user provides a matrix of relative frequencies with n rows and L columns (where n is the number of populations and L is the number of genetic markers).
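As a minimal sketch of that layout, a frequency matrix with n = 3 populations and L = 4 markers could be built from per-pool read counts as shown below. The population names, SNP names, and counts are made up for illustration, and the final comment about read.pcadapt is an assumption based on the description above, not something checked here.

```r
# Hypothetical read counts for one chosen allele at 4 SNPs in 3 pools
# (rows = populations, columns = markers).
allele_counts <- matrix(
  c(12,  3, 40, 25,
    30,  0, 22, 25,
     5, 18, 35, 10),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("pop", 1:3), paste0("snp", 1:4))
)

# Hypothetical total read depth at the same SNPs in the same pools.
depth <- matrix(
  c(50, 30, 40, 50,
    60, 20, 44, 50,
    50, 36, 70, 40),
  nrow = 3, byrow = TRUE,
  dimnames = dimnames(allele_counts)
)

# Relative frequency = reads supporting the allele / total reads.
freq <- allele_counts / depth

dim(freq)    # number of populations, then number of markers
range(freq)  # every value should lie between 0 and 1

# Assumption: with pcadapt installed, a matrix shaped like `freq` is what
# read.pcadapt(freq, type = "pool") is meant to receive.
```

Read this way, values such as 0.1 or 0.67 in the tutorial object would each be the frequency of one allele in one pool.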

It is probably an error from copy-pasting from other types of data. I'll try to update the tutorial soon.

privefl commented 7 months ago

Is it okay for you now?

ElMujtarVeronica commented 7 months ago

Hi, thanks for the details about the matrix that pcadapt expects.

However, we don't know:

1- how you obtain this matrix from poolseq data
2- the exact format of the matrix.

In the case of PoPoolation2, the files it produces (rc or sync) contain allele counts, not frequencies, so we need a quick way to convert allele counts to frequencies. Also, is L the number of SNPs, with each column holding the frequency of the major allele (determined across all populations), or is L the total number of alleles (2 × the number of SNPs), with columns for the frequencies of both the major and the minor allele?

Thanks

privefl commented 7 months ago

As said before, it is expecting a matrix of relative frequencies with n rows and L columns (where n is the number of populations and L is the number of genetic markers). I don't think the order of the alleles should matter, but it should be consistent across populations. I don't know the PoPoolation2 software, so I can't help you with that.
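For readers with the same question, here is a rough sketch of how counts could be turned into such a matrix. It is not an official converter and has not been validated against real PoPoolation2 output. It assumes the usual sync layout (contig, position, reference base, then one A:T:C:G:N:del count field per pool), and the helper name sync_to_freq and the toy lines are hypothetical. It tracks one allele per SNP (the one with the most reads summed over all pools) so that the same allele is used in every population.

```r
# Toy lines in sync layout (tab-separated): contig, position, reference base,
# then one A:T:C:G:N:del count field per pool. Values are invented.
sync_lines <- c(
  "locus1\t10\tA\t12:38:0:0:0:0\t30:30:0:0:0:0\t5:45:0:0:0:0",
  "locus2\t25\tC\t0:0:27:3:0:0\t0:0:20:0:0:0\t0:0:18:18:0:0"
)

# Hypothetical helper: returns an n (pools) x L (SNPs) frequency matrix.
sync_to_freq <- function(lines) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  n_pop  <- length(fields[[1]]) - 3L
  freq   <- matrix(NA_real_, nrow = n_pop, ncol = length(lines))
  for (j in seq_along(fields)) {
    # 6 x n_pop matrix of counts, one column per pool.
    parts  <- strsplit(fields[[j]][-(1:3)], ":", fixed = TRUE)
    counts <- vapply(parts, as.numeric, numeric(6))
    # Keep only the A, T, C, G rows (drop N and deletions).
    counts <- counts[1:4, , drop = FALSE]
    # Use the allele with the most reads over all pools, so the same
    # allele is tracked in every population at this SNP.
    major <- which.max(rowSums(counts))
    freq[, j] <- counts[major, ] / colSums(counts)
  }
  rownames(freq) <- paste0("pop", seq_len(n_pop))
  colnames(freq) <- vapply(fields, function(f) paste(f[1], f[2], sep = "_"),
                           character(1))
  freq
}

freq <- sync_to_freq(sync_lines)
print(freq)
```

For a real file, readLines() on the sync file could replace the toy vector. Pools with zero coverage at a SNP would give NaN here, and sites with more than two alleles are not handled, so both would need filtering beforehand.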

ElMujtarVeronica commented 7 months ago

Thanks for the answer!