How to get the genotype file - Githubissues

bcm-uga / pcadapt

Performing highly efficient genome scans for local adaptation with R package pcadapt v4

https://bcm-uga.github.io/pcadapt

37 stars 10 forks source link

How to get the genotype file #66

Closed hjt1129 closed 3 years ago

hjt1129 commented 3 years ago

We are now using pcadapt package to analyze our data, but we meet some problems. The main problem is how to get the genotype file? Our original data is one fasta file with 428 phased sequences of 214 individuals, all the sequences belong to one gene.

We have tried several methods to get vcf format file which can be used for read.pcadapt convertion. One method we used is this R code (https://rdrr.io/github/gehara/Junkyardtools/man/fasta2VCF.html), but when run read.pcadapt function, it goes the error “Error in vcfR::read.vcfR(input, verbose = FALSE) : File: E:/hongwei/hw.vcf does not appear to be a VCF file. First line of file: E:/hongwei/hw.vcf Should begin with: ##fileformat=VCFv In addition: Warning message: In file2other(input, type, match.arg(type.out), match.arg(allele.sep)) : Converter vcf to pcadapt is deprecated. Please use PLINK for conversion to bed (and QC).”

Another method we used is samtools and bcftools, one of 428 phased sequences was used as index, and follow the following seteps:

$ bwa index refref.fa

$ bwa mem -t 12 -M ref.fa sample.fasta >sample.sam

$ samtools sort -@ 12 -o sample.bam sample.sam

$ samtools mpileup -uf ref.fa sample.bam | bcftools call -mv >var.vcf

similar error "Warning message: In file2other(input, type, match.arg(type.out), match.arg(allele.sep)) : Converter vcf to pcadapt is deprecated. Please use PLINK for conversion to bed (and QC)." happens.

So we also tried to use PLINK to convert this vcf file to bed format with following code:

plink --vcf var.vcf --allow-extra-chr --maf 0.05 --recode --out var

plink --file var --maf 0.05 --allow-extra-chr --make-bed --out var

But when we run the code “x <- pcadapt(input = filename, K = 20)”, it shows “Error: Can't compute SVD. Are there SNPs or individuals with missing values only? You should use PLINK for proper data quality control.”

We now think something may be wrong during fasta to vcf format convertion, do you have any suggestions for us? or you have any proper code or package that can share to us? Many thanks.

privefl commented 3 years ago

You should be able to go from VCF to bed in one call of PLINK.

You should probably also filter on missing values using at least --mind 0.5 --geno 0.5 (in the same call).

hjt1129 commented 3 years ago

Thank you very much for your reply. We have new question is that: when we use K=i, there has i sigular.values, so what's the meaning of singular.values, is this the percentage of variance for each PC?

privefl commented 3 years ago

Please see https://github.com/bcm-uga/pcadapt/issues/46#issuecomment-586972826.

hjt1129 commented 3 years ago

This means if we want to see the proportion for first two PCs, it should be singular.values[1]^2+singular.values[2]^2, right?

privefl commented 3 years ago

Yes

chenqing-1996 commented 3 years ago

You should be able to go from VCF to bed in one call of PLINK.

You should probably also filter on missing values using at least --mind 0.5 --geno 0.5 (in the same call).

Hello privefl and hjt1129, I have filtered my bedfile using plink with ' --mind 0.01 --geno 0.01', but when I run the code “x <- pcadapt(input = filename, K = 20)”, it still showed “Error: Can't compute SVD. Are there SNPs or individuals with missing values only? You should use PLINK for proper data quality control.” Do you know why that is ?

privefl commented 3 years ago

How many samples/variants have you left?

chenqing-1996 commented 3 years ago

How many samples/variants have you left?

Thank you for your reply! I have 15 samples with 2,856,704 variants.

privefl commented 3 years ago

You can't get 20 PCs with 15 individuals. Your filtering is too strong.

chenqing-1996 commented 3 years ago

You can't get 20 PCs with 15 individuals. Your filtering is too strong.

I didn't understand principal component analysis well enough, and I was stupid to think that the number K of principal componentsn could take any value. So is PCs related to samples? What is a reasonable number that I should take?

privefl commented 3 years ago

I guess you should not use more than K=N/10, or somethink like that. But to get sufficient power for pcadapt, we kind of expect that you have at least 200 individuals.

chenqing-1996 commented 3 years ago

well, privefl, actually I only have 15 samples. Is small samples not suitable for analysis using pcadapt?

privefl commented 3 years ago

I don't think it is.

chenqing-1996 commented 3 years ago

Hello privefl， I have a question about the final results of outlier file, is it the sequence of SNPs in the corresponding VCF file?

privefl commented 3 years ago

The order in the input files, the .bed file (so the .bim file as well). Please open a new issue for every new (unrelated) question.

chenqing-1996 commented 3 years ago

The order in the input files, the .bed file (so the .bim file as well). Please open a new issue for every new (unrelated) question.

well,thak you so much

PhDMattyB commented 1 year ago

Hi privefl,

I have a similar question to the ones that have been posted. I filtered my data in plink using --maf 0.05, --min 0.5, --geno 0.5, and exported the data in the bed file format. I have 109 individuals as 173485 SNPs after filtering in plink. I am still getting the 'Error: can't compute SVD. Are there SNPs or individuals with missing values only?'. As far as I know the bed file should be good to go, I've not run into this error in pcadapt before. Any advice would be greatly appreciated.

privefl commented 1 year ago

I don't know actually, it does not make sense. Are you sure you're using the new bed file?

Also, you can try to have a look at the genotype matrix with pcadapt::bed2matrix().