Global ancestry analysis with GTEx 7 Genotypes vs 1000G

gaow commented 7 years ago

Broad has release on dbGaP PCA on genotypes before imputation. Matching sample ID with self-ascertained ancestry information also found in dbGaP, I plotted PC1 vs PC2 below:

officialpca

There are clearly two dominant populations but not separated. Feeling unhappy with what I see I started doing it myself, using genotypes after imputation. I filtered SNPs to have MAF > 10% and remove SNPs in LD with each other by r^2 = 0.2. This leaves 205,092 SNPs. I then randomly selected 40K from these SNPs for PCA. The first 2 PC now looks like:

mypca

It's somewhat more worrisome now, because not only documented global ancestry does not match PCA, but also my PCA does not match with Broads.

This is an important issue because the PCs will be relevant to our downstream analysis. I'm still looking into what's going on but any thoughts / tips is welcomed!

MDS analysis

I've also tried an alternative via KING software: they use Multidimensional Scaling with the Euclidean distance. It is much faster than PCA, giving very similar results:

mymds

ETHNICITY or RACE

There are two columns in the phenotype data called RACE and ETHNICITY. The plots above are based on RACE. But with ETHNICITY similar pattern was observed (still do not align). In fact I first tried ETHNICITY because it seems to make more sense based on what the data looks like:

> summary(factor(out$ETHNCTY))
  0   1  98  99
402  14  31 305
> summary(factor(out$RACE))
  1   2   3   4  98  99
  8  94 641   3   1   5

Yet neither field aligns with either global ancestry analysis.

stephens999 commented 7 years ago

What are the labels pop1 pop2 etc?

On May 23, 2017 17:18, "gaow" notifications@github.com wrote:

Broad has release on dbGaP PCA on genotypes before imputation. Matching sample ID with self-ascertained ancestry information also found in dbGaP, I plotted PC1 vs PC2 below:

[image: officialpca] https://cloud.githubusercontent.com/assets/917985/26374337/9fd71f80-3fca-11e7-9d3c-da79b20f122b.png

There are clearly two dominant populations but not separated. Feeling unhappy with what I see I started doing it myself, using genotypes after imputation. I filtered SNPs to have MAF > 10% and remove SNPs in LD with each other by r^2 = 0.2. This leaves 205,092 SNPs. I then randomly selected 40K from these SNPs for PCA. The first 2 PC now looks like:

[image: mypca] https://cloud.githubusercontent.com/assets/917985/26378727/6a34a5b2-3fdb-11e7-9b06-a3fb351e6775.png

It's somewhat more worrisome now, because not only documented global ancestry does not match PCA, but also my PCA does not match with Broads.

This is an important issue because the PCs will be relevant to our downstream analysis. I'm still looking into what's going on but any thoughts / tips is welcomed! MDS analysis

I've also tried an alternative via KING software: they use Multidimensional Scaling with the Euclidean distance. It is much faster than PCA, giving very similar results:

[image: mymds] https://cloud.githubusercontent.com/assets/917985/26378761/9bbb7a20-3fdb-11e7-9254-480b8f617cad.png ETHNICITY or RACE

There are two columns in the phenotype data called RACE and ETHNICITY. The plots above are based on RACE. But with ETHNICITY similar pattern was observed (still do not align). In fact I first tried ETHNICITY because it seems to make more sense based on what the data looks like:

summary(factor(out$ETHNCTY)) 0 1 98 99402 14 31 305> summary(factor(out$RACE)) 1 2 3 4 98 99 8 94 641 3 1 5

Yet neither field aligns with either global ancestry analysis.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/gaow/mvarbvs/issues/15, or mute the thread https://github.com/notifications/unsubscribe-auth/ABt4xc9NVYHO7RELwR4ZHSRFPlB9L3Pgks5r81sggaJpZM4NkXU- .

gaow commented 7 years ago

That I've yet to find out -- the phenotype file does not say anything about what these numbers mean. The are simply coded as eg

SAMPID   RACE
GTEX-WXF-111D  1
...

I do not know the meaning of the RACE code, but apparently ETHNICITY and RACE are quite different:

> summary(factor(out$ETHNCTY))
  0   1  98  99
402  14  31 305
> summary(factor(out$RACE))
  1   2   3   4  98  99
  8  94 641   3   1   5

Neither of them aligns with the plot. I need to find additional sample info or email them for the labels. It is weird they do not provide it explicitly.

pcarbo commented 7 years ago

@gaow I think you will get more insight into the ETHNCTY and RACE labels if you project onto the PCs computed from the 1000 Genomes samples. I've done this before---happy to help.

gaow commented 7 years ago

@pcarbo great pointer! As long as I have 1000G in PLINK format with selected markers for PCA it should be pretty easy job -- merge with my data then run PCA. I am only familiar of the data in VCF format with all markers. Would be nice if you could point me to the data or some scripts that process them? Or I can ask people on the floor who has it?

Now I'm looking at how my procedure works on data before imputation. Some data lifting at the moment but should be done very soon.

pcarbo commented 7 years ago

@gaow Below are the data preparation steps I've taken. Note that you might not want to be so aggressive with the LD pruning as I was. It sounds like you're familiar with the PCA step, but so you know in the past I have used the rsvd R package and svdk in MATLAB.

Retrieving the genotype data

First, download the 1000 Genomes OmniExpress genotype data. This file contains genotypes for 2,401,408 single nucleotide polymorphisms (SNPs) and 2,318 samples.

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/hd_genotype_chip/ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes.vcf.gz
mv ALL.chip.omni_broad_sanger_combined.20140818.snps.genotypes.vcf.gz 1kg.vcf.gz

Next, convert the from VCF to binary PLINK (.bed/.bim/.bam) format, retaining only SNPs on autosomal chromosomes 1-22. Also, remove any SNPs in which the proportion of missing genotypes is unusually high (>5%), or in which the frequency of the minor allele is less than 1%. These steps are taken to simplify the analyses of these data, perhaps at the expense of removing some genetic markers that might expose more subtle patterns in the data.

module load plink/1.90
plink --vcf 1kg.vcf.gz --make-bed --chr 1-22 --allow-extra-chr \
  --geno 0.05 --maf 0.01 --out 1kg

This command removes 315,046 SNPs with too many missing genotypes, and an additional 281,593 SNPs with minor allele frequencies less than 1%.

Next, retain genotype data only for the variants in the dbSNP reference database (these are SNPs with ids starting with "rs"). This step is mainly taken to make the size of the data set more manageable for our statistical analyses.

grep rs 1kg.bim | grep -v SNP 1kg.bim | cut -f 2 > markers.txt
plink --bfile 1kg --make-bed --extract markers.txt --out 1kg_new

At this point, we have genotype data for 655,388 SNPs and 2,318 samples. Many of these >600,000 SNPs still contain somewhat redundant information because they are in "linkage disequilibrium" with each other; that is, they are correlated due to having experienced recombination less frequently since they are close to each other on the same chromosome. In this next step, we greedily prune out a large portion of the SNPs, again mainly done to improve computation time for subsequent analyses, while retaining as much information as possible about the patterns of variation in the sample.

plink --bfile 1kg_new --indep-pairwise 1000 500 0.25
plink --bfile 1kg_new --make-bed --extract plink.prune.in \
  --out 1kg_pruned

After this pruning step, we are left with genotypes for 156,923 SNPs.

Finally, since typical analyses of genetic data assume that none of the samples are closely related, having closely related samples in the data set can sometimes affect the results in unexpected ways. Therefore, for our computations we remove the 29 out of the 31 genotyped samples that are known to be closely related according to this table.

tail -n +2 20140625_related_individuals.txt | cut -f 1 > temp.txt
paste temp.txt temp.txt > samples.txt
plink --bfile 1kg_pruned --make-bed --remove samples.txt \
  --out 1kg_unrelated

In the end, we have genotype data at 156,923 SNPs on chromosomes 1-22 for 2,289 samples.

gaow commented 7 years ago

Cool, thanks a lot @pcarbo I'll make these commands a pipeline and see what's really going on with GTEx data / label.

pcarbo commented 7 years ago

Also, attached are the population labels for the 1000 Genomes samples.

omni_samples.20141118.panel.gz

gaow commented 7 years ago

Well it's a bit embarrassing but I did make a mistake in this notebook that resulted in column swaps for the dbGaP plot. After I fixed it the result is good:

newplot 1

I've also completed my version of the same analysis via MDS, the result are consistent:

newplot

So we are good with the original data. Let me remove the imputed sites some more and try this again (before I have kept some imputed sites). I'll keep this ticket open until verification with 1000G is done.

pcarbo commented 7 years ago

@gaow Definitely better. Projection onto 1000G PCs will be helpful for interpreting the race/ethnicity labels.

gaow commented 7 years ago

Here is MDS analysis after removing all imputed sites:

newplot 2

It looks identical to the analysis on the original data, though. I'll use this data-set for the rest of my analysis.

gaow / mvarbvs