bcm-uga / pcadapt

Performing highly efficient genome scans for local adaptation with R package pcadapt v4
https://bcm-uga.github.io/pcadapt
37 stars 10 forks source link

scores plot with individual names #18

Closed devonorourke closed 4 years ago

devonorourke commented 6 years ago

Now that you've solved my problem of loading files (thanks!!) I was wondering if there's a way to incorporate a few more details in some of the plots. Perhaps it's just a matter of manipulating the available R code you've posted on Github, but I wondered if you've had interest in a few things:

  1. In the example where we can produce a score plot by projecting various principle component axes, it's clear how to get named populations of individuals to appear on the resulting figure. In my sample dataset (uploaded at the end of this thread), I can generate this plot:

image

I would really like to know who those two outliers are way to the left!

What I'd like to have is some sort of function within pcadapt that makes this plot, which actually names the individual points.

image

I've added ggrepel to get the annotations to look a little nicer, but everything about that plot was produced by sifting through the pcadapt object and adding in the necessary individual names using ggplot. Not sure if it'd be a big trouble to add in a function that mimics your example for population names:

poplist.names <- c(rep("POP1", 50),rep("POP2", 50),rep("POP3", 50))

Maybe something like:

individ.names <- c("sample1", "sample2", "sample3", etc...)
  1. In the Manhattan plot, it would be really great to set up a few other components. Perhaps the most important would be to identify which points are the outliers of interest. Another one would be to pass a horizontal line across where you've set your false discovery rate. Here's one example of what I'm thinking:

image

What would be great is to combine the positions.txt file that was produced (maybe in an older version of pcadapt? maybe it's still produced but I'm not seeing it) with the pcadapt object so that you can identify the variants on the plot directly like I've done above. It would also be cool to add other things like gene names or other annotations.

I guess the larger point is finding ways to manipulate the existing pcadapt object. It's easy to make initial plots with, but it would be great to have a few examples in the tutorial where you can add other annotations to the existing dataset.

Thanks for your help and support!

mblumuga commented 6 years ago

Thanks @devonorourke for your comment. True that including name information can be very useful in the score plot. Can you copy on github the R script that generates your PC score plot? Will do the score plot 1st and will see for the Manhattan plot later. Best

fkatharina commented 5 years ago

hello,

may I ask you for your advice on a pcadapt issue I am thinking about? I came across this discussion here and have a question to add:

I ran pcadapt in order to identify outlier loci in my dataset that I then want to remove before further analyses.

Therefore, I would like to generate a list with loci names of all loci that seem to be outliers (I assume the correct way of identifying outliers is to set the criteria p-value < 0.05...)

I assume (because my initial vcf file contains this info as well) that the loci should be ID'd by their own name (=a number) plus the scaffold name and position:

CHROM POS ID scaffold4_cov173 30607 38:405:- etc. in the .vcf.

At the time of loading the vcf data into R, a "positions.txt" file is created in the working directory (however, this contains the position number only, not the CHROM info).

When I run pcadapt and then extract the pvalues from the resulting pcadapt object (writing it into a "pvalues.txt" file), there seems to be no way to directly extract the locus ID info that goes with each respective p-value.

I assume that the order of p-values and the positions extracted from the vcf file in the beginning might be identical, but how can I guarantee this?

Any advice would be much appreciated!

many thanks in advance and best regards,

Katharina

mblumuga commented 5 years ago

Therefore, I would like to generate a list with loci names of all loci that seem to be outliers (I assume the correct way of identifying outliers is to set the criteria p-value < 0.05...)

No, you should read the section "E. Choosing a cutoff for outlier detection" in the pcadapt tutorial

I assume that the order of p-values and the positions extracted from the vcf file in the beginning might be identical, but how can I guarantee this?

Sure, order is unchanged

fkatharina commented 5 years ago

thanks for your feedback.

bensutherland commented 4 years ago

Hi, just wondering if the labeling of pcadapt samples in the scores plot was implemented? If so, how is that done? Thank you!

mblumuga commented 4 years ago

An answer of how to use ggplot for that The code below should work path_to_file <- system.file("extdata", "geno3pops.bed", package = "pcadapt") filename <- read.pcadapt(path_to_file, type = "bed") x <- pcadapt(filename, K = 2) require(ggplot2) aux<-data.frame(pc1=x$scores[,1],pc2=x$scores[,2],names=rep("a",150)) ggplot(aux, aes(x= pc1, y= pc2), label=names) + geom_text(aes(label=names),hjust=0, vjust=0)