feature request: extracting SNP positions; modifying plots

devonorourke commented 6 years ago

Hello - love the program.

I was wondering if there is a sensible way to extract the chromosome and base pair position within the list generated in the output provided in the documentation. I was thinking about Fig. 5 in your earlier paper where you subsetted a Manhattan plot by chromosome.
Is it possible to generate a Manhattan plot that substitutes the p-value with a different vector from the object output from x <- pcadapt(filename, K = 2)? Again with respect to that earlier Fig. 5, you've plotted the loadings values instead of p-values; is there a single line to modify within the plot script to adjust for this?

Thanks!

keurcien commented 6 years ago

The new version will allow doing this when using PLINK format (as SNP informations are stored in the .bim file). That'll make things easier for everyone I guess.
That can be done, I need to think about how we implement this feature, probably an extra argument in the plot function. What we would like is to make the plots customizable, while avoiding users to care about the colorization.

Thanks for using GitHub, very convenient for us to have these requests listed here.

devonorourke commented 6 years ago

Thanks for the quick reply, 1) I'm sure PLINK file type conversion will be very helpful especially if you can start to provide more functionality back to the user in terms of plotting. Looking at the chromoqc function within the vcfR package, it would prove very useful to not only position these outputs along one or more chromosomes, but to be able to layer the outputs. 2) Given that the plotting outputs are part of ggplot, am I mistaken, or can you just pass the regular arguments into the function as if it's a ggplot-derived object in your output? I haven't tried this but was wondering where and if I could pass something like facet_wrap, or scale_y_continuous (whether internally within the R script called, or externally using the argument to call the R script)?

I'd imagine there is always a tradeoff between how many features you provide the user versus bugs creeping into to deal with all the features; however if all you're doing is providing a wrapper to connect the listed object to ggplot, I don't see as big a complication.

I'm used to working with data.frames when thinking about how to structure my data; in terms of future data visualization, I wonder if it would be possible to convert aspects of the list into a data.frame object so that one could more easily view portions of that pcadapt object and determine things like:

where and how many SNPs fall within the top 10% of their outliers on a per-chromosome basis
where and how many SNPs fall within the top 10% of their outliers along any chromosome
how many of these SNPs are nonsynonymous/synonymous; how many are in coding regions; etc.

Thanks!

devonorourke commented 6 years ago

Hi again, So a bit of a follow up after messing around with the object produced from x <- pcadapt(...). I pulled out a few items with the plyr package and reformatted into a data.frame:

mydf <- ldply(x[10], data.frame)
mymaf <- ldply(x[5], data.frame)
mystat <- ldply(x[7], data.frame)

Then a little reformatting:

mydf <- cbind(mypval, mymaf, mystat)
mydf <- mydf[,c(2,4,6)]
colnames(mydf) <- c("pval", "maf", "stat")

I then threw in the positions from one of your outputted files (this wasn't there last night - did you change something already?):

pos.df <- read.table("positions.txt")
finaldf <- rbind(pos.df, mydf)

And finally did the -log(10) scale for the p-values:

finaldf$nlog10pval <- -log10(mypval$X..i..)

I could then use ggplot directly to (1) put the SNPs on their actual genomic positions, and (2) set up a structure whereby I highlight the positions of the SNPs on the plot I might want to dig into further (in this case it was just their bp position, but with a more metadata in the data.frame I'd probably add some GFF annotations per position to figure out whether or not these are in genes, exons, etc.):

library(ggplot2)
p <- ggplot(finaldf, aes(bp,nlog10pval))
p1 <- p + geom_point() + geom_text(data=subset(finaldf, nlog10pval > 15), aes(bp,nlog10pval,label=bp))
ggsave("p1.pdf")

And volia. Not pretty, but it's a rough approximation of what I'm hoping we can do with your tools.

bcm-uga / pcadapt

feature request: extracting SNP positions; modifying plots #7