Rosemeis / pcangsd

Framework for analyzing low depth NGS data in heterogeneous populations using PCA.
GNU General Public License v3.0
46 stars 11 forks source link

NJ Tree and PCAdapt outputs #47

Closed emilya21 closed 3 years ago

emilya21 commented 3 years ago

Hello, I am very excited about some of the new features of PCAngsd. I have just run the neighbor-joining tree function and I was hoping to get some clarification of the output files as I could not find much information on this pipeline. This is what I ran:

pcangsd.py -beagle 1.2_snp.swha.maf05.het.beagle.gz -tree -tree_samples samples.txt -out swha.pcangsd.tree -threads 20

It looks like the output files are a covariance matrix and a .tree file. The .tree file looks like it is supposed to be in Newick format? But it consists of my samples names and just 0's

(4596:0.000000, (2074:0.000000, (SK1:0.000000, (1990:0.000000, (4589:0.000000, (3622:0.000000, (4585:0.000000, (SK15:0.000000, (4380:0.000000, (SK10:0.000000, (4714:0.000000, (2070:0.000000, (2088:0.000000, (2072:0.000000, (6306:0.000000, (SK12:0.000000, ....etc

Have I made a mistake or is this an error in the program? Also, can you confirm that this is Newick format and that the intention is that I then plot this tree in R?

Next, I am trying the PCAdapt function. I used the provided script to convert the z-scores to P-values in R, but if I understand correctly, the p-values should be between 0 and 1, and I have p-values ranging all the way up to 2,790. I am slightly familiar with the PCAdapt pipeline so I would also like a slight clarification on exactly how this output compares to what I would get through the PCAdapt program with genotypes. Is "K" in pcadapt.R equivalent to choosing a "K" value in PCAdapt? I tried to adjust it but got an error.

Again, thank you so much for your time and I am so excited at the possibility of being able to run pcadapt on my genotype likelihoods!

Rosemeis commented 3 years ago

Hi,

How many PCs does PCAngsd use in its optimization (it should be automatically detected when not specifying "-e")? The "-e" will also be similar to pcadapt's "K". And yes, the tree feature outputs the NJ tree in Newick tree and the values can be quite low (=0) for close samples. Hopefully it will be differerent between populations. You can try and quickly plot the tree in R to see if it makes sense.

library(ape)
tree <- read.tree("filename.tree"); plot(tree)

And for the pcadapt results it should be fine for the p-values. I'm sorry that it is a bit confusing and I'm working on a tutorial these days. But you would need to run it in following way:

Rscript $PCANGSD/scripts/pcadapt.R filename.pcadapt.zscores.npy new.filename

Then the p-values will be in the new.filename.pval.txt file. If you want a different pcadapt "K", you would unfortunately have to rerun PCAngsd with a different "-e" parameter.

I hope that I have cleared up some stuff. :-)

Best, Jonas

emilya21 commented 3 years ago

Hi Jonas, Thanks so much for getting back to me. The dataset that I was initially using has very little structure and I think this is why the newick tree had mostly zeros. I plotted it in R and it was VERY ugly, but I believe this was due to the nature of my data and not your program! For the PCAdapt function, PCAngsd only used 1 PC in the optimization, but now I understand how to change that if need be. The good news is that I have since run these two functions on a different dataset that has a lot of structure with much more success. The neighbor joining tree looks great in R and I have had success running the pcadapt.R from the command line instead of R studio. PCAdapt ended up finding > 300,000 outliers (roughly equivalent to numbers I have found before).

TL:DR Thank you for creating these new features, after your suggestions they have worked as intended!

Rosemeis commented 3 years ago

Happy to hear that it worked out! :-) Good luck with your project.

Best, Jonas