alachins / raisd

RAiSD: software to detect positive selection based on multiple signatures of a selective sweep and SNP vectors
33 stars 13 forks source link

Manhattan plot not created #12

Closed TinaH10 closed 4 years ago

TinaH10 commented 4 years ago

Hello,

I have tried everything in my power to run RAiSD on my dataset. It works fine for everything but using the -A option to generate the Manhattan plot. I have used both the zip file capable version and the regular version, with and without a random number. The executable is in my path. But I am not sure I understand what the manual says with:

"...to be installed and Rscript to be in the default $PATH"

The software does generate an Rscript, but it does not seem the right one for the Manhattan plot. It is the one with the 4 plots on the same page instead. But it is never actually executed with the -A option.

I just don't know what I am missing. This is my command:

./RAiSD -n GER_RE_runZIP -S ~/RAiSD/raisd/SAMPLES_ORIG/GER_RE_SAMPLES -I /usr/users/bharr/POPGENOME_DATA_ANALYSIS/AllMouse.vcf_90_recalibrated_snps_raw_indels_reheader_PopSorted.PASS.vcf -f -A 0.995

Thank you very much for your help!

alachins commented 4 years ago

Hello,

When you call RAiSD, do you see the following line (either in the terminal, or in the info file)?

Rscript: R scripting front-end version 3.6.0 (2019-04-26)

(The version might differ!)

Nikos

TinaH10 commented 4 years ago

Yes:

RAiSD, Raised Accuracy in Sweep Detection Copyright (C) 2017, and GNU GPL'd, by Nikolaos Alachiotis and Pavlos Pavlidis Contact n.alachiotis/pavlidisp at gmail.com

Command: ./RAiSD -n CR -S /usr/users/bharr/RAiSD/raisd/SAMPLES_ORIG/CR_SAMPLES -I /usr/users/bharr/POPGENOME_DATA_ANALYSIS/AllMouse.vcf_90_recalibrated_snps_raw_indels.vcf.gz___chr1_PASS_SNP.recode.vcf.gz -a 356921 -f -A 0.995 Samples: 8 [Total: 67, Not found: 0, Requested 8] Format: vcf.gz Rscript: R scripting front-end version 3.5.1 (2018-07-02)

A pattern structure of 65536 patterns (max. capacity) and approx. 2 MB memory footprint has been created.

Sets (total): 1 Sets (processed): 1 Sets (skipped): 0

Total execution time 373.02104 seconds Total memory footprint 6787355 kbytes

alachins commented 4 years ago

Can you check whether there are scores calculated in the report file? Since RAiSD finds the R scripting front-end and generates the Rscript to be called, my guess is that there are no calculated scores in the report file. The info file must include one additional line, can you provide it?

TinaH10 commented 4 years ago

yes, all scores are calculated:

cat RAiSD_Info.CR

RAiSD, Raised Accuracy in Sweep Detection Copyright (C) 2017, and GNU GPL'd, by Nikolaos Alachiotis and Pavlos Pavlidis Contact n.alachiotis/pavlidisp at gmail.com

Command: ./RAiSD -n CR -S /usr/users/bharr/RAiSD/raisd/SAMPLES_ORIG/CR_SAMPLES -I /usr/users/bharr/POPGENOME_DATA_ANALYSIS/AllMouse.vcf_90_recalibrated_snps_raw_indels.vcf.gz___chr1_PASS_SNP.recode.vcf.gz -a 356921 -f -A 0.995 Samples: 8 [Total: 67, Not found: 0, Requested 8] Format: vcf.gz Rscript: R scripting front-end version 3.5.1 (2018-07-02)

A pattern structure of 65536 patterns (max. capacity) and approx. 2 MB memory footprint has been created.

0: Set chr1 | sites 11171705 | snps 1088197 | region 195371772 - Var 85339568 2.661e+01 | SFS 5865222 3.318e+00 | LD 175311792 2.778e+01 | MuStat 130359424 1.545e+02 Sets (total): 1 Sets (processed): 1 Sets (skipped): 0

Total execution time 373.02104 seconds Total memory footprint 6787355 kbytes

TinaH10 commented 4 years ago

gwdu103:33 16:35:44 ~/RA_TEST/raisd > head RAiSD_Report.CR.chr1 3013792 3000321 3027262 3.001e+00 6.636e-01 4.706e-01 9.373e-01 3013931 3000413 3027449 3.012e+00 6.636e-01 4.375e-01 8.744e-01 3014079 3000664 3027494 2.989e+00 6.636e-01 3.603e-01 7.146e-01 3014524 3001106 3027941 2.989e+00 7.300e-01 5.478e-01 1.195e+00 3015450 3001300 3029600 3.153e+00 6.636e-01 6.926e-01 1.449e+00 3017154 3001492 3032817 3.490e+00 6.636e-01 8.078e-01 1.871e+00 3018286 3003459 3033113 3.303e+00 7.300e-01 1.201e+00 2.896e+00 3018798 3003643 3033952 3.376e+00 7.300e-01 1.027e+00 2.532e+00 3019043 3003698 3034388 3.419e+00 7.300e-01 9.643e-01 2.407e+00 3019707 3003888 3035526 3.524e+00 6.636e-01 1.379e+00 3.227e+00

TinaH10 commented 4 years ago

this particular case only has one chromosome, but I have also tried it with several chromosomes.

TinaH10 commented 4 years ago

when I don't use the -A option, the plots are properly generated RAiSD_Plot.test_run.chrY.pdf

TinaH10 commented 4 years ago

what is the exact name of the R script to be called for the Manhattan plot. I don't think that file is actually generated.

TinaH10 commented 4 years ago

can you send me the code of the Rscript for the Manhattan plot, so that I can see if this file is generated?

alachins commented 4 years ago

The Rscript file is named RSDPlot_RUNNAME.R, and is immediately deleted after plot generation. You can comment out line 369 in file RAiSD.c to prevent its deletion. Simply add // in front of the RSDPlot_removeRscript function. In any case, if you can send me the input files I can look into it in more detail.

This is the code of the Rscript that generates the Manhattan plot:

args = commandArgs(trailingOnly=TRUE) library(qqman) RUNNAME <- args[1] THRESHOLD <- args[2] reportListN <- paste("RAiSD_ReportList.", RUNNAME,".txt", sep="") reportList <- read.table(reportListN)[,1] data <- data.frame(pos=c(), value=c(), chr=c()) for (i in 1:length(reportList[])) { d <- read.table(paste(reportList[i]), header=F, skip=0)[,1:7] tmp.dat <- data.frame(pos=d[,1], value=d[,7]1, chr=rep(i, length(d[,1]))) data <- rbind(data, tmp.dat)} output_file <- paste("RAiSD_ManhattanPlot.", RUNNAME,".pdf", sep="") pdf(output_file) snp <- 1:dim(data)[1] mydf <- data.frame(snp, data) thres<-as.numeric(THRESHOLD) topQ<-thres100 threshold <- quantile(x=data$value, probs = thres) title_msg <- paste("Manhattan plot for ", RUNNAME," threshold=",threshold," (", topQ,"%)") manhattan(mydf, chr="chr", bp="pos", cex = 0.5, p="value", snp="snp", logp=F, ylab=bquote(mu ~ "statistic"), genomewideline = FALSE, suggestiveline = FALSE,pos=0, col=c("blue2", "darkorange1"), ylim=c(0.0, max(data$value, na.rm = TRUE)*1.4)) title(title_msg) abline(h=threshold, col="red", lw=2) dev.off()

TinaH10 commented 4 years ago

ok, thanks. This script is never generated. Only the other one, for the 4 plots on the same page. Even when using the -A option. I saw that it gets deleted after plot generation, that's why I check it during the run. And this script above is never generated.

You can download the infile here:

Please use this one: And as the samples file these:

Mmm_CZE1_CR12.variant5 Mmm_CZE2_CR13.variant5 Mmm_CZE3_CR16.variant5 Mmm_CZE4_CR23.variant5 Mmm_CZE5_CR25.variant5 Mmm_CZE6_CR29.variant5 Mmm_CZE7_CR46.variant5 Mmm_CZE8_CR59.variant5

Thanks!

alachins commented 4 years ago

I called the latest version of RAiSD as follows:

./RAiSD -n CR -S sampleFile.txt -I AllMouse.vcf_90_recalibrated_snps_raw_indels_reheader_PopSorted.PASS.vcf.gz -a 356921 -f -A 0.995

(I only let it run for the first 10 chromosomes)

The Rscript is being generated but just right before creating the plot and then it gets deleted immediately. So its unlikely that you will notice its generation. I am attaching it to use separately if you want. RSDPlot_CR.R.txt

I am also sending the generated manhattan plot for the 10 chromosomes. RAiSD_ManhattanPlot.CR.pdf

Note that, due to a corner case in a memory-reduction optimization, some of the generated scores might be inflated. This is because chr1 has a large number of SNPs (over 1,000,000 SNPs). This is going to be fixed in the next RAiSD release. I can provide a quick fix, adapted for your data, in the next days.

TinaH10 commented 4 years ago

thank you. This is so weird. Why do you suggest is the Manhattan plot not generated for me then? Could it have to do with the R version that my cluster has?

Or that I did not set the PATH properly?

I did download the latest version or RAiSD.

Thank you again, also for the quick fix. Will be looking forward to it!

TinaH10 commented 4 years ago

Ahhhh! I found the problem!!

R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R.

library(qqman) Error in library(qqman) : there is no package called ‘qqman’

My R does not have qqman installed!!

TinaH10 commented 4 years ago

Note that, due to a corner case in a memory-reduction optimization, some of the generated scores might be inflated. This is because chr1 has a large number of SNPs (over 1,000,000 SNPs). This is going to be fixed in the next RAiSD release. I can provide a quick fix, adapted for your data, in the next days.

Hello, I was just wondering if there is any news on the quick fix for the inflated scores. I would like to move on with this project if at all possible. I would also be happy to calculate new scores myself from the outfiles I have obtained, if you could let me know how to do this. Thank you very much for your help.

alachins commented 4 years ago

I just repeated part of the runs for your data. It seems that the memory-reduction optimization is not triggering the corner case I mentioned in my previous message because the sample list you use it rather small (8 samples). How big in terms of number of samples is the largest sample list you need to use with your datasets?

TinaH10 commented 4 years ago

thank you. The largest population is 12 individuals.

alachins commented 4 years ago

Okay. This will definitely trigger the corner case I mentioned. You can do the following: In file RAiSD.h (sources folder), replace line 64 with:

define PATTERNPOOL_SIZE 200

(Essentially change 1 to 200)

This will make RAiSD run slightly slower, which will be properly fixed to run fast in an upcoming RAiSD release.

Also, you can include a check in RAiSD.c under line 231. Copy-paste this: assert(RSDChunk->chunkID==0);

Then "make clean -f MakefileZLIB" and then "make -f MakefileZLIB" again.

TinaH10 commented 4 years ago

ok, thank you. I'll do that!

TinaH10 commented 4 years ago

oh, but just to be clear, for the populations where I have just 8 individuals, I can run the original version?

alachins commented 4 years ago

yes, for anything larger than that use the above modifications.

TinaH10 commented 4 years ago

thank you will do. One more question, maybe outside the scope of your software. I have a hard time viewing the manhattan plot pdf, it is huge and contains many points. Is there a way (maybe within R) to downscale the size of this file?

alachins commented 4 years ago

I usually use mupdf to open large pdf files. It handles larger files better. Also you can convert it to ps and then back to pdf using ps2pdf and the following arg: -dPDFSETTINGS=/screen It works well most of the times. I am not sure about what could be done within R, unless you mean modifying the R script.

caonetto commented 3 years ago

Just wondering if this problem with R ever got resolved? I am using the -A option however the plot is not being generated and the temporal R script does not contain the details to produce the actual plots with the quantile values. The run seems to finish normally it just doesnt create the appropiate plots. Qqman is installed by the way.

Thanks.

alachins commented 3 years ago

As far as I know the -A option works fine. Can you use the R script (provided in this thread) separately, after the run, to generate a plot for some random quantile value? If yes, I can check how to make RAiSD print the threshold needed by the rscript.

caonetto commented 3 years ago

Ive managed to make it work locally. Thanks!

tbringloe commented 3 years ago

I had a similar issue when trying to get Manhattan plots. In case this helps, the R script loop was breaking due to a couple reports with no output (must have been no SNPs on those scaffolds?). After removing those (empty) reports from my report list the R script worked as expected.