Closed Patrick-Bennett closed 5 years ago
Ultimately, I want to take the UNFILTERED, COMBINED VCF file and re-filter it to have 0 missing data instead of 0.20. One of the problems I had was that vcfR was not updated on the cluster since R was updated. I fixed this issue yesterday. HOWEVER, now I am trying to run the filtering locally on R and I am getting this error message:
Error: vector memory exhausted (limit reached?)
At what point on the analysis does the error come up?
I updated to R 3.6.1 and updated my RStudio and now I cannot load any of the packages in my library such as poppr, vcfR, adegenet, etc...It's not good.
But, I think the error was coming up when I tried to create the masks.
Here is the path to the unfiltered vcf file: /nfs1/BPP/LeBoldus_Lab/user_folders/bennetpa/BSRD_popgen/Leptographium_unfiltered_variants.vcf.gz
I want to apply the masks from our filtering script to this file, but I want to make sure that the depth masks were properly applied, and I want to change the 0.20 missingness to 0.00.
I can install all of my R packages, but when I do library(poppr), for instance, I get the following error message related to package "Rcpp"
Loading required package: adegenet Loading required package: ade4 Error: package or namespace load failed for ‘adegenet’ in dyn.load(file, DLLpath = DLLpath, ...): unable to load shared object '/Library/Frameworks/R.framework/Versions/3.6/Resources/library/Rcpp/libs/Rcpp.so': dlopen(/Library/Frameworks/R.framework/Versions/3.6/Resources/library/Rcpp/libs/Rcpp.so, 6): Symbol not found: ___cxa_uncaught_exceptions Referenced from: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libc++.1.dylib Expected in: /usr/lib/libc++abi.dylib in /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libc++.1.dylib Error: package ‘adegenet’ could not be loaded
I have tried to re-install Rcpp and load again but it does not work. I get this error message: Error: package or namespace load failed for ‘Rcpp’ in dyn.load(file, DLLpath = DLLpath, ...): unable to load shared object '/Library/Frameworks/R.framework/Versions/3.6/Resources/library/Rcpp/libs/Rcpp.so': dlopen(/Library/Frameworks/R.framework/Versions/3.6/Resources/library/Rcpp/libs/Rcpp.so, 6): Symbol not found: ___cxa_uncaught_exceptions Referenced from: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libc++.1.dylib Expected in: /usr/lib/libc++abi.dylib in /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libc++.1.dylib
try remove.packages("Rcpp") and reupdate.
Did that. Same error...
What's the output of this: .libPaths()
I think I need to update Xcode?
install.packages("Rcpp", type = "source") trying URL 'https://cran.rstudio.com/src/contrib/Rcpp_1.0.2.tar.gz' Content type 'application/x-gzip' length 3699570 bytes (3.5 MB)
downloaded 3.5 MB
xcode-select --switch path/to/Xcode.app
to specify the Xcode that you wish to use for command line developer tools (or see man xcode-select
)
ERROR: compilation failed for package ‘Rcpp’The downloaded source packages are in ‘/private/var/folders/sx/mc1x4r_s2bn1d6znw74g612w0000gn/T/RtmpV9jYIp/downloaded_packages’
[1] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library"
oh you're in your mac. Then yeah, definitely reinstall or update Xcode and try again
Well that was easy...just updated my MacOS and Xcode and now everything works again...
Alright, try the script again and if there's another error let me know what it is.
Okay here is what's happening:
dp.df[dp.df < 10] <- NA Error: vector memory exhausted (limit reached?)
you might be running out of RAM. Has this happened before? Do you get a result when you run df.df < 10 ?
I have 8GB of RAM on this laptop and have never had a problem before. Should I try to run on R through command line on fangorn?
No. That is a weird problem. What's the output of df.df < 10 ?
dp.df<10 prints an output of TRUE or FALSE for each variant
is lenght(dp.df<10) == length(dp.df) ?
Yes those are equal.
Remember we originally ran the filtering protocol with filtering_vcf.sh on symbiosis. I recently tried to tweak that script and run it but it failed every time...
Hm. I'll check in on my machine in a couple of hours and get back to you.
Okay thanks
Ok so:
Filtering_vcf.R
is supposed to be used in a choromosome by chromosome stage. Its not going to work in the whole dataset because the number of positions are way too many for any RAM. If you want to modify anything to fitler I'd do it on Filtering_vcf.R
, then run Filtering_vcf.sh
, and then you can concatenate all the VCF files into a signle one.However,
I ran the filtering steps with no missing data and you only have 30 positions with NO missing data. However, if you remove the samples with more than 20% missing data (CABL1 ,CABL2 ,CABL3 ,CADC02 ,CADC03 ,CADC06 ,PlantClinic-18 ,STFO7A2-08 ,WRTF5360-10 ,WRTF5480-03 ,WRTF5480-04 ,WRTF5480-05 ,WRTF5481B-02 ,WRTF5481B-03 ,WRTF5481B-15 ,WRTF8800-01 ,WRTF8800-06 ,WRTF8800-11
) you can get 4371 positions without missing data.
for ISSUE #2
: look at /nfs1/BPP/LeBoldus_Lab/user_folders/bennetpa/BSRD_popgen/Combine_vcf.sh
Got it. I see now why it won't work with the combined unfiltered vcf file I made.
So, that's not very good then because I want more than 30 positions, but I also don't really want to remove that many samples I don't think. I'll think about this some more and maybe we can discuss later. Thanks for the help on that issue.
Also are we sure that the masks for depth were applied properly in that script?
Javier F. Tabima R., Ph. D.
Postdoctoral scholar Spatafora Lab LeBoldus Lab
Botany and Plant Pathology Department. Oregon State University, Corvallis, OR.
On Aug 7, 2019, at 3:39 PM, bennetpa notifications@github.com wrote:
Also are we sure that the masks for depth were applied properly in that script?
— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/LeBoldus-Lab/leptographium_popgen/issues/2?email_source=notifications&email_token=AAG3DETYB6DROY3F5MBKJ2TQDNFJ3A5CNFSM4II7W5PKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD3Z42TI#issuecomment-519294285, or mute the thread https://github.com/notifications/unsubscribe-auth/AAG3DERLOA3A3EC572WRUHLQDNFJ3ANCNFSM4II7W5PA.
Ok thanks again bud!
Issue #1: Filtering_vcf.R The masks created to filter the variants based on depth look like this:
Step 1
Filtering by DP (DP ranges (5% < x > 95%))
cat("Step 1: Filtering by DP quantiles\n")
Creating quantiles based in the lower 5% and the higher 95%
sums <- apply(dp.vc, 2, function (x) quantile(x, probs=c(0.05, 0.50, 0.95), na.rm = T)) dp.all.2 <- sweep(dp.vc, MARGIN=2, FUN="-", sums[1,]) dp.vc[dp.all.2 <= 0] <- NA dp.all.2 <- sweep(dp.vc, MARGIN=2, FUN="-", sums[3,]) dp.vc[dp.all.2 > 0] <- NA
Creating a mask
dp.df <- dp.vc
Mask based in depth
mask.dp.1 <- (apply(is.na(dp.vc), 1, sum) == ncol(dp.vc)) == F
Maks based in polymorphisms
gt.vc[is.na(dp.df)] <- NA mask.dp.2 <- lapply(apply(gt.vc, 1, unique), function (y) length(na.exclude(y))) != 1 mask.dp <- (mask.dp.1 * mask.dp.2) == 1 #
Step 2
Filtering by minimum DP (10x)
cat("Step 2: Based on minimum DP (DP > 10)", sep = "\n")
Filtering by min DP
dp.df[dp.df < 10] <- NA
Creating a mask
mask_min.dp.1 <- (apply(is.na(dp.df), 1, sum) == ncol(dp.df)) == F
Maks based in polymorphisms
gt.vc[is.na(dp.df)] <- NA mask_min.dp.2 <- lapply(apply(gt.vc, 1, unique), function (y) length(na.exclude(y))) != 1 mask_min.dp <- (mask_min.dp.1 * mask_min.dp.2) == 1 #
The following code saves the masks created in the R script:
Saving all masks
mask.vcf <- cbind(mask.mq, mask.maf, mask.miss) #
filtered.vcf <- vcf[apply(mask.vcf, 1, sum) == 3,]
The depth masks (mask.dp and mask_min.dp) are not saved in the above script Does that mean that the filtering based on depth is not applied at the end of the script???
ISSUE #2 The following code writes a separate VCF file for each isolate, but does not make a combined VCF file that includes all of the data in one Where did the "Leptographium.filtered.vcf.gz" file come from?
fil.vcf <- nrow(filtered.vcf) if (nrow(filtered.vcf) > 0 ){ write.vcf(filtered.vcf, file = paste0("filteredVCF/", vcf.name), mask = F) } else { cat("No variants passed the filters.\n") }
cbind(vcf.name, raw.vcf, poly.vcf, fil.vcf) Issue;