LeBoldus-Lab / leptographium_popgen

Population genomic analysis for Leptograpium data
0 stars 2 forks source link

R script: Filtering VCF files #2

Closed Patrick-Bennett closed 5 years ago

Patrick-Bennett commented 5 years ago

Issue #1: Filtering_vcf.R The masks created to filter the variants based on depth look like this:

Step 1

Filtering by DP (DP ranges (5% < x > 95%))

cat("Step 1: Filtering by DP quantiles\n")

Creating quantiles based in the lower 5% and the higher 95%

sums <- apply(dp.vc, 2, function (x) quantile(x, probs=c(0.05, 0.50, 0.95), na.rm = T)) dp.all.2 <- sweep(dp.vc, MARGIN=2, FUN="-", sums[1,]) dp.vc[dp.all.2 <= 0] <- NA dp.all.2 <- sweep(dp.vc, MARGIN=2, FUN="-", sums[3,]) dp.vc[dp.all.2 > 0] <- NA

Creating a mask

dp.df <- dp.vc

Mask based in depth

mask.dp.1 <- (apply(is.na(dp.vc), 1, sum) == ncol(dp.vc)) == F

Maks based in polymorphisms

gt.vc[is.na(dp.df)] <- NA mask.dp.2 <- lapply(apply(gt.vc, 1, unique), function (y) length(na.exclude(y))) != 1 mask.dp <- (mask.dp.1 * mask.dp.2) == 1 #

Step 2

Filtering by minimum DP (10x)

cat("Step 2: Based on minimum DP (DP > 10)", sep = "\n")

Filtering by min DP

dp.df[dp.df < 10] <- NA

Creating a mask

mask_min.dp.1 <- (apply(is.na(dp.df), 1, sum) == ncol(dp.df)) == F

Maks based in polymorphisms

gt.vc[is.na(dp.df)] <- NA mask_min.dp.2 <- lapply(apply(gt.vc, 1, unique), function (y) length(na.exclude(y))) != 1 mask_min.dp <- (mask_min.dp.1 * mask_min.dp.2) == 1 #

The following code saves the masks created in the R script:

Saving all masks

mask.vcf <- cbind(mask.mq, mask.maf, mask.miss) #

filtered.vcf <- vcf[apply(mask.vcf, 1, sum) == 3,]

The depth masks (mask.dp and mask_min.dp) are not saved in the above script Does that mean that the filtering based on depth is not applied at the end of the script???

ISSUE #2 The following code writes a separate VCF file for each isolate, but does not make a combined VCF file that includes all of the data in one Where did the "Leptographium.filtered.vcf.gz" file come from?

fil.vcf <- nrow(filtered.vcf) if (nrow(filtered.vcf) > 0 ){ write.vcf(filtered.vcf, file = paste0("filteredVCF/", vcf.name), mask = F) } else { cat("No variants passed the filters.\n") }

cbind(vcf.name, raw.vcf, poly.vcf, fil.vcf) Issue;

Patrick-Bennett commented 5 years ago

Ultimately, I want to take the UNFILTERED, COMBINED VCF file and re-filter it to have 0 missing data instead of 0.20. One of the problems I had was that vcfR was not updated on the cluster since R was updated. I fixed this issue yesterday. HOWEVER, now I am trying to run the filtering locally on R and I am getting this error message:

Error: vector memory exhausted (limit reached?)

Tabima commented 5 years ago

At what point on the analysis does the error come up?

Patrick-Bennett commented 5 years ago

I updated to R 3.6.1 and updated my RStudio and now I cannot load any of the packages in my library such as poppr, vcfR, adegenet, etc...It's not good.

But, I think the error was coming up when I tried to create the masks.

Here is the path to the unfiltered vcf file: /nfs1/BPP/LeBoldus_Lab/user_folders/bennetpa/BSRD_popgen/Leptographium_unfiltered_variants.vcf.gz

I want to apply the masks from our filtering script to this file, but I want to make sure that the depth masks were properly applied, and I want to change the 0.20 missingness to 0.00.

Patrick-Bennett commented 5 years ago

I can install all of my R packages, but when I do library(poppr), for instance, I get the following error message related to package "Rcpp"

Loading required package: adegenet Loading required package: ade4 Error: package or namespace load failed for ‘adegenet’ in dyn.load(file, DLLpath = DLLpath, ...): unable to load shared object '/Library/Frameworks/R.framework/Versions/3.6/Resources/library/Rcpp/libs/Rcpp.so': dlopen(/Library/Frameworks/R.framework/Versions/3.6/Resources/library/Rcpp/libs/Rcpp.so, 6): Symbol not found: ___cxa_uncaught_exceptions Referenced from: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libc++.1.dylib Expected in: /usr/lib/libc++abi.dylib in /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libc++.1.dylib Error: package ‘adegenet’ could not be loaded

I have tried to re-install Rcpp and load again but it does not work. I get this error message: Error: package or namespace load failed for ‘Rcpp’ in dyn.load(file, DLLpath = DLLpath, ...): unable to load shared object '/Library/Frameworks/R.framework/Versions/3.6/Resources/library/Rcpp/libs/Rcpp.so': dlopen(/Library/Frameworks/R.framework/Versions/3.6/Resources/library/Rcpp/libs/Rcpp.so, 6): Symbol not found: ___cxa_uncaught_exceptions Referenced from: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libc++.1.dylib Expected in: /usr/lib/libc++abi.dylib in /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libc++.1.dylib

Tabima commented 5 years ago

try remove.packages("Rcpp") and reupdate.

Patrick-Bennett commented 5 years ago

Did that. Same error...

Tabima commented 5 years ago

What's the output of this: .libPaths()

Patrick-Bennett commented 5 years ago

I think I need to update Xcode?

install.packages("Rcpp", type = "source") trying URL 'https://cran.rstudio.com/src/contrib/Rcpp_1.0.2.tar.gz' Content type 'application/x-gzip' length 3699570 bytes (3.5 MB)

downloaded 3.5 MB

The downloaded source packages are in ‘/private/var/folders/sx/mc1x4r_s2bn1d6znw74g612w0000gn/T/RtmpV9jYIp/downloaded_packages’

Patrick-Bennett commented 5 years ago

[1] "/Library/Frameworks/R.framework/Versions/3.6/Resources/library"

Tabima commented 5 years ago

oh you're in your mac. Then yeah, definitely reinstall or update Xcode and try again

Patrick-Bennett commented 5 years ago

Well that was easy...just updated my MacOS and Xcode and now everything works again...

Tabima commented 5 years ago

Alright, try the script again and if there's another error let me know what it is.

Patrick-Bennett commented 5 years ago

Okay here is what's happening:

dp.df[dp.df < 10] <- NA Error: vector memory exhausted (limit reached?)

Tabima commented 5 years ago

you might be running out of RAM. Has this happened before? Do you get a result when you run df.df < 10 ?

Patrick-Bennett commented 5 years ago

I have 8GB of RAM on this laptop and have never had a problem before. Should I try to run on R through command line on fangorn?

Tabima commented 5 years ago

No. That is a weird problem. What's the output of df.df < 10 ?

Patrick-Bennett commented 5 years ago

dp.df<10 prints an output of TRUE or FALSE for each variant

Tabima commented 5 years ago

is lenght(dp.df<10) == length(dp.df) ?

Patrick-Bennett commented 5 years ago

Yes those are equal.

Remember we originally ran the filtering protocol with filtering_vcf.sh on symbiosis. I recently tried to tweak that script and run it but it failed every time...

Tabima commented 5 years ago

Hm. I'll check in on my machine in a couple of hours and get back to you.

Patrick-Bennett commented 5 years ago

Okay thanks

Tabima commented 5 years ago

Ok so:

  1. Filtering_vcf.R is supposed to be used in a choromosome by chromosome stage. Its not going to work in the whole dataset because the number of positions are way too many for any RAM. If you want to modify anything to fitler I'd do it on Filtering_vcf.R, then run Filtering_vcf.sh, and then you can concatenate all the VCF files into a signle one.

However,

I ran the filtering steps with no missing data and you only have 30 positions with NO missing data. However, if you remove the samples with more than 20% missing data (CABL1 ,CABL2 ,CABL3 ,CADC02 ,CADC03 ,CADC06 ,PlantClinic-18 ,STFO7A2-08 ,WRTF5360-10 ,WRTF5480-03 ,WRTF5480-04 ,WRTF5480-05 ,WRTF5481B-02 ,WRTF5481B-03 ,WRTF5481B-15 ,WRTF8800-01 ,WRTF8800-06 ,WRTF8800-11) you can get 4371 positions without missing data.

Tabima commented 5 years ago

for ISSUE #2: look at /nfs1/BPP/LeBoldus_Lab/user_folders/bennetpa/BSRD_popgen/Combine_vcf.sh

Patrick-Bennett commented 5 years ago

Got it. I see now why it won't work with the combined unfiltered vcf file I made.

So, that's not very good then because I want more than 30 positions, but I also don't really want to remove that many samples I don't think. I'll think about this some more and maybe we can discuss later. Thanks for the help on that issue.

Patrick-Bennett commented 5 years ago

Also are we sure that the masks for depth were applied properly in that script?

Tabima commented 5 years ago

Yeah. I chekced them and they work fine.

Javier F. Tabima R., Ph. D.

Postdoctoral scholar Spatafora Lab LeBoldus Lab

Botany and Plant Pathology Department. Oregon State University, Corvallis, OR.

On Aug 7, 2019, at 3:39 PM, bennetpa notifications@github.com wrote:

Also are we sure that the masks for depth were applied properly in that script?

— You are receiving this because you were assigned. Reply to this email directly, view it on GitHub https://github.com/LeBoldus-Lab/leptographium_popgen/issues/2?email_source=notifications&email_token=AAG3DETYB6DROY3F5MBKJ2TQDNFJ3A5CNFSM4II7W5PKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD3Z42TI#issuecomment-519294285, or mute the thread https://github.com/notifications/unsubscribe-auth/AAG3DERLOA3A3EC572WRUHLQDNFJ3ANCNFSM4II7W5PA.

Patrick-Bennett commented 5 years ago

Ok thanks again bud!