Closed sav0016 closed 1 year ago
That's quite odd. Looking at the error I see:
Error: scanVcf: scanVcf: scanTabix: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory?
Are you running this on a machine that has access to a large amount of RAM?
Additionally, this error occurs in the aggregation and R-processing step. Can you manually check that the VCF files are reasonable in size and aren't empty?
Yes, I have about 7200 GB of RAM available.
Output/processed_data/rnaVariantCalling/out/sample_vcfs/ - 240 folders, each containing a samplename.g.vcf.gz of 500-1500 MB.
Output/processed_data/rnaVariantCalling/out/batch_vcfs/group1/ - one 270,000 MB (~270 GB) file, group1_all_samples.g.vcf.gz.
Output/processed_results/rnaVariantCalling/batch_vcfs/group1/group1_v39.annotated.vcf.gz (5 GB).
Output/processed_results/rnaVariantCalling/sample_vcfs/group1/ - 240 folders (one per sample), each containing a samplename.vcf.gz of 15-30 MB.
Attaching the log: rerun5_rnacalling.log
Thanks!! J
Hmm, I've never seen that before.
If you open R and run these commands:
library(VariantAnnotation)
snakemake <- readRDS("/data3/sav0016/drop/puente-new-all/.drop/tmp/RVC/group1/v39_RVC_data_table.Rds")
vcf <- VariantAnnotation::readVcf(snakemake@input$annotatedVCF)
does this work?
snakemake <- readRDS("/data3/sav0016/drop/puente-new-all/.drop/tmp/RVC/group1/v39_RVC_data_table.Rds")
vcf <- VariantAnnotation::readVcf(snakemake@input$annotatedVCF)
Error: scanVcf: scanVcf: scanTabix: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory?
path: /data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/batch_vcfs/group1/group1_v39.annotated.vcf.gz
index: /data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/batch_vcfs/group1/group1_v39.annotated.vcf.gz.tbi
path: /data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/batch_vcfs/group1/group1_v39.annotated.vcf.gz
In addition: Warning messages:
1: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames
2: In FUN(X[[i]], ...) : duplicate ID's in header will be forced to unique rownames
3: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames
4: In FUN(X[[i]], ...) : duplicate ID's in header will be forced to unique rownames
That seems to fail due to memory size. Could you try one of the smaller single-sample VCFs?
For example, in R:
library(VariantAnnotation)
vcf <- VariantAnnotation::readVcf("/data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/sample_vcfs/041-0078-02TR/041-0078-02TR.vcf.gz")
My other suggestion would be to call gc() in R to invoke the built-in garbage collector. Maybe that could help.
vcf <- VariantAnnotation::readVcf("/data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/sample_vcfs/041-0078-02TR/041-0078-02TR.vcf.gz")
Warning messages:
1: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames
2: In FUN(X[[i]], ...) : duplicate ID's in header will be forced to unique rownames
3: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames
4: In FUN(X[[i]], ...) : duplicate ID's in header will be forced to unique rownames
I found this one: https://support.bioconductor.org/p/62940/
vcffile = open(VcfFile(fl, yieldSize=10000))
repeat {
  vcf = readVcf(vcffile, "hg38")
  if (nrow(vcf) == 0) break
}
Maybe it would be possible to incorporate this into the script... Or, as here https://www.rdocumentation.org/packages/VariantAnnotation/versions/1.18.5/topics/readVcf, using the "Iterate through VCF with 'yieldSize'" section...
Or maybe it would be possible to use something like this? https://cran.r-project.org/web/packages/vcfR/vcfR.pdf, using the read.vcfR function...
What do you think? Thank you very much. J
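Pulling the pieces from that Bioconductor post together, a minimal standalone sketch of the chunked read might look like the following. The file path and the 10,000-record chunk size are placeholders, and the per-chunk work here just counts records; it is a sketch of the yieldSize idea, not the pipeline's actual script.

```r
library(VariantAnnotation)

# Open the VCF with a yieldSize so each readVcf() call returns one chunk
# instead of loading the whole file into memory at once.
fl <- "group1_v39.annotated.vcf.gz"              # placeholder path
vcffile <- open(VcfFile(fl, yieldSize = 10000))  # 10k records per chunk (assumption)

total <- 0L
repeat {
  vcf <- readVcf(vcffile, genome = "hg38")
  if (nrow(vcf) == 0) break      # no records left: stop iterating
  total <- total + nrow(vcf)     # replace with real per-chunk processing
}
close(vcffile)
total                            # total number of variant records seen
```

Peak memory is then bounded by one chunk rather than by the whole 5 GB file.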
EDITED: So it does work on the small sample, which makes me think it's not a script problem but some memory/R configuration problem. Are you sure that 7200 GB is RAM, and not storage? That is an immense amount of RAM and shouldn't have any problem with a 5 GB VCF file.
Did you try running gc() in R to invoke the garbage collector? Honestly, this seems like an R issue. Can you please share with me the results of these?
gc()
memuse::Sys.meminfo()
Lastly, let me know if you find success using the VariantAnnotation yieldSize option. I haven't used it before, but you should be able to follow the instructions in the link you posted.
In fact, I think it worked on the small sample. There is no "Error: scanVcf: scanVcf: scanTabix: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory?", only warnings... So I believe it is not a problem with the package or R. I tried gc(), but it did not help. Yes, the VCF file is gzipped, about 5 GB.
gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  271454 14.5     666600 35.7   410163 22.0
Vcells  543578  4.2    8388608 64.0  1821396 13.9
memuse::Sys.meminfo()
Totalram: 7.264 TiB
Freeram: 6.912 TiB
I am trying the yieldSize option...
Thanks. J
I changed this in batch_data_table.R:
fl <- snakemake@input$annotatedVCF
vcffile = open(VcfFile(fl, yieldSize=10000))
repeat {
  vcf = readVcf(vcffile, "hg38")
  if (nrow(vcf) == 0) break
}
Attaching package: ‘GenomicScores’
The following object is masked from ‘package:utils’:
    citation
There were 50 or more warnings (use warnings() to see the first 50)
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'x' in selecting a method for function 'intersect': no seqlevels present in this object
Calls: %>% ... seqlevelsStyle -> seqlevelsStyle -> seqlevelsStyle -> seqlevelsStyle
Execution halted
[Thu Dec 1 22:38:15 2022]
Error in rule rnaVariantCalling_pipeline_countVariants_batch_data_table_R:
    jobid: 3
    output: /data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/data_tables/group1/group1_v39_data_table.Rds
    log: /data3/sav0016/drop/puente-new-all/.drop/tmp/RVC/group1/v39_RVC_data_table.Rds (check log file(s) for error message)
RuleException: CalledProcessError in line 359 of /data2/sav0016/tmp/tmpz6hzgeuh:
Command 'set -euo pipefail; Rscript --vanilla /data3/sav0016/drop/puente-new-all/.snakemake/scripts/tmpy98kys38.batch_data_table.R' returned non-zero exit status 1.
  File "/data2/sav0016/tmp/tmpz6hzgeuh", line 359, in __rule_rnaVariantCalling_pipeline_countVariants_batch_data_table_R
  File "/data2/sav0016/miniconda3/envs/drop_env_latest/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2022-12-01T185350.574467.snakemake.log
It's really hard to read with the formatting you've used, but I will look into yieldSize as an option for efficient reading, because this seems like the best option. Otherwise, I would try to work with the variant files independently for now.
What is your end goal?
When I used yieldSize, I got this error:
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'x' in selecting a method for function 'intersect': no seqlevels present in this object
Calls: %>% ... seqlevelsStyle -> seqlevelsStyle -> seqlevelsStyle -> seqlevelsStyle
Execution halted
... Is the final table important? Otherwise, I can work with the variant files independently, as you mentioned...
Thanks J
The final table is just an R table that summarizes the output: which variants are considered rare, or which are found in which samples. It has no effect on the variant calling. The final VCF files at this point are finished.
I have pushed a new branch called rvc_yieldSize, which you can install using the following command (or using git). It includes the changes seen here:
pip install git+https://github.com/gagneurlab/drop.git@rvc_yieldSize
Once you install the new features, you will have to run drop update from the command line to incorporate the changes into your working directory. From there you should be able to just rerun snakemake (I would use -n first to make sure that it doesn't want to redo all of the variant calling). If it does, let me know and I can help craft a more specific command.
This reads the variants in 100k-variant batches, which should be more manageable for memory usage. Let me know if this works for you!
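The batched-read idea can be sketched in isolation as follows. This is a sketch under assumptions, not DROP's actual code: the path is a placeholder, and the per-batch genotype tally stands in for whatever summary the real script accumulates.

```r
library(VariantAnnotation)

# Read the annotated VCF in 100k-variant batches and keep only a small
# per-batch summary, so the full variants-by-samples table never sits in
# memory all at once.
fl <- "group1_v39.annotated.vcf.gz"               # placeholder path
vcffile <- open(VcfFile(fl, yieldSize = 100000))  # 100k records per batch

gt_counts <- integer(0)
repeat {
  vcf <- readVcf(vcffile, genome = "hg38")
  if (nrow(vcf) == 0) break
  gt <- table(geno(vcf)$GT)                       # genotype tally for this batch
  for (g in names(gt)) {                          # merge into the running totals
    gt_counts[g] <- sum(gt_counts[g], gt[[g]], na.rm = TRUE)
  }
}
close(vcffile)
gt_counts  # cohort-wide genotype counts, built batch by batch
```

The key design point is that each batch's result is reduced to a small object before the next batch is read.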
I tried the pip install as you wrote, and drop update; however, it did not update the script... So I manually rewrote the script according to the changes you made.
Different error: rna_Script_modified_git2 (1).log
Thanks J
Thanks for running that. It seems the error is now in a different script (which is good!)
Line 247: It seems your data is too big for client-side DataTables. You may consider server-side processing: https://rstudio.github.io/DT/server.html
I notice on line 57 that we are trying to show the entire data.table. This will be enormous, and I have since subset it to the first 1000 rows in the same branch:
57 DT::datatable(
58 head(summary_dt,1000),
59 caption = "Variant filters by GT",
60 options=list(scrollY=TRUE),
61 filter = 'top')
Thanks for your patience, this is the largest dataset the RNA Variant calling module has been used on, so I appreciate your help identifying errors of scale!
I changed line 58 from summary_dt), to head(summary_dt,1000),.
It did not work. Thanks for helping! I really appreciate it! J
Googling this error makes it seem like we go past the allotted length of an R vector object! Because you presumably have on the order of ~10 million variants across your 240 samples, this command expands the table from 10 million x 240 into a table with dimensions of roughly 2.4 billion x 2, which is too big!
I will try to think of a better way to fix this in the future, but for now I would just focus on working directly with /data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/data_tables/group1/group1_v39_data_table.Rds if you want to use R for filtering, spotting variant patterns at a glance, or identifying which variants are rare/common across your cohort. Either that, or focus on the VCF files themselves.
Can you confirm this suspicion by reporting the following (total variants called across all samples)?
zgrep -v '#' /data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/batch_vcfs/group1/group1_v39.annotated.vcf.gz | wc -l
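The back-of-the-envelope arithmetic behind this suspicion can be checked directly in R; the ~10 million variant figure is an assumed order of magnitude, not a measured count.

```r
# Melting a ~10M-variant x 240-sample table into long format yields one row
# per (variant, sample) pair, which overflows the classic 2^31 - 1 length
# limit for 32-bit-indexed R objects.
n_variants <- 10e6                 # assumed order of magnitude
n_samples  <- 240
pairs <- n_variants * n_samples    # 2.4e9 entries in the expanded table
limit <- .Machine$integer.max      # 2147483647
pairs > limit                      # TRUE: past the 32-bit index limit
```

Base R supports long vectors above this limit, but many operations (and data.table/data.frame row indexing in older code paths) still assume 32-bit lengths, which is consistent with the error seen here.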
zgrep -v '#' /data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/batch_vcfs/group1/group1_v39.annotated.vcf.gz | wc -l
8988604
I've made another push to rvc_yieldSize that reads through the results in batches as well. Would you mind updating and testing again?
rna_Script_modified_git4_results.log
Hi, we have a different error ... :-)
Thanks! J
Ok! Did you reinstall and use drop update, or just change the one line in the Scripts/... function?
If you did just the script change, you will have to replace snakemake@config$rnaVariantCalling$yieldSize with 100000, since this value is only built from the config file object, which is only re-updated by the background updates in the last git push.
If you did reinstall and run drop update, then I will continue to think.
Thank you. This one passed. J
Hurray! Thanks so much for your patience, and I hope your DROP results are great :)
Hi, I'm getting this error during RNA variant calling (240 samples):
Attaching package: ‘GenomicScores’
The following object is masked from ‘package:utils’:
    citation
Error: scanVcf: scanVcf: scanTabix: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory?
path: /data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/batch_vcfs/group1/group1_v39.annotated.vcf.gz
index: /data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/batch_vcfs/group1/group1_v39.annotated.vcf.gz.tbi
path: /data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/batch_vcfs/group1/group1_v39.annotated.vcf.gz
In addition: Warning messages:
1: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames
2: In FUN(X[[i]], ...) : duplicate ID's in header will be forced to unique rownames
3: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames
4: In FUN(X[[i]], ...) : duplicate ID's in header will be forced to unique rownames
Execution halted
[Tue Nov 29 21:15:50 2022]
Error in rule rnaVariantCalling_pipeline_countVariants_batch_data_table_R:
    jobid: 3
    output: /data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/data_tables/group1/group1_v39_data_table.Rds
    log: /data3/sav0016/drop/puente-new-all/.drop/tmp/RVC/group1/v39_RVC_data_table.Rds (check log file(s) for error message)
RuleException: CalledProcessError in line 359 of /data2/sav0016/tmp/tmpi5xnott2:
Command 'set -euo pipefail; Rscript --vanilla /data3/sav0016/drop/puente-new-all/.snakemake/scripts/tmptvewc96c.batch_data_table.R' returned non-zero exit status 1.
  File "/data2/sav0016/tmp/tmpi5xnott2", line 359, in __rule_rnaVariantCalling_pipeline_countVariants_batch_data_table_R
  File "/data2/sav0016/miniconda3/envs/drop_env_latest/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Any ideas please? Thank you J