gagneurlab / drop

Pipeline to find aberrant events in RNA-Seq data, useful for diagnosis of rare disorders
MIT License
129 stars · 43 forks

Error: scanVcf: scanVcf: scanTabix: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory? #401

Closed sav0016 closed 1 year ago

sav0016 commented 1 year ago

Hi, I'm getting this error during rnavariantcalling (240 samples):

```
Attaching package: ‘GenomicScores’

The following object is masked from ‘package:utils’:

    citation

Error: scanVcf: scanVcf: scanTabix: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory?
  path: /data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/batch_vcfs/group1/group1_v39.annotated.vcf.gz
  index: /data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/batch_vcfs/group1/group1_v39.annotated.vcf.gz.tbi
  path: /data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/batch_vcfs/group1/group1_v39.annotated.vcf.gz
In addition: Warning messages:
1: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames
2: In FUN(X[[i]], ...) : duplicate ID's in header will be forced to unique rownames
3: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames
4: In FUN(X[[i]], ...) : duplicate ID's in header will be forced to unique rownames
Execution halted
[Tue Nov 29 21:15:50 2022]
Error in rule rnaVariantCalling_pipeline_countVariants_batch_data_table_R:
    jobid: 3
    output: /data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/data_tables/group1/group1_v39_data_table.Rds
    log: /data3/sav0016/drop/puente-new-all/.drop/tmp/RVC/group1/v39_RVC_data_table.Rds (check log file(s) for error message)
```

```
RuleException:
CalledProcessError in line 359 of /data2/sav0016/tmp/tmpi5xnott2:
Command 'set -euo pipefail; Rscript --vanilla /data3/sav0016/drop/puente-new-all/.snakemake/scripts/tmptvewc96c.batch_data_table.R' returned non-zero exit status 1.
  File "/data2/sav0016/tmp/tmpi5xnott2", line 359, in __rule_rnaVariantCalling_pipeline_countVariants_batch_data_table_R
  File "/data2/sav0016/miniconda3/envs/drop_env_latest/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
```

Any ideas please? Thank you J

nickhsmith commented 1 year ago

That's quite odd. Looking at the error I see `Error: scanVcf: scanVcf: scanTabix: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory?` Are you running this on a machine that has access to a large amount of RAM?

Additionally, this is the aggregation and R-processing step. Can you manually check that the VCF files seem reasonable in size and aren't empty?
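One way to do that check from R is a quick size scan (a minimal sketch; the directory path is illustrative, not taken from the thread):

```r
# Flag batch VCFs that are missing, empty, or suspiciously small.
# The directory path below is illustrative, not copied from the thread.
vcf_dir <- "Output/processed_results/rnaVariantCalling/batch_vcfs"
vcfs <- list.files(vcf_dir, pattern = "\\.vcf\\.gz$",
                   recursive = TRUE, full.names = TRUE)
sizes <- file.size(vcfs)                            # bytes; NA if unreadable
summary_df <- data.frame(file  = basename(vcfs),
                         mb    = round(sizes / 1024^2, 1),
                         empty = is.na(sizes) | sizes < 1000)
print(summary_df)
```

A zero-row result means no `.vcf.gz` files were found at all, which would itself point to an upstream problem.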

sav0016 commented 1 year ago

Yes, I have about 7200 GB of RAM available.

- `Output/processed_data/rnaVariantCalling/out/sample_vcfs/` — 240 folders containing `samplename.g.vcf.gz` files, 500–1500 MB each.
- `Output/processed_data/rnaVariantCalling/out/batch_vcfs/group1/` — a 270,000 MB `group1_all_samples.g.vcf.gz` file.
- `Output/processed_results/rnaVariantCalling/batch_vcfs/group1/group1_v39.annotated.vcf.gz` (5 GB).
- `Output/processed_results/rnaVariantCalling/sample_vcfs/group1/` — 240 folders (one per sample), each containing a `samplename.vcf.gz` (15–30 MB each).

rerun5_rnacalling.log

Attaching log ... Thanks!! J

nickhsmith commented 1 year ago

hmm, I've never seen that before.

If you open R and run these commands:

```r
library(VariantAnnotation)
snakemake <- readRDS("/data3/sav0016/drop/puente-new-all/.drop/tmp/RVC/group1/v39_RVC_data_table.Rds")
vcf <- VariantAnnotation::readVcf(snakemake@input$annotatedVCF)
```

does this work?

sav0016 commented 1 year ago

```r
> snakemake <- readRDS("/data3/sav0016/drop/puente-new-all/.drop/tmp/RVC/group1/v39_RVC_data_table.Rds")
> vcf <- VariantAnnotation::readVcf(snakemake@input$annotatedVCF)
Error: scanVcf: scanVcf: scanTabix: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory?
  path: /data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/batch_vcfs/group1/group1_v39.annotated.vcf.gz
  index: /data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/batch_vcfs/group1/group1_v39.annotated.vcf.gz.tbi
  path: /data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/batch_vcfs/group1/group1_v39.annotated.vcf.gz
In addition: Warning messages:
1: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames
2: In FUN(X[[i]], ...) : duplicate ID's in header will be forced to unique rownames
3: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames
4: In FUN(X[[i]], ...) : duplicate ID's in header will be forced to unique rownames
```

nickhsmith commented 1 year ago

That seems to fail due to memory size. Could you try one of the smaller single-sample VCFs? For example, in R:

```r
library(VariantAnnotation)
vcf <- VariantAnnotation::readVcf("/data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/sample_vcfs/041-0078-02TR/041-0078-02TR.vcf.gz")
```
nickhsmith commented 1 year ago

My other suggestion would be to call `gc()` in R to invoke the built-in garbage collector. Maybe that could help.

sav0016 commented 1 year ago

> That seems to fail due to memory size. Could you try one of the smaller single-sample VCFs? For example, in R:
>
> ```r
> library(VariantAnnotation)
> vcf <- VariantAnnotation::readVcf("/data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/sample_vcfs/041-0078-02TR/041-0078-02TR.vcf.gz")
> ```

```r
> vcf <- VariantAnnotation::readVcf("/data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/sample_vcfs/041-0078-02TR/041-0078-02TR.vcf.gz")
Warning messages:
1: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames
2: In FUN(X[[i]], ...) : duplicate ID's in header will be forced to unique rownames
3: In .bcfHeaderAsSimpleList(header) : duplicate keys in header will be forced to unique rownames
4: In FUN(X[[i]], ...) : duplicate ID's in header will be forced to unique rownames
```

sav0016 commented 1 year ago

> hmm, I've never seen that before.
>
> If you open R and run these commands:
>
> ```r
> library(VariantAnnotation)
> snakemake <- readRDS("/data3/sav0016/drop/puente-new-all/.drop/tmp/RVC/group1/v39_RVC_data_table.Rds")
> vcf <- VariantAnnotation::readVcf(snakemake@input$annotatedVCF)
> ```
>
> does this work?

I found this one: https://support.bioconductor.org/p/62940/

```r
vcffile = open(VcfFile(fl, yieldSize=10000))
repeat {
  vcf = readVcf(vcffile, "hg38")
  if (nrow(vcf) == 0) break
  # do work on chunk
}
```

Maybe it would be possible to incorporate that into the script. Or, as here https://www.rdocumentation.org/packages/VariantAnnotation/versions/1.18.5/topics/readVcf, following the "Iterate through VCF with 'yieldSize'" section ...

Or maybe it would be possible to use something like this? https://cran.r-project.org/web/packages/vcfR/vcfR.pdf using the read.vcfR function ...

What do you think? Thank you very much. J

nickhsmith commented 1 year ago

EDITED: So it does work on the small sample, which makes me think it's not a script problem but some memory/R configuration problem. Are you sure that 7200 GB is the RAM and not the storage? Because that is an immense amount of RAM and shouldn't have any problem with a 5 GB VCF file.

Did you try running `gc()` in R to invoke the garbage collector? This honestly seems like an R issue. Can you please share with me the results of these?

```r
gc()
memuse::Sys.meminfo()
```

Lastly, let me know if you find success using the VariantAnnotation yieldSize option. I haven't used it before, but you should be able to follow the instructions in the link you posted.

sav0016 commented 1 year ago

> EDITED: So it does work on the small sample, which makes me think it's not a script problem but some memory/R configuration problem. Are you sure that 7200 GB is the RAM and not the storage? Because that is an immense amount of RAM and shouldn't have any problem with a 5 GB VCF file.
>
> Did you try running `gc()` in R to invoke the garbage collector? This honestly seems like an R issue. Can you please share with me the results of these?
>
> ```r
> gc()
> memuse::Sys.meminfo()
> ```
>
> Lastly, let me know if you find success using the VariantAnnotation yieldSize option. I haven't used it before, but you should be able to follow the instructions in the link you posted.

In fact, I think it worked on the small sample. There is no "Error: scanVcf: scanVcf: scanTabix: (internal) _vcftype_grow 'sz' < 0; cannot allocate memory?", only warnings, so I believe it is not a problem with the package or R. I tried gc() but it did not help. Yes, the VCF file is about 5 GB gzipped.

```r
> gc()
          used (Mb) gc trigger (Mb) max used (Mb)
Ncells  271454 14.5     666600 35.7   410163 22.0
Vcells  543578  4.2    8388608 64.0  1821396 13.9
> memuse::Sys.meminfo()
Totalram:  7.264 TiB
Freeram:   6.912 TiB
```

I am trying yieldSize option ...

Thanks. J

sav0016 commented 1 year ago

> EDITED: So it does work on the small sample, which makes me think it's not a script problem but some memory/R configuration problem. Are you sure that 7200 GB is the RAM and not the storage? Because that is an immense amount of RAM and shouldn't have any problem with a 5 GB VCF file.
>
> Did you try running `gc()` in R to invoke the garbage collector? This honestly seems like an R issue. Can you please share with me the results of these?
>
> ```r
> gc()
> memuse::Sys.meminfo()
> ```
>
> Lastly, let me know if you find success using the VariantAnnotation yieldSize option. I haven't used it before, but you should be able to follow the instructions in the link you posted.

I changed this in batch_data_table.R:

```r
# before:
vcf <- VariantAnnotation::readVcf(snakemake@input$annotatedVCF) # read the batch vcf

# after:
fl <- snakemake@input$annotatedVCF
vcffile = open(VcfFile(fl, yieldSize=10000))
repeat {
  vcf = readVcf(vcffile, "hg38")
  if (nrow(vcf) == 0) break
  # do work on chunk
}
```

Error:

```
Attaching package: ‘GenomicScores’

The following object is masked from ‘package:utils’:

    citation

There were 50 or more warnings (use warnings() to see the first 50)
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'x' in selecting a method for function 'intersect':
  no seqlevels present in this object
Calls: %>% ... seqlevelsStyle -> seqlevelsStyle -> seqlevelsStyle -> seqlevelsStyle
Execution halted
[Thu Dec 1 22:38:15 2022]
Error in rule rnaVariantCalling_pipeline_countVariants_batch_data_table_R:
    jobid: 3
    output: /data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/data_tables/group1/group1_v39_data_table.Rds
    log: /data3/sav0016/drop/puente-new-all/.drop/tmp/RVC/group1/v39_RVC_data_table.Rds (check log file(s) for error message)

RuleException:
CalledProcessError in line 359 of /data2/sav0016/tmp/tmpz6hzgeuh:
Command 'set -euo pipefail; Rscript --vanilla /data3/sav0016/drop/puente-new-all/.snakemake/scripts/tmpy98kys38.batch_data_table.R' returned non-zero exit status 1.
  File "/data2/sav0016/tmp/tmpz6hzgeuh", line 359, in __rule_rnaVariantCalling_pipeline_countVariants_batch_data_table_R
  File "/data2/sav0016/miniconda3/envs/drop_env_latest/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2022-12-01T185350.574467.snakemake.log
```

nickhsmith commented 1 year ago

It's really hard to understand with the formatting you've done, but I will look into yieldSize as an option for efficient reading, because this seems like the best option. Otherwise, I would try to work with the variant files independently for now.

What is your end goal?

sav0016 commented 1 year ago

> It's really hard to understand with the formatting you've done, but I will look into yieldSize as an option for efficient reading, because this seems like the best option. Otherwise, I would try to work with the variant files independently for now.
>
> What is your end goal?

When I used yieldSize I got this error:

```
Error in h(simpleError(msg, call)) :
  error in evaluating the argument 'x' in selecting a method for function 'intersect':
  no seqlevels present in this object
Calls: %>% ... seqlevelsStyle -> seqlevelsStyle -> seqlevelsStyle -> seqlevelsStyle
Execution halted
```

.... Is the final table important? Otherwise I can work with the variant files independently, as you mentioned ...

Thanks J

nickhsmith commented 1 year ago

The final table is just an R table that summarizes the output: which variants are considered rare, and which samples they are found in. It has no effect on the variant calling. The final VCF files at this point are finished.

nickhsmith commented 1 year ago

I have pushed a new branch called rvc_yieldSize which you can install using the following command (or using git). It includes the changes seen here:

```shell
pip install git+https://github.com/gagneurlab/drop.git@rvc_yieldSize
```

Once you install the new features, you will have to run `drop update` from the command line to incorporate the changes into your working directory. From there you should be able to just rerun snakemake (I would use `-n` first to make sure that it doesn't want to redo all of the variant calling). If it does, let me know and I can help craft a more specific command.

This reads the variants in 100k variant batches, which should be more manageable for memory usage. Let me know if this works for you!
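The chunked-reading pattern can be sketched roughly like this (a minimal sketch assuming VariantAnnotation is installed; the file path is hypothetical, and this is not the exact code in the branch):

```r
# Sketch of chunked VCF reading with yieldSize: peak memory scales with
# the chunk size, not the total file size. Path below is hypothetical.
library(VariantAnnotation)

fl  <- "group1_v39.annotated.vcf.gz"
con <- open(VcfFile(fl, yieldSize = 100000))  # 100k variants per read
n_total <- 0
repeat {
  chunk <- readVcf(con, "hg38")
  if (nrow(chunk) == 0) break                 # zero rows means end of file
  n_total <- n_total + nrow(chunk)            # process/aggregate each chunk
}
close(con)
```

The key design point is that results must be accumulated across iterations (here just a count), since no single `chunk` ever holds the whole file.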

sav0016 commented 1 year ago

> I have pushed a new branch called rvc_yieldSize which you can install using the following command (or using git). It includes the changes seen here:
>
> ```shell
> pip install git+https://github.com/gagneurlab/drop.git@rvc_yieldSize
> ```
>
> Once you install the new features, you will have to run `drop update` from the command line to incorporate the changes into your working directory. From there you should be able to just rerun snakemake (I would use `-n` first to make sure that it doesn't want to redo all of the variant calling). If it does, let me know and I can help craft a more specific command.
>
> This reads the variants in 100k variant batches, which should be more manageable for memory usage. Let me know if this works for you!

I tried the `pip install` as you wrote it, and `drop update`; however, it did not update the script, so I manually rewrote the script according to the changes you made.

Different error: rna_Script_modified_git2 (1).log

Thanks J

nickhsmith commented 1 year ago

Thanks for running that. It seems the error is now in a different script (which is good!)

```
247   It seems your data is too big for client-side DataTables. You may consider server-side processing: https://rstudio.github.io/DT/server.html
```

I notice on line 57 that we are trying to show the entire data.table. This will be enormous, and I have since subset it to the first 1000 rows in the same branch:

```r
57 DT::datatable(
58     head(summary_dt, 1000),
59     caption = "Variant filters by GT",
60     options = list(scrollY = TRUE),
61     filter = 'top')
```

Thanks for your patience, this is the largest dataset the RNA Variant calling module has been used on, so I appreciate your help identifying errors of scale!

sav0016 commented 1 year ago

> Thanks for running that. It seems the error is now in a different script (which is good!)
>
> ```
> 247   It seems your data is too big for client-side DataTables. You may consider server-side processing: https://rstudio.github.io/DT/server.html
> ```
>
> I notice on line 57 that we are trying to show the entire data.table. This will be enormous, and I have since subset it to the first 1000 rows in the same branch:
>
> ```r
> 57 DT::datatable(
> 58     head(summary_dt, 1000),
> 59     caption = "Variant filters by GT",
> 60     options = list(scrollY = TRUE),
> 61     filter = 'top')
> ```
>
> Thanks for your patience, this is the largest dataset the RNA Variant calling module has been used on, so I appreciate your help identifying errors of scale!

rna_Script_modified_git3.log

I changed:

Line 58: I changed `summary_dt),` to `head(summary_dt,1000),`.

It did not work. Thanks for helping! I really appreciate it! J

nickhsmith commented 1 year ago

Googling this error makes it seem like we go past the allotted length of an R vector object! Because you presumably have on the order of ~10 million variants across your 240 samples, this command expands the table from 10 million x 240 into one with dimensions 2.4 billion x 2 (or something), which is too big!
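That rough estimate can be sanity-checked in base R (the 10 million figure is an estimate from the discussion, not an exact count):

```r
# Melting a variants-by-samples matrix into long format multiplies the
# two dimensions, which can exceed R's 32-bit vector index limit.
n_variants <- 10e6                 # rough estimate from the thread
n_samples  <- 240
cells <- n_variants * n_samples    # 2.4e9 rows in long format
cells > .Machine$integer.max       # TRUE: past the 2^31 - 1 limit
```

Base R supports long vectors above this limit, but many functions (and the DT rendering path) still assume 32-bit indexing, which is consistent with the failure seen here.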

I will try to think of a better way to fix this in the future, but for now I would just focus on working directly with `/data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/data_tables/group1/group1_v39_data_table.Rds` if you want to use R for filtering, for spotting a pattern in variants at a glance, or for identifying which are rare/common across your cohort.

Either that or focusing on the VCF files themselves.

Can you confirm this suspicion by reporting the following (total variants called across all samples):

```shell
zgrep -v '#' /data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/batch_vcfs/group1/group1_v39.annotated.vcf.gz | wc -l
```

sav0016 commented 1 year ago

> Googling this error makes it seem like we go past the allotted length of an R vector object! Because you presumably have on the order of ~10 million variants across your 240 samples, this command expands the table from 10 million x 240 into one with dimensions 2.4 billion x 2 (or something), which is too big!
>
> I will try to think of a better way to fix this in the future, but for now I would just focus on working directly with `/data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/data_tables/group1/group1_v39_data_table.Rds` if you want to use R for filtering, for spotting a pattern in variants at a glance, or for identifying which are rare/common across your cohort.
>
> Either that or focusing on the VCF files themselves.
>
> Can you confirm this suspicion by reporting the following (total variants called across all samples):
>
> ```shell
> zgrep -v '#' /data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/batch_vcfs/group1/group1_v39.annotated.vcf.gz | wc -l
> ```

```shell
$ zgrep -v '#' /data3/sav0016/drop/puente-new-all/Output/processed_results/rnaVariantCalling/batch_vcfs/group1/group1_v39.annotated.vcf.gz | wc -l
8988604
```

nickhsmith commented 1 year ago

I've made another push to rvc_yieldSize that reads through the results as a batch as well. Would you mind trying to update and test again?

updated Results file

sav0016 commented 1 year ago

> I've made another push to rvc_yieldSize that reads through the results as a batch as well. Would you mind trying to update and test again?
>
> updated Results file

rna_Script_modified_git4_results.log

Hi, we have a different error ... :-)

Thanks! J

nickhsmith commented 1 year ago

Ok! Did you reinstall and use `drop update`, or just change the one line in the Scripts/... function?

If you did just the script change, you will have to replace `snakemake@config$rnaVariantCalling$yieldSize` with `100000`, since this value is only built from the config-file object, which is only re-updated with the background updates in the last git push.

If you did reinstall and run `drop update`, then I will continue to think.

sav0016 commented 1 year ago

> Ok! Did you reinstall and use `drop update`, or just change the one line in the Scripts/... function?
>
> If you did just the script change, you will have to replace `snakemake@config$rnaVariantCalling$yieldSize` with `100000`, since this value is only built from the config-file object, which is only re-updated with the background updates in the last git push.
>
> If you did reinstall and run `drop update`, then I will continue to think.

Thank you. This one passed. J

nickhsmith commented 1 year ago

Hurray! Thanks so much for your patience, and I hope your DROP results are great :)