Phenotype Simulator: RAM requirements

beatrixOngit commented 5 years ago

Describe the bug I am simulating a relatively larger genomic dataset for a follow-up GWAS analysis. While I was able to run an entire simulation with a smaller population when I try to simulate a real-life organism (related to my work) I run out of RAM (32Gb on an AWS EC2 at the moment) pretty much at readStandardGenotypes call.

To Reproduce Here is a snippet of my code: library(PhenotypeSimulator)

Read Genotype file, as output from GENOME

N <- 10000 fileName = "outputOct14.txt" genotypefile <- system.file("extdata/genotypes/genome",fileName,package = "PhenotypeSimulator")

Run read StandardGenotypes method from PhenotypeSimulator Library

a <- readStandardGenotypes(N = N, genotypefile, format ="genome", verbose = FALSE, sampleID = "ID", snpID = "SNP", delimiter = ",") 5-15 phenotypes, causal Number of SNPs = 200, variance = 0.6

phenotype <- runSimulation(N = N, P = 5, genotypefile = genotypefile, format = "genome", cNrSNP = 200, genVar = 0.6, h2s = 0.01666, phi = 0.6, delta = 0.3, distBetaGenetic = "unif", mBetaGenetic = 0.5, sdBetaGenetic = 1, NrFixedEffects = 4,
distConfounders = c("bin", "cat_norm", "cat_unif", "norm"), probConfounders = 0.2, catConfounders = c(3, 4), pcorr = 0.8, verbose = FALSE);

Error messages The kernel for the R script on Jupyter either dies or gets killed when run from the Command Line Interface.

Expected behavior Finish the simulation to be able to generate plink and gemma files for GWAS

OS: [e.g. iOS] MacOS, AWS - server, ubuntu, EC2 with 32Gb RAM. processors = 8, shell and Jupyter Notebook

Additional

Any insights into what I should aim for in terms of RAM to be able to run this set of command?
An additional question I had was on reducing the time it takes to print the outputs to gemma and plink files. Is this something that anyone else has encountered as well?

Any help is appreciated.

HannahVMeyer commented 5 years ago

Could you please provide the sessionInfo(). What are the dimensions of your genotype matrix that you are trying to read with readStandardGenotypes? How large is the actual file? In general, I would not recommend storing your own data files in the directory of the package. Is that what you intended to do?

beatrixOngit commented 5 years ago

Session Info R version 3.5.1 (2018-07-02) Platform: x86_64-conda_cos6-linux-gnu (64-bit) Running under: Ubuntu 18.04.2 LTS

Matrix products: default BLAS/LAPACK: /home/ubuntu/miniconda3/lib/R/lib/libRblas.so

locale: [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

Genome file size is about 5Gb and Total number of SNPs = 529311.

beatrixOngit commented 5 years ago

I am not sure of the genome matrix that you mentioned

HannahVMeyer commented 5 years ago

Thanks for getting back. The sessionInfo() you sent does not contain any loaded packages. Can you please run the command after you did library(PhenotypeSimulator).

I am referring to the genotype matrix you are trying to read with:

fileName = "outputOct14.txt"
genotypefile <- system.file("extdata/genotypes/genome",fileName,package = "PhenotypeSimulator")

This would mean, the genotypes that you are trying to read are in the file: /path/2/tour/Rlibrary/folder/PhenotypeSimulator/extdata/genotypes/genome/outputOct14.txt

beatrixOngit commented 5 years ago

Entire Session Info R version 3.5.1 (2018-07-02) Platform: x86_64-conda_cos6-linux-gnu (64-bit) Running under: Ubuntu 18.04.2 LTS

Matrix products: default BLAS/LAPACK: /home/ubuntu/miniconda3/lib/R/lib/libRblas.so

locale: [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] PhenotypeSimulator_0.3.3

loaded via a namespace (and not attached): [1] Rcpp_1.0.2 plyr_1.8.4 compiler_3.5.1
[4] pillar_1.4.2 R.methodsS3_1.7.1 R.utils_2.9.0
[7] base64enc_0.1-3 tools_3.5.1 zlibbioc_1.28.0
[10] digest_0.6.15 uuid_0.1-2 lattice_0.20-38
[13] jsonlite_1.5 evaluate_0.11 tibble_2.1.3
[16] gtable_0.3.0 pkgconfig_2.0.2 rlang_0.4.0
[19] Matrix_1.2-17 IRdisplay_0.5.0 parallel_3.5.1
[22] IRkernel_0.8.12 repr_0.15.0 stringr_1.3.1
[25] dplyr_0.8.3 snpStats_1.32.0 getopt_1.20.3
[28] cowplot_1.0.0 grid_3.5.1 tidyselect_0.2.5
[31] optparse_1.6.2 glue_1.3.0 R6_2.4.0
[34] survival_2.44-1.1 pbdZMQ_0.3-3 reshape2_1.4.3
[37] ggplot2_3.2.1 purrr_0.3.2 magrittr_1.5
[40] BiocGenerics_0.28.0 scales_1.0.0 htmltools_0.3.6
[43] splines_3.5.1 assertthat_0.2.1 colorspace_1.4-1
[46] stringi_1.2.4 lazyeval_0.2.2 munsell_0.5.0
[49] crayon_1.3.4 R.oo_1.22.0

beatrixOngit commented 5 years ago

The reason I keep copying my genotype file (output of genome, with 529311 SNPs X 10000 Individuals) is that if I provide my own path in system.file("extdata/genotypes/genome") I get an error about encoding. I can post the error shortly - it only goes away when the file path is exactly as above.

Genotype file dimensions: output of Genome, with 529311 SNPs X 10000 Individuals

HannahVMeyer commented 5 years ago

You do no thave to use sytem.file to provide the filepath. I only use this so people can use the example data provided by the package and test the input/output functions. You should be able to simply use

genotypefile <-  "path/to/outputOct14.txt"
a <- readStandardGenotypes(N = N, genotypefile, format ="genome",
verbose = FALSE, sampleID = "ID_", snpID = "SNP_", delimiter = ",")

to read the file.

Can you also tell me what the file size is, not the dimensions? What you are basically trying to do is read the entire genotype file into memory. If the file is large, then that will require a lot of memory.

If you have a unix system, you should be able to see the file size with

ls -lh outputOct14.txt

beatrixOngit commented 5 years ago

Oh, of course!! Sorry, such oversight on my part. I didn't realize the system.file command did not have to be part of the run command.

The output file is 5Gb.

I am running the simulator again today - will update or if it runs just fine, I will close the issue

beatrixOngit commented 5 years ago

Hi again,

Changing the path is now throwing an entirely different error. I have tried running on the R interface since my notebook just freezes with no error.

library(PhenotypeSimulator) N<-8000 genotypefile <-'/simulations/genomeTool/outputOct16.txt' readStandardGenotypes(N=N, genotypefile, format='genome', sample sampleID= sample sample.int readStandardGenotypes(N=N, genotypefile, format='genome', sampleID="ID", snpID="SNP", delimiter=",")

caught segfault address 0x7f87b097c3ad, cause 'invalid permissions'

Traceback: 1: data.table::fread(filename, skip = "POP1:", sep = " ", colClasses = "character", header = FALSE) 2: readStandardGenotypes(N = N, genotypefile, format = "genome", sampleID = "ID", snpID = "SNP", delimiter = ",")

Possible actions: 1: abort (with core dump, if enabled) 2: normal R exit 3: exit R without saving workspace 4: exit R saving workspace

sessionInfo() R version 3.5.1 (2018-07-02) Platform: x86_64-conda_cos6-linux-gnu (64-bit) Running under: Ubuntu 18.04.2 LTS

Matrix products: default BLAS/LAPACK: /home/ubuntu/miniconda3/lib/R/lib/libRblas.so

locale: [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8 [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] PhenotypeSimulator_0.3.3

loaded via a namespace (and not attached): [1] Rcpp_1.0.2 magrittr_1.5 BiocGenerics_0.28.0 [4] splines_3.5.1 zlibbioc_1.28.0 getopt_1.20.3 [7] cowplot_1.0.0 tidyselect_0.2.5 munsell_0.5.0 [10] colorspace_1.4-1 lattice_0.20-38 R6_2.4.0 [13] rlang_0.4.0 stringr_1.3.1 plyr_1.8.4 [16] dplyr_0.8.3 tools_3.5.1 parallel_3.5.1 [19] grid_3.5.1 data.table_1.12.2 gtable_0.3.0 [22] R.oo_1.22.0 snpStats_1.32.0 survival_2.44-1.1 [25] lazyeval_0.2.2 assertthat_0.2.1 optparse_1.6.2 [28] tibble_2.1.3 crayon_1.3.4 Matrix_1.2-17 [31] reshape2_1.4.3 purrr_0.3.2 ggplot2_3.2.1 [34] R.utils_2.9.0 glue_1.3.0 stringi_1.2.4 [37] compiler_3.5.1 pillar_1.4.2 R.methodsS3_1.7.1 [40] scales_1.0.0 pkgconfig_2.0.2

And filesize of Genomic file: 5Gb

HannahVMeyer commented 5 years ago

The seqfault seems to be caused by data.table::fread which is called from readStandardGenotypes. Are you sure you are providing the right path? what does ls /simulations/genomeTool/outputOct16.txt on the command line return?

Looking at it, I would suspect it might have to be genotypefile <- '~/simulations/genomeTool/outputOct16.txt' instead of genotypefile <-'/simulations/genomeTool/outputOct16.txt, ie include a tilde for the home directory expansion or add the home directory directly. If this is the case, it should be raised as an issue with data.table. Please let me know

beatrixOngit commented 5 years ago

Actually, it is the full path right from the home directory. I had to truncate it here only for privacy.

HannahVMeyer commented 5 years ago

what happens if you try: data.table::fread(genotypefile, skip = "POP1:", sep = " ", colClasses = "character", header = FALSE) or data.table::fread(genotypefile, sep = " ", colClasses = "character", header = FALSE) ?

beatrixOngit commented 5 years ago

it runs without error, spews out all of the text within the genotypefile (some 1000s of lines)

HannahVMeyer commented 5 years ago

in the same session, if you try to run readStandardGenotypes(N = N, genotypefile, format = "genome", sampleID = "ID_", snpID = "SNP_", delimiter = ","), does that still work?

beatrixOngit commented 5 years ago

No, Dr. Meyer. That on its own still breaks in the same session. Vector allocation error now. I am now running my genotype simulator with a smaller N.

HannahVMeyer commented 5 years ago

Would you be able to share your genotype file with me? I will treat it confidentially. Without it and no other fail example, I fear I won't be able to track this down any further.

beatrixOngit commented 5 years ago

Good morning, Dr. Meyer,

I was able to run the tool with a dataset 1/5th the size. I also figured out if I were to provide my own delimited file instead of the output of GENOME, I can run a slightly larger size. One of the aims of this experiment is to find out the dynamic range of what is a 'too large' dataset.

Unable to share the genotype file (sorry, client policy) but I can give you an example of the simulation I ran to create one -

./genome -pop 8 1000 1000 1000 1000 1000 1000 1000 1000 -N 8000 -c 20 pieces 5000 -len 10000 -rec 0.0001 [this one did not run successfully]

generated using the GENOME tool, you mentioned in your manual.

Thanks a lot for your help,

HannahVMeyer commented 5 years ago

thanks for providing the genome command. I have tried to run it as above, but it returns an error message for me:

genome -pop 8 1000 1000 1000 1000 1000 1000 1000 1000 -N 8000 -c 20 pieces 5000 -len 10000 -rec 0.0001
Parameters do not match! 21 5

Can you double-check your command please. I am using genome-0.2.

beatrixOngit commented 5 years ago

You might need to type it out. When I tried to copy from my notepad to terminal, I get the same error. Typing it out, I guess, ensures that there aren’t any erroneous characters.

On Tue, Oct 22, 2019 at 9:55 AM HannahVMeyer notifications@github.com wrote:

thanks for providing the genome command. I have tried to run it as above, but it returns an error message for me:

genome -pop 8 1000 1000 1000 1000 1000 1000 1000 1000 -N 8000 -c 20 pieces 5000 -len 10000 -rec 0.0001 Parameters do not match! 21 5

Can you double-check your command please. I am using genome-0.2.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/HannahVMeyer/PhenotypeSimulator/issues/18?email_source=notifications&email_token=AIDRNU4MAYY3DRYTYZAMXVTQP4H6ZA5CNFSM4JBNWMJ2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEB6BMUQ#issuecomment-545003090, or unsubscribe https://github.com/notifications/unsubscribe-auth/AIDRNU5JVKFGEE62NRGBST3QP4H6ZANCNFSM4JBNWMJQ .

--

Benu Atri, Ph.D.

HannahVMeyer commented 5 years ago

This does not work, whether typed or copied. Still the same parameter mismatch error. I did double-check now and your command is missing a dash in front of pieces. The correct command should be

 genome -pop 8 1000 1000 1000 1000 1000 1000 1000 1000 -N 8000 -c 20 -pieces 5000 -len 10000 -rec 0.0001

beatrixOngit commented 5 years ago

Oh shoot. Yea, that’s my bad. Sorry about that

On Tue, Oct 22, 2019 at 2:50 PM HannahVMeyer notifications@github.com wrote:

This does not work, whether typed or copied. Still the same parameter mismatch error. I did double-check now and your command is missing a dash in front of pieces. The correct command should be

genome -pop 8 1000 1000 1000 1000 1000 1000 1000 1000 -N 8000 -c 20 -pieces 5000 -len 10000 -rec 0.0001

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/HannahVMeyer/PhenotypeSimulator/issues/18?email_source=notifications&email_token=AIDRNU4OGAJUXOQZFLV5SRLQP5KPVA5CNFSM4JBNWMJ2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEB67CLQ#issuecomment-545124654, or unsubscribe https://github.com/notifications/unsubscribe-auth/AIDRNU5UIS7TU234V5RXRVDQP5KPVANCNFSM4JBNWMJQ .

--

Benu Atri, Ph.D.

HannahVMeyer commented 5 years ago

I've used your command to simulate genotypes for 8000 individuals:

genome -pop 8 1000 1000 1000 1000 1000 1000 1000 1000 \
               -N 8000 \
               -c 20 \
               -pieces 5000 \
               -len 10000 \
               -rec 0.0001 > genome_simulated_genotypes.txt

The resulting file is 19GB large, which is about 4 times as large as suggested above ( 5GB)

ls -lth genotypes_genome.txt
 19G 23 Oct 04:43 genotypes_genome.txt

If you want to simulate based on these genotypes and you choose to read the entire file into memory as you are currently trying to do, you will need at least that much memory, plus enough RAM for the phenotype components to be simulated and intermediate objects that will be generated such as the genetic kinship. I have not run a thorough test and how much memory each of these components require, but my guess would be that you need at least 22G of RAM that R can use, ideally more.

I would recommend converting the output from genome into a delimited file, as you have done and instead of reading the entire genotype file into memory, use PhenotypeSimulator's option of sampling causal genotypes directly from file via getCausalSNPs:

getCausalSNPs {PhenotypeSimulator}

**Draw random SNPs from genotypes.**
**Description**
Draw random SNPs from genotypes provided or external genotype files. When drawing from external genotype files, only lines of randomly chosen SNPs are read, which is recommended for large genotype files. See details for more information. The latter option currently supports file in simple delim-formats (with specified delimiter and optional number of fields to skip) and the bimbam and the oxgen format.

**Usage**
getCausalSNPs(N, NrCausalSNPs = 20, genotypes = NULL, chr = NULL,
  NrSNPsOnChromosome = NULL, NrChrCausal = NULL,
  genoFilePrefix = NULL, genoFileSuffix = NULL, format = "delim",
  delimiter = ",", header = FALSE, skipFields = NULL,
  probabilities = FALSE, sampleID = "ID_", verbose = TRUE)

You do this by providing these parameters:

genoFilePrefix, genoFileSuffix, format, delimiter, header, skipFields, probabilities

Their description is in the function documentation.

If you choose to do this and want to include a kinship matrix, you will have to estimate the kinship a priori as well.

beatrixOngit commented 5 years ago

This makes perfect sense.

I already ran a pilot with the delimited text file option and was able to obtain Gemma appropriate phenotype files. I will also use your suggestion for causal genotypes for my real data.

Also, my supervisor was able to increase the available RAM, I’m guessing a combination of these and I should be able to run things smoothly.

Thank you so much for all your help and patience, Dr. Meyer. Forever obliged!

On Wed, Oct 23, 2019 at 2:03 PM HannahVMeyer notifications@github.com wrote:

I've used your command to simulate genotypes for 8000 individuals:

genome -pop 8 1000 1000 1000 1000 1000 1000 1000 1000 \ -N 8000 \ -c 20 \ -pieces 5000 \ -len 10000 \ -rec 0.0001 > genome_simulated_genotypes.txt

The resulting file is 19GB large, which is about 4 times as large as suggested above ( 5GB https://github.com/HannahVMeyer/PhenotypeSimulator/issues/18#issuecomment-542840537 )

ls -lth genotypes_genome.txt 19G 23 Oct 04:43 genotypes_genome.txt

If you want to simulate based on these genotypes and you choose to read the entire file into memory as you are currently trying to do, you will need at least that much memory, plus enough RAM for the phenotype components to be simulated and intermediate objects that will be generated such as the genetic kinship. I have not run a thorough test and how much memory each of these components require, but my guess would be that you need at least 22G of RAM that R can use, ideally more.

I would recommend converting the output from genome into a delimited file, as you have done and instead of reading the entire genotype file into memory, use PhenotypeSimulator's option of sampling causal genotypes directly from file via getCausalSNPs:

getCausalSNPs {PhenotypeSimulator} Draw random SNPs from genotypes.**DescriptionDraw random SNPs from genotypes provided or external genotype files. When drawing from external genotype files, only lines of randomly chosen SNPs are read, which is recommended for large genotype files. See details for more information. The latter option currently supports file in simple delim-formats (with specified delimiter and optional number of fields to skip) and the bimbam and the oxgen format. Usage** getCausalSNPs(N, NrCausalSNPs = 20, genotypes = NULL, chr = NULL, NrSNPsOnChromosome = NULL, NrChrCausal = NULL, genoFilePrefix = NULL, genoFileSuffix = NULL, format = "delim", delimiter = ",", header = FALSE, skipFields = NULL, probabilities = FALSE, sampleID = "ID_", verbose = TRUE)

You do this by providing these parameters:

genoFilePrefix, genoFileSuffix, format, delimiter, header, skipFields, probabilities

Their description is in the function documentation.

If you choose to do this and want to include a kinship matrix, you will have to estimate the kinship a priori as well.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/HannahVMeyer/PhenotypeSimulator/issues/18?email_source=notifications&email_token=AIDRNU4Q47YSLQVIH54BOQTQQCNYRA5CNFSM4JBNWMJ2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOECCQW6Q#issuecomment-545590138, or unsubscribe https://github.com/notifications/unsubscribe-auth/AIDRNU6TXVYD35P72TXMK5LQQCNYRANCNFSM4JBNWMJQ .

--

Benu Atri, Ph.D.

HannahVMeyer commented 5 years ago

Glad it is working now, happy to help. I will close this for now, feel free to re-open if you run into other, related issues.

HannahVMeyer / PhenotypeSimulator

Phenotype Simulator: RAM requirements #18

Read Genotype file, as output from GENOME

Run read StandardGenotypes method from PhenotypeSimulator Library