Error: database or disk is full

hms1 commented 4 years ago

This error occurs when running the FeatureExtraction::tidyCovariates function during PLE (population-level estimation), but it seems most closely related to Andromeda.

Running these lines results in the error Error: database or disk is full:

https://github.com/OHDSI/FeatureExtraction/blob/bfa7d6fe33a362d1c28598af4fac11249141f628/R/Normalization.R#L52-L54

The study is very large (the zipped Andromeda file is 3.7 GB), but there appears to be plenty of disk space available in andromedaTempFolder (I am using RStudio Server on EC2 with EFS storage for /home, if that makes a difference).

The following, although slightly simplified compared to the original code, reproduces the error on my machine:

library(CohortMethod)

andromeda_data <- "/home/hmorgan/IUD/open_claims2/IUDClaimsStudy/cmOutput/CmData_l1_t1772699_c1772698.zip"

options(andromedaTempFolder = "/home/hmorgan/tmp")

Andromeda::getAndromedaTempDiskSpace() / 1024^3

file.info(andromeda_data)$size / 1024^3

cohortMethodData <- CohortMethod::loadCohortMethodData(andromeda_data)

population <- cohortMethodData$cohorts %>% collect()

covariates <- cohortMethodData$covariates %>% filter(.data$rowId %in% local(population$rowId))

covariateData <- Andromeda::andromeda(
  covariates = covariates, 
  covariateRef = cohortMethodData$covariateRef,
  analysisRef = cohortMethodData$analysisRef)

metaData <- attr(cohortMethodData, "metaData")
metaData$populationSize <- nrow(population)
attr(covariateData, "metaData") <- metaData
class(covariateData) <- "CovariateData"

covariateData$maxValuePerCovariateId <- covariateData$covariates %>% 
  group_by(.data$covariateId) %>% 
  summarise(maxValue = max(.data$covariateValue, na.rm = TRUE))

Some online reports suggest corrupt data, but I've regenerated all artifacts and still have the same problem.

It also works if I make maxValuePerCovariateId smaller, so it seems to be a genuine size issue rather than something else.

covariateData$maxValuePerCovariateId <- covariateData$covariates %>% 
  mutate(rn = row_number()) %>%
  filter(rn < 100000) %>%
  group_by(.data$covariateId) %>% 
  summarise(maxValue = max(.data$covariateValue, na.rm = TRUE))

Sorry if I'm missing a setting, which is very possible.

hms1 commented 4 years ago

I looked into this a bit more, and the problem is that RSQLite uses a default temp_store_directory that has limited space on the server I'm using (I think it's /var/tmp).

This is different to andromedaTempFolder but can be made the same using something like:

RSQLite::dbExecute(
  covariateData, 
  SqlRender::render("PRAGMA temp_store_directory = '@andromedaTempFolder'", 
                    andromedaTempFolder = getOption("andromedaTempFolder"))
)

After running this (either against covariateData or an earlier Andromeda object) the query runs fine.
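To check that the setting actually took effect, the pragma can be read back on the same connection. The following is only a minimal sketch: it uses a plain file-backed RSQLite connection as a hypothetical stand-in for an Andromeda object, and it falls back to R's session temp directory when the andromedaTempFolder option is not set (both are assumptions made for illustration).

```r
library(DBI)

# Hypothetical stand-in for an Andromeda object: a plain file-backed SQLite connection
dbFile <- tempfile(fileext = ".sqlite")
con <- dbConnect(RSQLite::SQLite(), dbFile)

# Assumption: use the Andromeda temp folder if the option is set,
# otherwise fall back to R's session temp directory
tempFolder <- getOption("andromedaTempFolder", default = tempdir())
tempFolder <- normalizePath(tempFolder, winslash = "/")

# Point SQLite's temporary files at that folder
# (this simple quoting assumes the path contains no single quotes)
dbExecute(con, sprintf("PRAGMA temp_store_directory = '%s'", tempFolder))

# Read the pragma back to see which directory SQLite will now use
current <- dbGetQuery(con, "PRAGMA temp_store_directory")
print(current)

dbDisconnect(con)
unlink(dbFile)
```

Note that the SQLite documentation marks temp_store_directory as deprecated, so this is best treated as a workaround rather than a long-term setting.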

I'm not sure if this should be incorporated somewhere in the Andromeda package, and if so where?

schuemie commented 4 years ago

Thanks! Others recently also noticed this behavior, but I didn't know what the problem was. I'll add it to the code.

schuemie commented 4 years ago

Would you mind testing the new version in the develop branch? It can be installed using

remotes::install_github("ohdsi/Andromeda", ref = "develop")

Including @jreps.

hms1 commented 4 years ago

That seems to be working - thank you!

I'll also test in the context of the full study run.

hms1 commented 4 years ago

I've used the develop branch a fair bit now and it seems to be working well 👍.

Slightly off-topic, but I had to make a few changes to the comparativeEffectStudy package I was running to accommodate Andromeda.

It looks like you've got most of it covered in the andromeda branch of SkeletonComparativeEffectStudy, but if more work is needed I'm happy to help.

schuemie commented 4 years ago

Thanks!

Yes, I realized I forgot to merge the andromeda branch into master of SkeletonComparativeEffectStudy. I did so yesterday. I've also released a new version of Hydra, which hopefully soon will find its way into ATLAS.

OHDSI / Andromeda

Error: database or disk is full #1