OHDSI / StudyProtocols

Repository of OHDSI Collaborative Research Protocols

Error in Data frame aggregation #12

Open shinseojeong opened 7 years ago

shinseojeong commented 7 years ago

Hello, I'm Seojeong, a graduate student at Ajou University in Korea. I ran into an issue while running the LargeScalePopEst code.

I was running the "injectSignals" step on its own to confirm that each step works correctly, and the error message shown below appeared. (Note that the "createCohorts" and "fetchAllDataFromServer" steps worked well, although one warning message popped up during the "createCohorts" step.)

↓ The warning message during the "createCohorts" step. image

↓ The Error message during "injectSignals" step. image

I found that the lengths of [ unique(data$outcomeId) ] and [ negativeControlIds ] differ from each other.

image

image

So I would like to ask whether the code of the LargeScalePopEst protocol can be changed.

↓ The "Create outcomes file" code. image

I think it would be better if the code tolerated or skipped an "outcomeId" that is absent from the data. Could you modify the code for us?

Please review this issue.

Thank you!

schuemie commented 7 years ago

Hi @shinseojeong, thank you for the detailed feedback!

You can ignore the warning, it is not important and should be fixed in the latest version.

I hope I have fixed the error, by allowing outcome counts to be zero: https://github.com/OHDSI/StudyProtocols/commit/fc018335e659399be97f58c380861e25defeffb0
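The length mismatch reported above typically arises because aggregating by outcomeId only returns the IDs that actually occur in the data. The following is a minimal sketch of one way to let counts be zero; it is not the code from the linked commit. The names data and negativeControlIds come from the report, while the sample values and the outcomeCount column are hypothetical.

```r
# Hypothetical sample: events exist for only two of the three negative controls
data <- data.frame(outcomeId = c(101, 101, 103), rowId = c(1, 2, 3))
negativeControlIds <- c(101, 102, 103)

# aggregate() returns one row per outcomeId that occurs in the data,
# so outcome 102 is missing from this result
counts <- aggregate(rowId ~ outcomeId, data = data, FUN = length)
names(counts)[names(counts) == "rowId"] <- "outcomeCount"

# Left-join against the full ID list so absent outcomes are kept,
# then replace the resulting NA counts with zero
allCounts <- merge(data.frame(outcomeId = negativeControlIds), counts, all.x = TRUE)
allCounts$outcomeCount[is.na(allCounts$outcomeCount)] <- 0

print(allCounts)
```

With this approach the result always has one row per entry in negativeControlIds, so later steps that assume equal lengths do not fail when an outcome has no events.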

Could you reinstall the package and try again? You can of course skip the createCohorts and fetchAllDataFromServer again as you did here.

shinseojeong commented 7 years ago

Thanks a lot, @schuemie ! I'll try again and let you know the results.

shinseojeong commented 7 years ago

Hi @schuemie, I tried the code you fixed, but it stayed stuck at "50%" for about 4~5 days.

image

So I stopped the session and restarted the code without removing the directories and files created during the first attempt. This time, however, it stayed stuck at "0%" for about 2 days.

image

Could you please take a look at the screenshots above? Thanks a lot.

schuemie commented 7 years ago

Hi @shinseojeong,

Can you tell me what database platform you're using? Microsoft SQL Server?

shinseojeong commented 7 years ago

Yes, I'm using Microsoft SQL Server database platform.

schuemie commented 7 years ago

I'm not sure why it is so slow. The two steps it is performing (when the progressbar is visible) are

  1. Deleting any outcomes injected in a previous (aborted) run
  2. Copying the baseline outcomes + adding the injected outcomes (from a temp table).

The first time you ran it step 1 of course didn't take any time, but the second step shouldn't have taken 4-5 days. It sounds like the database server is having problems, for example because it doesn't have enough temp space. Could you discuss with your database administrator?

shinseojeong commented 7 years ago

OK, I'll discuss this with our database administrator. Thank you for your thoughts on the problem.

chrisknoll commented 7 years ago

Also you can execute the command 'sp_who2' (if you have the correct privileges) and you should see the active sessions connected to the database. The important column to look at is the 'BlkBy' column to see if there seems to be some kind of resource block happening that is holding up the copy. If it's just copying from a temp table into another table, then I can't imagine that taking days. So, as @schuemie suggested: check with your administrator as the query is executing to see if anything looks out of the ordinary.
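If the sp_who2 output is brought into R (for example exported by the administrator), the blocked sessions can be picked out with a small filter. This is only a sketch: the sample rows are invented, and it assumes that unblocked sessions carry a non-numeric placeholder such as "  ." in the BlkBy column, while blocked sessions show the blocking SPID.

```r
# Return the sessions whose BlkBy column holds a blocking session ID.
# Assumption: unblocked sessions have a non-numeric placeholder in BlkBy.
findBlockedSessions <- function(sessions) {
  blkBy <- trimws(as.character(sessions$BlkBy))
  isBlocked <- grepl("^[0-9]+$", blkBy)
  sessions[isBlocked, c("SPID", "Status", "BlkBy", "Command")]
}

# Hypothetical sp_who2 output reduced to a few columns
sessions <- data.frame(SPID = c(51, 52, 53),
                       Status = c("RUNNABLE", "SUSPENDED", "sleeping"),
                       BlkBy = c("  .", "51", "  ."),
                       Command = c("DELETE", "INSERT", "AWAITING COMMAND"),
                       stringsAsFactors = FALSE)

blocked <- findBlockedSessions(sessions)
print(blocked)
```

In this invented example, session 52 is waiting on session 51, which is the kind of pattern that would explain a copy step that never progresses.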

shinseojeong commented 7 years ago

Hi @schuemie, I was able to finish the "injectSignals" step when I reran the code after deleting the outcomes injected in a previous (aborted) run.

image

So I executed the next step, "generateAllCohortMethodDataObjects", but during "Constructing cohortMethodData objects" I got the error message below.

image

I checked ".vimplemented" and found that the implementation state of 'character' is 'FALSE'.

image

I think the code should be changed so that characters are treated as factors. I referred to this page: 'http://stackoverflow.com/questions/21911721/character-vectors-as-ff-objects-in-r'.

image

Could you review the situation and fix the code?

Thank you!

schuemie commented 7 years ago

I don't think the problem is that characters aren't supported; the real question is why characters are encountered in the first place. (Remember: the code has executed without problems in at least one environment.)

The error message unfortunately isn't very helpful, so I'll have to ask you to debug. Could you first type

debug(constructCohortMethodDataObject)

and then rerun generateAllCohortMethodDataObjects? That should allow you to step through the code until the error occurs. It would be good to know the exact line where things go wrong.
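As a complement to stepping through the code, it can help to check which columns of an intermediate table are stored as text, since those are the ones that cannot be converted to ff. The sketch below uses a hypothetical helper, findCharacterColumns, on an invented in-memory data frame; whether any column is affected in this particular environment is not known.

```r
# List the columns of an in-memory data frame that are stored as character
findCharacterColumns <- function(df) {
  names(df)[vapply(df, is.character, logical(1))]
}

# Invented example: subjectId was read as text instead of as a number
cohorts <- data.frame(rowId = c(1, 2, 3),
                      subjectId = c("1001", "1002", "1003"),
                      stringsAsFactors = FALSE)

characterColumns <- findCharacterColumns(cohorts)
print(characterColumns)
```

Any column name this returns is a candidate for the place where text enters the data unexpectedly.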

shinseojeong commented 7 years ago

I tried debugging, and stepped through the code until the error below occurred.

image

schuemie commented 7 years ago

Right, that doesn't really help us...

Next plan: could you run the code below? It will create two new functions that are basically the functions in the package, but then with lots of debugging output. After running that code, you can just call generateAllCohortMethodDataObjectsDebug(workFolder) to run the function, and copy-paste any output to me.

constructCohortMethodDataObjectDebug <- function(targetId,
                                            comparatorId,
                                            targetConceptId,
                                            comparatorConceptId,
                                            workFolder) {
    # Subsetting cohorts
    ffbase::load.ffdf(dir = file.path(workFolder, "allCohorts"))
    ff::open.ffdf(cohorts, readonly = TRUE)
    writeLines(paste0("nrow(cohorts) = ", nrow(cohorts)))
    idx <- cohorts$cohortDefinitionId == targetId | cohorts$cohortDefinitionId == comparatorId
    cohorts <- ff::as.ram(cohorts[ffbase::ffwhich(idx, idx == TRUE), ])
    writeLines(paste0("After filtering: nrow(cohorts) = ", nrow(cohorts)))
    cohorts$treatment <- 0
    cohorts$treatment[cohorts$cohortDefinitionId == targetId] <- 1
    cohorts$cohortDefinitionId <- NULL
    treatedPersons <- length(unique(cohorts$subjectId[cohorts$treatment == 1]))
    comparatorPersons <- length(unique(cohorts$subjectId[cohorts$treatment == 0]))
    treatedExposures <- length(cohorts$subjectId[cohorts$treatment == 1])
    comparatorExposures <- length(cohorts$subjectId[cohorts$treatment == 0])
    counts <- data.frame(description = "Starting cohorts",
                         treatedPersons = treatedPersons,
                         comparatorPersons = comparatorPersons,
                         treatedExposures = treatedExposures,
                         comparatorExposures = comparatorExposures)
    metaData <- list(targetId = targetId,
                     comparatorId = comparatorId,
                     attrition = counts)
    attr(cohorts, "metaData") <- metaData

    # Subsetting outcomes
    ffbase::load.ffdf(dir = file.path(workFolder, "allOutcomes"))
    ff::open.ffdf(outcomes, readonly = TRUE)
    writeLines(paste0("nrow(outcomes) = ", nrow(outcomes)))
    idx <- !is.na(ffbase::ffmatch(outcomes$rowId, ff::as.ff(cohorts$rowId)))
    if (ffbase::any.ff(idx)){
        outcomes <- ff::as.ram(outcomes[ffbase::ffwhich(idx, idx == TRUE), ])
    } else {
        outcomes <- as.data.frame(outcomes[1, ])
        outcomes <- outcomes[FALSE, ]  # keep the columns, drop all rows
    }
    # Add injected outcomes
    ffbase::load.ffdf(dir = file.path(workFolder, "injectedOutcomes"))
    ff::open.ffdf(injectedOutcomes, readonly = TRUE)
    writeLines(paste0("nrow(injectedOutcomes) = ", nrow(injectedOutcomes)))
    injectionSummary <- read.csv(file.path(workFolder, "signalInjectionSummary.csv"))
    injectionSummary <- injectionSummary[injectionSummary$exposureId %in% c(targetConceptId, comparatorConceptId), ]
    idx1 <- ffbase::'%in%'(injectedOutcomes$subjectId, cohorts$subjectId)
    idx2 <- ffbase::'%in%'(injectedOutcomes$cohortDefinitionId, injectionSummary$newOutcomeId)
    idx <- idx1 & idx2
    if (ffbase::any.ff(idx)){
        injectedOutcomes <- ff::as.ram(injectedOutcomes[idx, ])
        colnames(injectedOutcomes)[colnames(injectedOutcomes) == "cohortStartDate"] <- "eventDate"
        colnames(injectedOutcomes)[colnames(injectedOutcomes) == "cohortDefinitionId"] <- "outcomeId"
        injectedOutcomes <- merge(cohorts[, c("rowId", "subjectId", "cohortStartDate")], injectedOutcomes[, c("subjectId", "outcomeId", "eventDate")])
        injectedOutcomes$daysToEvent <- injectedOutcomes$eventDate - injectedOutcomes$cohortStartDate
        #any(injectedOutcomes$daysToEvent < 0)
        #min(outcomes$daysToEvent[outcomes$outcomeId == 73008])
        outcomes <- rbind(outcomes, injectedOutcomes[, c("rowId", "outcomeId", "daysToEvent")])
    }
    metaData <- data.frame(outcomeIds = unique(outcomes$outcomeId))
    attr(outcomes, "metaData") <- metaData

    # Subsetting covariates
    covariateData <- FeatureExtraction::loadCovariateData(file.path(workFolder, "allCovariates"))
    # Collapse the column names so they are printed on a single line
    writeLines(paste0("names(cohorts) = ", paste(names(cohorts), collapse = ", ")))
    writeLines(paste0("ff::vmode(cohorts$rowId) = ", ff::vmode(cohorts$rowId)))
    idx <- is.na(ffbase::ffmatch(covariateData$covariates$rowId, ff::as.ff(cohorts$rowId)))
    covariates <- covariateData$covariates[ffbase::ffwhich(idx, idx == FALSE), ]

    # Filtering covariates
    filterConcepts <- readRDS(file.path(workFolder, "filterConceps.rds"))
    filterConcepts <- filterConcepts[filterConcepts$exposureId %in% c(targetId, comparatorId),]
    filterConceptIds <- unique(filterConcepts$filterConceptId)
    writeLines(paste0("length(filterConceptIds) = ", length(filterConceptIds)))
    writeLines(paste0("class(filterConceptIds) = ", class(filterConceptIds)))
    idx <- is.na(ffbase::ffmatch(covariateData$covariateRef$conceptId, ff::as.ff(filterConceptIds)))
    covariateRef <- covariateData$covariateRef[ffbase::ffwhich(idx, idx == TRUE), ]
    filterCovariateIds <- covariateData$covariateRef$covariateId[ffbase::ffwhich(idx, idx == FALSE), ]
    idx <- is.na(ffbase::ffmatch(covariates$covariateId, filterCovariateIds))
    covariates <- covariates[ffbase::ffwhich(idx, idx == TRUE), ]

    result <- list(cohorts = cohorts,
                   outcomes = outcomes,
                   covariates = covariates,
                   covariateRef = covariateRef,
                   metaData = covariateData$metaData)

    class(result) <- "cohortMethodData"
    return(result)
}

generateAllCohortMethodDataObjectsDebug <- function(workFolder) {
    writeLines("Constructing cohortMethodData objects")
    start <- Sys.time()
    exposureSummary <- read.csv(file.path(workFolder, "exposureSummaryFilteredBySize.csv"))
    # pb <- txtProgressBar(style = 3)
    for (i in seq_len(nrow(exposureSummary))) {

        targetId <- exposureSummary$tprimeCohortDefinitionId[i]
        comparatorId <- exposureSummary$cprimeCohortDefinitionId[i]
        targetConceptId <- exposureSummary$tCohortDefinitionId[i]
        comparatorConceptId <- exposureSummary$cCohortDefinitionId[i]
        folderName <- file.path(workFolder, "cmOutput", paste0("CmData_l1_t", targetId, "_c", comparatorId))
        writeLines(paste0("Generating folder ", folderName))
        # if (!file.exists(folderName)) {
            cmData <- constructCohortMethodDataObjectDebug(targetId = targetId,
                                                      comparatorId = comparatorId,
                                                      targetConceptId = targetConceptId,
                                                      comparatorConceptId = comparatorConceptId,
                                                      workFolder = workFolder)
            # CohortMethod::saveCohortMethodData(cmData, folderName)
        # }
        # setTxtProgressBar(pb, i/nrow(exposureSummary))
    }
    # close(pb)
    delta <- Sys.time() - start
    writeLines(paste("Generating all CohortMethodData objects took", signif(delta, 3), attr(delta, "units")))
}
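
To make the debugging output easier to copy and paste, the printed lines can also be captured into a text file. The sketch below shows the idea with base R's capture.output on a toy function; toyDebugStep and the file location are hypothetical stand-ins for the debug functions above.

```r
# Toy stand-in for a debug function that prints diagnostics with writeLines
toyDebugStep <- function(n) {
  writeLines(paste0("nrow(cohorts) = ", n))
  invisible(NULL)
}

# Capture everything written to standard output and store it in a file
logFile <- tempfile(fileext = ".txt")
output <- capture.output(toyDebugStep(42))
writeLines(output, logFile)

# The file now holds the diagnostics, ready to be pasted into a reply
print(readLines(logFile))
```

Note that capture.output collects standard output by default, so an error message raised by the function still has to be copied from the console.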