cBioPortal / rfc80-team

repository to hold issues for the rfc80 development / deployment team

Profile the clinical data binning aggregation step #4

Closed by alisman 4 months ago

onursumer commented 4 months ago

Profiled the api/column-store/clinical-data-bin-counts/fetch endpoint using the columnar-clinical-data-binner branch and a cloud ClickHouse instance that uses regular views only.

Fetch method being invoked

fetch("http://localhost:8080/api/column-store/clinical-data-bin-counts/fetch?dataBinMethod=STATIC", {
  "headers": {
    "accept": "application/json",
    "accept-language": "en-US,en;q=0.9,tr;q=0.8,fi;q=0.7",
    "content-type": "application/json",
    "sec-ch-ua": "\"Chromium\";v=\"124\", \"Google Chrome\";v=\"124\", \"Not-A.Brand\";v=\"99\"",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "\"macOS\"",
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "same-origin"
  },
  "referrer": "http://localhost:8080/study/summary?id=genie_public",
  "referrerPolicy": "strict-origin-when-cross-origin",
  "body": "{\"attributes\":[{\"attributeId\":\"MUTATION_COUNT\",\"disableLogScale\":false,\"showNA\":true},{\"attributeId\":\"FRACTION_GENOME_ALTERED\",\"disableLogScale\":false,\"showNA\":true},{\"attributeId\":\"AGE_AT_SEQ_REPORT\",\"disableLogScale\":false,\"showNA\":true},{\"attributeId\":\"INT_CONTACT\",\"disableLogScale\":false,\"showNA\":true},{\"attributeId\":\"INT_DOD\",\"disableLogScale\":false,\"showNA\":true},{\"attributeId\":\"YEAR_DEATH\",\"disableLogScale\":false,\"showNA\":true},{\"attributeId\":\"YEAR_CONTACT\",\"disableLogScale\":false,\"showNA\":true}],\"studyViewFilter\":{\"studyIds\":[\"genie_public\"],\"alterationFilter\":{\"copyNumberAlterationEventTypes\":{\"AMP\":true,\"HOMDEL\":true},\"mutationEventTypes\":{\"any\":true},\"structuralVariants\":null,\"includeDriver\":true,\"includeVUS\":true,\"includeUnknownOncogenicity\":true,\"includeUnknownTier\":true,\"includeGermline\":true,\"includeSomatic\":true,\"includeUnknownStatus\":true,\"tiersBooleanMap\":{}}}}",
  "method": "POST",
  "mode": "cors",
  "credentials": "include"
});
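The escaped JSON body in the call above is hard to read, so here is a minimal sketch that builds the same payload from a plain object. The helper name `buildBinCountsBody` is hypothetical (it is not part of cBioPortal); the attribute IDs, study ID, and filter flags are copied from the request above.

```javascript
// Attribute IDs taken from the request body above.
const attributeIds = [
  "MUTATION_COUNT", "FRACTION_GENOME_ALTERED", "AGE_AT_SEQ_REPORT",
  "INT_CONTACT", "INT_DOD", "YEAR_DEATH", "YEAR_CONTACT",
];

// Hypothetical helper (not a cBioPortal API): assembles the payload shape
// seen in the profiled request.
function buildBinCountsBody(studyId, attributeIds) {
  return {
    attributes: attributeIds.map((attributeId) => ({
      attributeId,
      disableLogScale: false,
      showNA: true,
    })),
    studyViewFilter: {
      studyIds: [studyId],
      alterationFilter: {
        copyNumberAlterationEventTypes: { AMP: true, HOMDEL: true },
        mutationEventTypes: { any: true },
        structuralVariants: null,
        includeDriver: true,
        includeVUS: true,
        includeUnknownOncogenicity: true,
        includeUnknownTier: true,
        includeGermline: true,
        includeSomatic: true,
        includeUnknownStatus: true,
        tiersBooleanMap: {},
      },
    },
  };
}

// Serialized string suitable for the "body" option of fetch.
const body = JSON.stringify(buildBinCountsBody("genie_public", attributeIds));
```

Passing `body` as the `body` option of the same POST request (with `content-type: application/json`) should reproduce the profiled call; the browser-specific `sec-*` headers are most likely not needed for that.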

Overview

[Profiler screenshot: clinical-data-bin-counts_2024-05-16]

getFilteredSamples CPU time is about 31%

[Profiler screenshot: getFilteredSamples]

calcNaDataBin CPU time is about 5%

[Profiler screenshot: calcNaData]

countNAs CPU time is about 5% as well

[Profiler screenshot: countNAs]

It should be relatively easy to count NAs with SQL queries instead of fetching the filtered sample data and processing it in Java. Together these three steps account for roughly 40% of CPU time, so moving them into SQL may improve the performance of this endpoint by 30 to 40 percent.
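As a rough sanity check on that estimate, the sketch below adds up the three CPU shares reported above and scales them by an assumed residual cost. The function name `estimatedSaving` and the 25% residual are illustrative assumptions, not measurements; the cost of the replacement SQL queries was not profiled here.

```javascript
// CPU shares from the profile above, as fractions of total CPU time:
// getFilteredSamples, the NA data bin step, and countNAs.
const profiledShares = [0.31, 0.05, 0.05];

// Hypothetical helper: fraction of CPU time saved if each profiled step
// keeps only `residual` of its current cost after the work moves into SQL.
function estimatedSaving(shares, residual) {
  const total = shares.reduce((sum, share) => sum + share, 0);
  return total * (1 - residual);
}

// Upper bound: all three steps disappear from the Java side entirely.
const bestCase = estimatedSaving(profiledShares, 0);

// Assumed case: a quarter of the cost remains (illustrative number only).
const partialCase = estimatedSaving(profiledShares, 0.25);

console.log(`best case: ${(bestCase * 100).toFixed(0)}%`);
console.log(`with 25% residual: ${(partialCase * 100).toFixed(0)}%`);
```

Under these assumptions the saving falls between about 31% and 41%, which is consistent with the 30 to 40 percent range mentioned above.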

inodb commented 4 months ago

Doing https://github.com/cBioPortal/rfc80-team/issues/16 now before doing more profiling