cBioPortal / cbioportal

cBioPortal for Cancer Genomics
https://cbioportal.org
GNU Affero General Public License v3.0

Generic assay data takes too long to import #8907

Closed AndrewZoldy closed 3 years ago

AndrewZoldy commented 3 years ago

Hello, in my project the team has two files for the cBioPortal application which we are loading as "Generic Assay" data. Both files contain expression data measured with mass spectrometry on site level, each for one site per file. One of them contains 18767 rows for 109 samples (plus a column with the gene and a column with the site name) and the second one has 101266 rows for 109 samples (plus gene and site columns). Processing the first one took around 5 hours, and the second around 25 hours. Meanwhile the proteomics data, which has a pretty similar data structure, was processed in 46 seconds (10275 rows, 109 samples).

We did a little investigation into the cBioPortal code, and it looks like for generic assay data it goes to the database for each row separately (https://github.com/cBioPortal/cbioportal/blob/master/core/src/main/java/org/mskcc/cbio/portal/scripts/ImportGenericAssayEntity.java#L188). If I'm wrong, could you please explain what the reason could be? Is there anything we could change in our meta files to fix this?

The meta files look as follows (same structure for both):

cancer_study_identifier: <NAME>
genetic_alteration_type: GENERIC_ASSAY
generic_assay_type: PHOSPHOSITE_QUANTIFICATION
datatype: LIMIT-VALUE
stable_id: <OUR_ID>
profile_name: <OUR_NAME>
profile_description: <OUR_DESCRIPTION>
data_filename: <OUR_DATA_FILE.TXT>
show_profile_in_analysis_tab: true
pivot_threshold_value: 0
value_sort_order: ASC
patient_level: false
generic_entity_meta_properties: GENE_SYMBOL,PHOSPHOSITE

Best, Andrey
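
A rough back-of-the-envelope check of the timings reported above (the durations are approximate, as stated in the comment): both generic assay files work out to a little under one second per row, while the proteomics file takes a few milliseconds per row. A roughly constant per-row cost is consistent with each row triggering its own database round trip, though the numbers alone do not prove that. The dictionary keys below are labels made up for this sketch.

```python
# Timings as reported in the comment above: (rows, seconds).
# The generic assay durations are approximate ("around 5 hours", "around 25 hours").
reported = {
    "generic assay, file 1": (18767, 5 * 3600),
    "generic assay, file 2": (101266, 25 * 3600),
    "proteomics": (10275, 46),
}


def seconds_per_row(rows, seconds):
    """Average import time per data row."""
    return seconds / rows


per_row = {name: seconds_per_row(*value) for name, value in reported.items()}

for name, cost in per_row.items():
    print(f"{name}: {cost * 1000:.1f} ms per row")
```

The two generic assay files land at a similar per-row cost despite a fivefold difference in size, which suggests the import time scales linearly with the row count.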

dippindots commented 3 years ago

Hi Andrey, thanks for bringing this issue up. You are right: currently the meta is loaded line by line. It was not a problem before, but for data with 101266 rows it could be. I am sure we can make some improvements to data importing. I will take a look in early October.

Just want to make sure that only the data import is slow, right? Not the data loading on the website.

Gaofei
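
To illustrate the difference between line-by-line loading and a batched import, here is a minimal sketch. It is not cBioPortal's importer and not the change that was later released: SQLite stands in for the real database, and the `entity` table, its columns, and all function names are hypothetical. It only shows the general pattern of one statement and commit per row versus one `executemany` call and commit per batch.

```python
import sqlite3


def make_rows(n):
    # Hypothetical stand-in for generic assay entities: (stable_id, gene, site).
    return [(f"ENTITY_{i}", f"GENE{i % 50}", f"S{i}") for i in range(n)]


def create_table(conn):
    # Hypothetical schema, not the cBioPortal one.
    conn.execute(
        "CREATE TABLE entity (stable_id TEXT PRIMARY KEY, gene TEXT, site TEXT)"
    )


def import_row_by_row(conn, rows):
    # One INSERT and one commit per row: a database round trip for every line.
    for row in rows:
        conn.execute("INSERT INTO entity VALUES (?, ?, ?)", row)
        conn.commit()


def import_batched(conn, rows, batch_size=1000):
    # One executemany call and one commit per batch of rows.
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        conn.executemany("INSERT INTO entity VALUES (?, ?, ?)", batch)
        conn.commit()


def count(conn):
    return conn.execute("SELECT COUNT(*) FROM entity").fetchone()[0]


rows = make_rows(2500)

slow = sqlite3.connect(":memory:")
create_table(slow)
import_row_by_row(slow, rows)

fast = sqlite3.connect(":memory:")
create_table(fast)
import_batched(fast, rows)

print(count(slow), count(fast))
```

Both paths end up with the same table contents; the batched path just makes far fewer round trips. With an in-memory SQLite database the timing gap is small, so this sketch demonstrates the pattern rather than the speedup one might see against a networked database server.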

AndrewZoldy commented 3 years ago

Hello Gaofei, yes, the issue is only in the importing; the website works well. Thank you for your answer!

dippindots commented 3 years ago

Good to know. I will look into this later and update the ticket.

dippindots commented 3 years ago

Hi @AndrewZoldy, one performance improvement has been made. You can use a release version after 3.7.15 to get a faster import speed.