Clarivate-LSPS / tMDataLoader

new Groovy-based tranSMART ETL

Loading Order #34

Closed Kabenla closed 7 years ago

Kabenla commented 8 years ago

I have found that if a dataset contains both Clinical and Expression data, the system loads the ExpressionData first, and when it comes time to load the ClinicalData it throws this error:

WARNING: Rolling back due to: Other study by same path found with different studyId: \Study folder\Study Name\

I am sure there is nothing wrong with the ClinicalData because when I install the ClinicalData alone it loads just fine.

In desperation I removed the entire dataset, disabled the ExpressionData, ran the ETL on the ClinicalData, and then re-enabled the ExpressionData. This worked just fine and I now have both components of the study in TranSMART.

Can we make it so that if a study contains both ExpressionData and ClinicalData, the software loads the ClinicalData first?

mirasrael commented 8 years ago

@Kabenla Can you please provide a minimal study that fails?

Kabenla commented 8 years ago

Hi Alexander, this might prove problematic since I am working with a client's (most likely confidential) data. I will see if I can get something open source and formatted for the tMDataLoader environment. The directory format is somewhat like this:

ETL_Study_Folder
--Kstudy
----ClinicalData
--------Kstudy_Mapping_file.txt
--------Kstudy.txt
----ExpressionData
--------Kstudy_Gene_Expression_Data_L.txt
--------GPL570.txt
--------Kstudy_Subject_Sample_Mapping_File.txt

I find that tMDataLoader will always load the ExpressionData first and when it goes to the ClinicalData it fails with the error.
However, if I first rename ExpressionData to _DISABLED_ExpressionData and run the ETL process, then rename the folder back to ExpressionData, change _DONE_Kstudy back to Kstudy, and rerun the ETL process, everything works.
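
The manual steps above could be scripted roughly as follows. This is only a sketch: `load_clinical_first` and `run_etl` are hypothetical names, `run_etl` stands in for however the loader is actually started, and the `_DONE_` rename of the study folder is assumed from the behaviour described here. The demo uses an empty temporary folder and a fake loader call.

```python
import tempfile
from pathlib import Path


def load_clinical_first(study_dir, run_etl):
    """Hide ExpressionData for a first pass, then restore it and run again.

    `run_etl` is a hypothetical zero-argument callable that starts the loader.
    """
    study_dir = Path(study_dir)
    expression = study_dir / "ExpressionData"
    disabled = study_dir / "_DISABLED_ExpressionData"

    # Pass 1: only ClinicalData is visible to the loader.
    expression.rename(disabled)
    run_etl()

    # Assumption: the loader may have renamed the study to _DONE_<name>.
    done = study_dir.parent / f"_DONE_{study_dir.name}"
    current = done if done.exists() else study_dir

    # Re-enable ExpressionData and restore the original study folder name.
    (current / "_DISABLED_ExpressionData").rename(current / "ExpressionData")
    if current != study_dir:
        current.rename(study_dir)

    # Pass 2: ExpressionData is loaded.
    run_etl()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        study = Path(tmp) / "Kstudy"
        (study / "ClinicalData").mkdir(parents=True)
        (study / "ExpressionData").mkdir()
        # Fake loader: just print which folders it would see on each pass.
        load_clinical_first(
            study, lambda: print(sorted(p.name for p in study.iterdir()))
        )
```

This only automates the renaming; it does not explain why the second component fails when the folders are loaded in the default order.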

Not sure how much more info I can give. Do you know of any publicly available Studies that are formatted for your Loader?

xuatiknc commented 8 years ago

@Kabenla you may find a few publicly available studies here: https://github.com/ThomsonReuters-LSPS/tMDataSamples

The load order does not matter. If you are getting this error after loading expression data that means the expression data was not loaded correctly. Changing the load order will just hide this problem. Did you notice anything suspicious in the log file? Also I suggest updating to the latest version: some error handling-related changes were committed today. If that does not help you may check in the database:

select * from cz_job_error where job_id= (select max(job_id) from cz_job_error);

(you need to connect as user tm_dataloader (for PostgreSQL) or tm_cz (for Oracle)). Finally I would check if the platform name is specified correctly in all places.

If none of the above helps please try creating a small dataset with fake data where this problem can be reproduced and send it to us.

P.S. Regarding your previous question about the log file: I would highly recommend always saving the ETL output to a file like this:

java -jar tm_etl.jar ... 2>&1 | tee -a load.log

Kabenla commented 8 years ago

I think I found the problem. The STUDY_IDs in the ClinicalData file were probably too long and contained characters that the loader must have truncated or modified. Is there a maximum length or required format for the STUDY_ID column? In any case, modifying the STUDY_IDs did the trick.

jeremy-singer commented 8 years ago

In Oracle, the tables have a variety of lengths for study_id. The smallest study_id column in the loading tables is 25 characters, so I think you are likely to see these truncation problems with study_id values longer than that.

jeremy-singer commented 8 years ago

tranSMART on PostgreSQL also has this 25-character limitation for study_id.
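
Given the 25-character limit mentioned above, a quick pre-load check of the clinical data file might look like the following. This is a minimal sketch, assuming a tab-delimited file with a header row and a column literally named STUDY_ID; the function name, the sample rows, and the column layout are illustrative and not part of tMDataLoader.

```python
import csv
import io

# Limit taken from this thread; verify it against your own schema.
MAX_STUDY_ID_LEN = 25


def find_long_study_ids(lines, column="STUDY_ID", limit=MAX_STUDY_ID_LEN):
    """Return the sorted distinct study IDs longer than `limit` characters."""
    reader = csv.DictReader(lines, delimiter="\t")
    too_long = set()
    for row in reader:
        value = (row.get(column) or "").strip()
        if len(value) > limit:
            too_long.add(value)
    return sorted(too_long)


if __name__ == "__main__":
    # Made-up sample data standing in for a clinical data file.
    sample = io.StringIO(
        "STUDY_ID\tSUBJ_ID\tAGE\n"
        "KSTUDY\tS1\t34\n"
        "A_VERY_LONG_STUDY_IDENTIFIER_2016\tS2\t51\n"
    )
    print(find_long_study_ids(sample))
```

For a real file, pass an open file handle instead of the `io.StringIO` sample, for example `open("Kstudy.txt", newline="")`.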