EHDEN / ETL-UK-Biobank

ETL UK-Biobank
https://ehden.github.io/ETL-UK-Biobank/
13 stars 4 forks source link

STCM loading #246

Closed alepev closed 3 years ago

alepev commented 3 years ago

Closes #122

alepev commented 3 years ago

The STCM loading works (you'll need to set both skip_custom_vocabulary_loading & skip_source_to_concept_map_loading to False in the configuration to see it in action). However:

alepev commented 3 years ago

p.s. I keep having issues with Usagi when the sourceFrequency or mappingType columns are empty. had to fill fake frequencies (1) and remove the mappingType column from race, gender, and gp_clinical_unit mappings.

alepev commented 3 years ago

Update: on hold until new PKs added to STCM in delphyne (see https://github.com/thehyve/delphyne/issues/158)

alepev commented 3 years ago

The STCM loader works as expected, the problem in in the Hesin mapping tables: some rows have the same source_code and target_concept_id, which causes duplicate PKs when loading to STCM. I had a deeper look at the mappings, the codes seem only to differ for the coding_origin field value (see below). If it's important to keep this distinction for the transformations, I suggest we split the mapping files according to values in this field, and generate different STCM files with different vocabularies (for example, HESIN_ADMIMETH_PEDW, HESIN_ADMIMETH_HES, HESIN_ADMIMETH_SMR). That would ensure the STCM PK is unique.

Screenshot 2021-02-25 at 14 52 49
alepev commented 3 years ago

At the moment the codes above would all end up in the STCM table like this, that's why we get duplicates:

Screenshot 2021-02-25 at 14 55 51
alepev commented 3 years ago

@MaximMoinat fixed! Make sure to turn on both custom vocabulary and stcm options (set to False) in the configuration to try it out with an E2E :)

MaximMoinat commented 3 years ago

@alepev I am getting an error when I turn the custom vocabulary loading on (or actually turning the skipping off :p).

sqlalchemy.exc.IntegrityError: (psycopg2.errors.ForeignKeyViolation) insert or update on table "source_to_concept_map" violates foreign key constraint "fk_source_to_concept_map_source_vocabulary_id_vocabulary"
DETAIL:  Key (source_vocabulary_id)=(MY_VOCAB1) is not present in table "vocabulary".
alepev commented 3 years ago

@alepev I am getting an error when I turn the custom vocabulary loading on (or actually turning the skipping off :p).

sqlalchemy.exc.IntegrityError: (psycopg2.errors.ForeignKeyViolation) insert or update on table "source_to_concept_map" violates foreign key constraint "fk_source_to_concept_map_source_vocabulary_id_vocabulary"
DETAIL:  Key (source_vocabulary_id)=(MY_VOCAB1) is not present in table "vocabulary".

I got this error too, the problem is if the vocabularies mentioned in stcm_version.tsv change, unfortunately the STCM_VERSION table itself doesn't get updated.. The only way at the moment is to manually remove the ~offending vocabularies from~ STCM_VERSION in the database (edited: drop the whole table, it's easier). then it will work. I opened an issue in delphyne to solve this: https://github.com/thehyve/delphyne/issues/161

alepev commented 3 years ago

Yay!