Open sync-by-unito[bot] opened 4 months ago
➤ Anastasia Kytölä commented:
It looks like the new data is in a different format compared to the previous one (example: gs://r12-data/colocalization/release/formatted_v1/fg_r12_ukbb_ppp.txt.gz), meaning that I cannot use the established workflow for importing the data as-is. Currently, the data has to follow the finngen common data model format in order to be imported into the colocalization/causal_variant tables in the SQL instance by the pheweb colocalization cli, so that it can then be used by the PheWeb backend. Mitja Kurki, please advise what should be done. To me, re-formatting the data sounds like the easiest approach compared to large updates of the finngen common data model and the pheweb backend, but maybe there is some other solution that I don't see.
This is the error I get when using the pheweb colocalization cli for data import:
AssertionError: header expected '['source1', 'source2', 'pheno1', 'pheno1_description', 'pheno2', 'pheno2_description', 'quant1', 'quant2', 'tissue1', 'tissue2', 'locus_id1', 'locus_id2', 'chrom', 'start', 'stop', 'clpp', 'clpa', 'vars', 'len_cs1', 'len_cs2', 'len_inter', 'vars1_info', 'vars2_info', 'source2_displayname', 'beta1', 'beta2', 'pval1', 'pval2']'
got '['dataset1', 'dataset2', 'trait1', 'trait2', 'region1', 'region2', 'cs1', 'cs2', 'nsnps', 'hit1', 'hit2', 'PP.H0.abf', 'PP.H1.abf', 'PP.H2.abf', 'PP.H3.abf', 'PP.H4.abf', 'low_purity1', 'low_purity2', 'nsnps1', 'nsnps2', 'cs1_log10bf', 'cs2_log10bf', 'csj1_log10bf', 'csj2_log10bf', 'clpp', 'clpa', 'cs1_size', 'cs2_size', 'cs_overlap', 'topInOverlap', 'hit1_info', 'hit2_info', 'colocRes']'
It looks like some of these columns can simply be renamed or slightly reformatted, but some of them seem to be missing completely (and of course, even after modifying the file accordingly, I will still have to update the finngen common data model / pheweb backend in order to include the new PP.H4.abf column).
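For what it's worth, a minimal sketch of the rename-only part (the column pairs below are my tentative guesses from eyeballing the two headers, not a confirmed mapping; the file name is taken from later in this thread):

```python
import pandas as pd

# Tentative guesses for columns that look like plain renames; everything else
# (dataset*, region*, hit*_info, ...) needs real reformatting, not just a rename.
RENAME_MAP = {
    "trait1": "pheno1",
    "trait2": "pheno2",
    "cs1_size": "len_cs1",
    "cs2_size": "len_cs2",
}

df = pd.read_csv("colocQC.tsv.gz", sep="\t")
df = df.rename(columns=RENAME_MAP)
```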
➤ Mitja Kurki commented:
Anastasia Kytölä yes, reformatting would probably be a good idea, plus adding a column. Could you give me example rows of both and suggest mappings, and I will fill in the rest.
➤ Anastasia Kytölä commented:
Mitja Kurki I used the documentation from the previous release (gs://finngen-production-library-green/finngen_R12/finngen_R12_analysis_data/colocalization/data_dictionary.txt) and marked the columns that seem to map and those that don't. I think all of the columns from the previous format are required by the finngen common data model, plus an additional source2_displayname that we use for prettier source name representation. There are a lot of extra columns in the new data; they might still be useful, and it would be easier to map them if there were similar documentation for the new data as well.
➤ Mitja Kurki commented:
Anastasia Kytölä the new format is described here: https://finngen.gitbook.io/finngen-handbook/working-in-the-sandbox/running-analyses-in-sandbox/how-to-run-colocalization-pipeline
➤ Mitja Kurki commented:
quant and tissue need to be parsed from the dataset2 field, e.g. Alasoo_2018--macrophage_IFNg--exon--eQTL_Catalogue: datasource = Alasoo_2018--eQTLcatalog, tissue2 = macrophage-ifn-g, quant2 = exon.
➤ Mitja Kurki commented:
there can also be fewer fields separated by the double dash, e.g. fg endpoints have 2 fields: source + type
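A rough sketch of that parsing, assuming the field order source--tissue--quant--catalogue for four fields and source--type for two (the exact format of the combined source string is my assumption):

```python
# Sketch: split a dataset field on the double-dash separator.
# Assumes 4 fields = source--tissue--quant--catalogue and 2 fields = source--type;
# anything else is passed through so it can be reviewed manually.
def parse_dataset(dataset: str) -> dict:
    parts = dataset.split("--")
    if len(parts) == 4:
        source, tissue, quant, catalogue = parts
        return {"source": f"{source}--{catalogue}", "tissue": tissue, "quant": quant}
    if len(parts) == 2:
        source, type_ = parts
        return {"source": f"{source}--{type_}", "tissue": "", "quant": ""}
    return {"source": dataset, "tissue": "", "quant": ""}

print(parse_dataset("Alasoo_2018--macrophage_IFNg--exon--eQTL_Catalogue"))
# -> {'source': 'Alasoo_2018--eQTL_Catalogue', 'tissue': 'macrophage_IFNg', 'quant': 'exon'}
```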
➤ Mitja Kurki commented:
Can you enumerate all possible combos of dataset fields for checking?
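Something like this could enumerate the combinations (sketch only; the file name is from later in this thread, otherwise illustrative):

```python
import collections
import pandas as pd

df = pd.read_csv("colocQC.tsv.gz", sep="\t")

# Count how many double-dash-separated fields each dataset value has,
# so every structural variant is visible before fixing the mapping.
combos = collections.Counter(
    (len(str(d1).split("--")), len(str(d2).split("--")))
    for d1, d2 in zip(df["dataset1"], df["dataset2"])
)
for combo, count in combos.most_common():
    print(combo, count)
```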
➤ Mitja Kurki commented:
start/stop need to be inferred from the intersection of the two regions (region1, region2), so just report the region that overlaps between them
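A minimal sketch of that intersection, assuming the regions are strings of the form chr1:12345-67890 (the exact region format is my assumption):

```python
# Sketch: intersect two region strings of the assumed form "chr1:12345-67890".
def region_intersection(region1: str, region2: str):
    def parse(region: str):
        chrom, coords = region.split(":")
        start, stop = coords.split("-")
        return chrom, int(start), int(stop)

    chrom1, start1, stop1 = parse(region1)
    chrom2, start2, stop2 = parse(region2)
    if chrom1 != chrom2:
        return None  # different chromosomes cannot overlap
    start, stop = max(start1, start2), min(stop1, stop2)
    return (chrom1, start, stop) if start <= stop else None

print(region_intersection("chr1:100-500", "chr1:300-900"))  # ('chr1', 300, 500)
```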
➤ Mitja Kurki commented:
vars info: hitN;NA;<2 fields from hitN_info>
➤ Mitja Kurki commented:
vars: leave empty
➤ Mitja Kurki commented:
see above Anastasia Kytölä. We have team leader meetings almost all day, but there is a lunch break 11.30-12.30. We are in Biomedicum 1 (3rd floor), meeting room 5-6. We could have a chat outside the meetings at 11.30?
➤ Anastasia Kytölä commented:
OK Mitja Kurki, thanks for the specifications
➤ Anastasia Kytölä commented:
Region1 & region2, from which we have to deduce chrom, start, stop, have chromosome ranges chr1-23 plus chrX, which confuses me a bit. Is there a reason why some of the entries have chrX and some chr23? Otherwise, I will rename chrX -> chr23.
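The renaming itself would be trivial, something like the following (assuming the chromosome appears as a chrN prefix in the region strings):

```python
# Sketch: normalize chrX to chr23 so chromosome naming is consistent across entries.
def normalize_chrom(chrom: str) -> str:
    return "chr23" if chrom == "chrX" else chrom
```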
➤ Anastasia Kytölä commented:
What do these suffixes in the endpoint name mean: EXMORE, EXALLC, ALLW (needed for constructing the pheno description column)? I saw something like this in the old data:
➤ Anastasia Kytölä commented:
I guess "ALLW" stands for "all women as controls"?
➤ Anastasia Kytölä commented:
It seems that we don't have enough values for constructing the varsN_info columns: in the previous data we have VAR_ID,PIP,BETA values listing all variants from the credible sets, separated by semicolons. In the new data we can use the columns hitN and hitN_info to get VAR_ID and BETA, but we don't have the PIP; we have the P-VALUE of the top variant instead. The varsN_info columns are used to generate the causal_variant table, which is shown on a couple of pages in PheWeb. I will put NAs in the PIP values for now and will document this.
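A rough sketch of how a varsN_info entry could then be built for the top variant only (the hitN_info layout is not documented here, so the beta argument is an assumption; PIP is written as NA as described above):

```python
# Sketch: build a varsN_info entry in the old VAR_ID,PIP,BETA form for the top hit.
# PIP is not available in the new data, so it is filled with NA.
def make_vars_info(hit: str, beta: str) -> str:
    return f"{hit},NA,{beta}"

print(make_vars_info("chr1_12345_A_G", "0.42"))  # -> chr1_12345_A_G,NA,0.42
```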
➤ Anastasia Kytölä commented:
Mitja Kurki
➤ Mitja Kurki commented:
Yeah, NA for now is good!
➤ Anastasia Kytölä commented:
Mitja Kurki Arto Lehisto
Reformatted the data and imported it to the temporary database analysis_r12_v1 in the production-releases-pheweb-database Cloud SQL instance. Made 2 PRs for updates:
Next steps would be:
One issue that I had in the past is that I couldn't update the colocalization/causal_variant tables directly while they were being used by PheWeb. This is the workaround I used:
Here is full documentation for the updates: gs://bucket-anastasia/pheweb/colocalization/r12/new_colocs_04092024/readme.md (phewas-development project).
Should adjust the db schema to add the new column there, plus update the UI
Coloc data is here: gs://zz-red/pipeline/resources/R12_coloc/colocQC.tsv.gz
The main new column we want to show is PP.H4.abf, and in the UI the table should be sorted by it by default.
Issue is synchronized with this Wrike task by Unito. Attachments: columns_mapping.xlsx