BioPsyk / cleansumstats

Convert GWAS sumstat files into a common format with a common reference for positions, rsids and effect alleles.
https://biopsyk.github.io/metadata/#!/form/cleansumstats

ontology: study_phenoCode #78

Closed AndrewSchork closed 3 years ago

AndrewSchork commented 4 years ago

construct an ontology for the meta variable study_phenoCode

AndrewSchork commented 4 years ago

https://docs.google.com/spreadsheets/d/1JowLmxixDu7oDYDtG984UZNU8HOxlFOZJq8bHD8jJ4E/edit#gid=306635928

review and comment

rzetterberg commented 3 years ago

@AndrewSchork, if I understand this correctly, we want to make sure that the users of the pipeline can only supply study_phenoCode codes that are part of the list of codes in the document you linked to?

pappewaio commented 3 years ago

see my comment in issue #77

My opinion right now is that we should NOT force a specific ontology in the metafile. It is very complicated to get this right, so we need time and experience. It also does not matter too much for the cleaning itself; searchability matters more for the library. Therefore, I think we can tabularize what we have in the library and update the metafiles to keep the names consistent. The most important thing is to give a very detailed description. And as a suggestion from Ida Callesen, we could instead aim to have broad pheno_codes, like "neurological diseases". This would likely be precise enough to give good searchability in the library.

Precise classification is nice, but I think it is too much for us to create such a system at this point. @AndrewSchork, what are your thoughts?

rzetterberg commented 3 years ago

Alright, so closing this issue is fine also?

pappewaio commented 3 years ago

Yes, we can continue the discussion in #77. We don't need two issues for this 😸

AndrewSchork commented 3 years ago

I actually think a solution here is important. We do need a structured phenotype ontology so we know what the heck each study is actually studying. Free text is great one by one, but any batch operations will need an unambiguous annotation of traits to files. We could have a quick meeting to discuss the level at which this should be implemented.

See this example: https://atlas.ctglab.nl/traitDB

Without a well-defined set of nodes (here "uniqTrait"), we won't be able to group them, because every group will have infinite free-text possibilities and we will never be able to group by things like, in this example, domain, chapter, etc. It becomes important for analyses like Figure 3 in this paper:

https://www.nature.com/articles/s41586-020-03145-z

where you have thousands of items but to make sense of them they need to be grouped. Forcing the choice of a uniqTrait allows them to be grouped and ported to different group structures. I'd be happy to hear other solutions.
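To make the grouping argument concrete, here is a minimal sketch of what a fixed set of codes buys us. Everything in it is an assumption for illustration: the codes, the `domain` and `chapter` values, the file names, and the helper `group_studies` are all made up and do not come from the spreadsheet or the traitDB example.

```python
from collections import defaultdict

# Hypothetical ontology table: every code and grouping here is a placeholder.
ONTOLOGY = {
    "TRAIT_A": {"domain": "DOMAIN_1", "chapter": "CHAPTER_I"},
    "TRAIT_B": {"domain": "DOMAIN_1", "chapter": "CHAPTER_II"},
    "TRAIT_C": {"domain": "DOMAIN_2", "chapter": "CHAPTER_II"},
}

# Hypothetical metafile records: (file name, study_phenoCode).
STUDIES = [
    ("sumstat_1", "TRAIT_A"),
    ("sumstat_2", "TRAIT_B"),
    ("sumstat_3", "TRAIT_C"),
    ("sumstat_4", "TRAIT_A"),
]


def group_studies(studies, ontology, level):
    """Group file names by one level of the ontology (e.g. 'domain')."""
    groups = defaultdict(list)
    for name, code in studies:
        # A code missing from the ontology raises KeyError, which is the point:
        # free text cannot be placed in any group.
        groups[ontology[code][level]].append(name)
    return dict(groups)


by_domain = group_studies(STUDIES, ONTOLOGY, "domain")
by_chapter = group_studies(STUDIES, ONTOLOGY, "chapter")
```

Because each file carries one code from a closed list, the same files can be regrouped by any level just by changing the `level` argument. With free text there is no key to look up.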

AndrewSchork commented 3 years ago

I think issue #77 was different: it was about how we tell the pipeline where the ontology is for each variable. Your new solution with the meta-data file format is much smoother.

rzetterberg commented 3 years ago

A meeting sounds like a good idea! Technically this is trivial to implement, since we just give the schema a list of the allowed values and the validator will handle the rest.
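Roughly, the idea looks like the sketch below. It is not the pipeline's actual schema or validator: the `SCHEMA` dict only borrows the `enum` keyword from JSON Schema, the allowed codes are placeholders, and `validate` is a hypothetical stand-in for whatever validator we end up using.

```python
# Hypothetical schema fragment in the spirit of JSON Schema's "enum" keyword.
# The allowed codes are placeholders, not the real list.
SCHEMA = {
    "study_phenoCode": {"enum": ["TRAIT_A", "TRAIT_B", "TRAIT_C"]},
}


def validate(metadata, schema):
    """Return a list of error messages; an empty list means the metadata is valid."""
    errors = []
    for field, rule in schema.items():
        value = metadata.get(field)
        if value not in rule["enum"]:
            errors.append(f"{field}: {value!r} is not one of {rule['enum']}")
    return errors


ok = validate({"study_phenoCode": "TRAIT_A"}, SCHEMA)
bad = validate({"study_phenoCode": "some free text"}, SCHEMA)
```

Here `ok` is an empty list and `bad` contains one message naming the rejected value, so updating the ontology would only mean editing the list of allowed values.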

What we seem to disagree on is whether we should force users to use our phenotype definitions and, if so, which definitions we should use.

pappewaio commented 3 years ago

I am inspired by Richard's solution for storing different aggregations of ICD codes, which I think we should apply, but not in the cleaning itself. We should talk about it on Zoom so we all agree.