BioPsyk / cleansumstats

Convert GWAS sumstat files into a common format with a common reference for positions, rsids and effect alleles.
https://biopsyk.github.io/metadata/#!/form/cleansumstats

config file options for required variable ontologies #77

Closed AndrewSchork closed 3 years ago

AndrewSchork commented 4 years ago

Consider adding to the config file an ontology option:

ontology_study_PhenoCode="/path/to/ontology.txt"
ontology_study_PhenoDesc=NA

If a path is given, require that the entry in the meta analysis exactly matches a value in the first column of the file. This can help catch and correct typos in key variables, and structure the data in machine-analyzable ways. If the ontology is NA, then do not check the contents.

Check all variables first, then throw an error for each failing one, so that catching multiple errors does not require multiple reruns.
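
A minimal sketch of the check described above, not taken from the cleansumstats code base. It assumes the config has already been parsed into a dict, that the ontology file is tab-separated, and that each `ontology_<variable>` key constrains the metadata entry named `<variable>`. The function names `load_ontology_terms` and `check_ontologies` are hypothetical.

```python
import csv


def load_ontology_terms(path):
    """Return the set of values found in the first column of the ontology file.

    Assumption: the file is tab-separated; blank lines are skipped.
    """
    with open(path, newline="") as handle:
        return {row[0] for row in csv.reader(handle, delimiter="\t") if row}


def check_ontologies(config, metadata):
    """Check every ontology-constrained variable and collect all errors.

    Errors are gathered in a list instead of raised one at a time, so a
    single run reports every mismatch.
    """
    prefix = "ontology_"  # assumed naming convention: ontology_<variable>
    errors = []
    for key, ontology_path in config.items():
        # An ontology set to NA means "do not check the contents".
        if not key.startswith(prefix) or ontology_path == "NA":
            continue
        variable = key[len(prefix):]
        value = metadata.get(variable)
        # Exact match against the first column, as the proposal requires.
        if value not in load_ontology_terms(ontology_path):
            errors.append(
                f"{variable}: {value!r} not found in first column of {ontology_path}"
            )
    return errors
```

A caller would run `check_ontologies(config, metadata)` once and, if the returned list is non-empty, print every message before exiting with an error.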

BioPsykOLD commented 4 years ago

Isn't it better to throw an error if the ontology term is not in the ontology map? And if it is not an error, but actually a new ontology term, then we can quickly add it to our ontology map, which just means adding one more row to a file.

pappewaio commented 3 years ago

I think it can suffice for now to point the user, in the README, to the file where they can specify any ontology. But I will move this to the next release as a new feature.

rzetterberg commented 3 years ago

How is this issue supposed to be implemented when we have #78?

edit: Just wanted to note that I ask because I want to understand, even though I phrased the question like "how the hell are we supposed to do this?!" :sweat_smile:

pappewaio commented 3 years ago

My opinion right now is that we should NOT force a specific ontology in the metafile. It is very complicated to get this right, so we need time and experience. And it does not matter too much; it is more important for the library (not the cleaning) that this is searchable. Therefore, I think we can tabulate what we have in the library and update the metafiles to keep the names consistent. The most important thing is to give a very detailed description. And, as suggested by Ida Callesen, we could instead aim to have broad pheno_codes, like "neurological diseases". This would likely be precise enough to give good searchability in the library.

Precise classification is nice, but I think it is too much at this point for us to create such a system. @AndrewSchork, what are your thoughts?

rzetterberg commented 3 years ago

Alright, so this issue can be closed then?

pappewaio commented 3 years ago

That is my opinion. Maybe @joejeroe can add some thoughts here before we close it completely?

joejeroe commented 3 years ago

I have nothing to add.

AndrewSchork commented 3 years ago

I think this gets taken care of by the new schema which basically has this embedded in the metafile checker. It was my hack-y way to demand that some meta-file entries needed to have limited options.

rzetterberg commented 3 years ago

Yes, that is correct! The new schema has this sort of validation built-in.