BioPsyk / cleansumstats

Convert GWAS sumstat files into a common format with a common reference for positions, rsids and effect alleles.
https://biopsyk.github.io/metadata/#!/form/cleansumstats

Change metadata schema according to web form feedback #143

Closed pappewaio closed 3 years ago

pappewaio commented 3 years ago

This will revolutionise the whole metafile concept, and I am excited about the quality we will be able to offer. I took a look at the most recent version that @rzetterberg sent on Slack, tried to fill it in to see what feedback I could come up with, and here it is:


Ok I stop here for now 📟

rzetterberg commented 3 years ago

Wow, this is great feedback :1st_place_medal:

rzetterberg commented 3 years ago

During the datacore meeting we also spoke about adding a title for each field, instead of using its unique key.

AndrewSchork commented 3 years ago

Some thoughts:

How much flexibility do we have in the design of the page? I'm wondering if there is a way to make it more compact, where we show minimal information unless it is requested - hidden under an info button, for example. See this:

[Screenshot: Screen Shot 2021-02-22 at 12 38 10]

I am fine with titles instead of variable names. Some suggestions:

- cleansumstats_metafile_user // User
- cleansumstats_metafile_date // Date -> could be auto generated when clicking the download?
- path_sumStats // GWAS sumstats:
- path_readMe // Sumstats documentation:

- study_PMID // Pubmed ID for the publication associated with sumstats:
- study_Title // Publication title:
- study_Year // Publication year:
- path_pdf // Publication PDF:
- path_supplementary // Publication supplementary information:

- study_PhenoDesc // Free description of the trait associated with the GWAS sumstats
- study_PhenoCode // Standardized phenotype code

- study_FilePortal // URL for GWAS sumstats repository
- study_FileURL // URL for direct download
- study_AccessDate // Date of download
- study_Use // Are the GWAS sumstats publicly shared?

- study_includedCohorts // Contributing GWAS cohorts:
- study_Ancestry // Ancestry of GWAS cohorts:
- study_Gender // Gender of GWAS cohorts:
- study_PhasePanel // Phasing reference panel:
- study_PhaseSoftware // Phasing software:
- study_ImputePanel // Imputation reference panel:
- study_ImputeSoftware // Imputation software:
- study_Array // Genotyping array:
- study_Notes // Special considerations and notes:

- stats_TraitType // GWAS trait type:
- stats_Model // GWAS statistical model:
- stats_TotalN // Total sample size
- stats_CaseN // Case sample size
- stats_ControlN // Control sample size
- stats_GCMethod // Approach to genomic inflation correction (GC):
- stats_GCValue // GC adjustment factor
- stats_Notes // Special considerations and notes:

- col_CHR // Chromosome:
- col_POS // Position:
- col_SNP // SNP Identifier:
- col_EffectAllele // Statistical effect reference allele (EA):
- col_OtherAllele // Other allele (OA):
- col_BETA // Per allele effect (i.e., regression coefficient, beta):
- col_SE // Standard error of beta:
- col_OR // Odds ratio (OR):
- col_ORL95 // Lower 95% confidence bound of OR:
- col_ORU95 // Upper 95% confidence bound of OR:
- col_Z // Hypothesis test statistic (e.g., Z, Wald, t):
- col_P // P-value
- col_N // Per SNP total sample size
- col_CaseN // Per SNP case sample size
- col_ControlN // Per SNP control sample size
- col_INFO // Imputation quality score
- col_EAF // EA frequency in the GWAS cohorts
- col_OAF // OA frequency in the GWAS cohorts
- col_Direction // Cohort effect directions (e.g., +++)
- col_Notes // Special considerations and notes:
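
To make the key-to-title idea concrete, here is a minimal sketch of how a form could look up a display title and fall back to the unique key when no title has been defined. The `FIELD_TITLES` dictionary and the `field_label` helper are hypothetical names for illustration only; the actual schema layout is not shown in this thread.

```python
# Hypothetical mapping from metadata keys to display titles, using a few of
# the suggestions above. The real schema may store titles differently.
FIELD_TITLES = {
    "study_PMID": "Pubmed ID for the publication associated with sumstats",
    "stats_TotalN": "Total sample size",
    "col_P": "P-value",
}


def field_label(key: str) -> str:
    """Return the display title for a field, falling back to its key."""
    return FIELD_TITLES.get(key, key)
```

With this fallback, fields that have not been given a title yet would still render with their key, so titles could be added gradually.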

General Notes:

cleansumstats_version The version should be filled in by the pipeline, automatically. It's more about tracing what was done, rather than annotating the raw sumstats file.

study_Use Perhaps if the answer is no to the above question, then a prompt appears to fill in restrictions and an "owner" or "contact"?

study/stats_notes Do we need two notes sections? Could we concatenate answers from other sections?

col_AFREQ Deprecated by EAF and OAF - we can remove

rzetterberg commented 3 years ago

> How much flexibility do we have in the design of the page? I'm wondering if there is a way to make it more compact, where we show minimal information unless it is requested - hidden under an info button, for example. See this:

100%, because I ended up writing my own form implementation :smile:

The existing implementations were very entangled with different JavaScript frameworks, so using them would have made it really hard to do custom forms and fields. So I decided to write my own implementation, since it was trivial.

pappewaio commented 3 years ago

Having the form opens up a new world when it comes to notes like "study/stats_notes". @rzetterberg would it be possible to add a "default note" button to many fields and collect notes from different sections? Then we could remove the specific study/stats notes

rzetterberg commented 3 years ago

> would it be possible to add a "default note" button to many fields and collect notes from different sections?

Yes, no problem! All field widgets can access the data structure that contains all field values. So you can easily pick values from other fields for auto-fill.
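
As a rough sketch of the idea (not the form's actual code, which is not shown here), collecting per-field notes into one combined note could look like this. The `collect_notes` function and the shape of its input, a mapping from field key to note text, are assumptions made for illustration.

```python
def collect_notes(field_notes: dict[str, str]) -> str:
    """Join non-empty per-field notes into one text block, one line per field.

    `field_notes` is a hypothetical mapping from field key to the note the
    user entered for that field; empty or whitespace-only notes are skipped.
    """
    lines = []
    for key, note in field_notes.items():
        note = note.strip()
        if note:
            lines.append(f"{key}: {note}")
    return "\n".join(lines)


combined = collect_notes({
    "stats_TotalN": "Approximate",
    "col_P": "  ",
    "study_Array": "Mixed arrays",
})
```

Prefixing each line with the field key keeps the origin of every note visible, which is what would allow the separate study/stats notes fields to be dropped.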

rzetterberg commented 3 years ago

Right now, the fields stats_ControlN, stats_CaseN and stats_TotalN have the type number. This means that you can enter real numbers and that will be valid.

But they should be natural numbers, right? You can't have a study where you used 1000.5 cases, right?
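
A minimal sketch of the stricter check, assuming the values arrive as already-parsed numbers: if the form's schema follows JSON Schema, the equivalent change would be switching the type from `number` to `integer` and adding `"minimum": 0`. The helper name `is_natural_number` is hypothetical, and whether 0 should be accepted is an assumption.

```python
def is_natural_number(value: object) -> bool:
    """True for non-negative whole numbers; rejects 1000.5, -3 and booleans."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool):
        return False
    return isinstance(value, int) and value >= 0
```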

AndrewSchork commented 3 years ago

Yes, decimals don't make sense for the "N" variables.

rzetterberg commented 3 years ago

AndrewSchork commented 3 years ago

Sorry, I'm a bit late to this.

There is important information here that we do need to "require" somehow - How do we cite this data when we use it?

- PMID: ensures we can cite the data when used.
- If there is no PMID: a DOI preprint link does the same.
- If there is no preprint link: we probably need a contact or data owner. Perhaps one solution is that if there is no PMID or preprint link, the data cannot be listed as public, but is set to restricted and a contact (name, email) is required.

citing data is critical! :-)
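
The rule proposed above can be sketched as a small check. This is only an illustration under stated assumptions: the function name `check_citation`, its parameters, and the `"restricted"` value for the sharing field are hypothetical and not taken from the real schema.

```python
from typing import Optional


def check_citation(
    pmid: Optional[str],
    doi: Optional[str],
    use: str,
    contact_name: Optional[str],
    contact_email: Optional[str],
) -> list[str]:
    """Return a list of problems; an empty list means the entry is citable
    or, failing that, restricted with a traceable contact."""
    problems: list[str] = []
    # A PMID or a DOI preprint link is enough to cite the data.
    if pmid or doi:
        return problems
    # Otherwise the data may not be public and a contact is required.
    if use != "restricted":
        problems.append("without a PMID or DOI the data must be set to restricted")
    if not (contact_name and contact_email):
        problems.append("without a PMID or DOI a contact name and email are required")
    return problems
```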

rzetterberg commented 3 years ago

What we can do is set the field to required again and add a third option that represents studies that do not have a PubMed ID or a DOI link:

[Screenshot: 2021-03-01-163753_937x250_scrot]

This third option could be any type of reference, but by doing this it forces the user to make a conscious decision on what reference to use. This will hopefully avoid situations where people do have a pubmed or DOI link, but they miss filling it in, when the field was not required.

rzetterberg commented 3 years ago

> we probably need a contact or data owner. Perhaps one solution is that if there is no PMID or preprint link, the data can not be listed as public, but is set to restricted and a contact (name, email) is required.

The third option I suggested above could also be used to fill that in, if we'd like.

pappewaio commented 3 years ago

Sounds good to have it like that. Then people will feel forced to enter the correct ID

rzetterberg commented 3 years ago

I've updated the metadata form so that you can create a metadata file either from scratch or from an existing metadata file.

I have fixed most of the feedback from this issue, can you please have a look at the current version of the form and tell me if you are content with the changes or if we should change something else? @pappewaio @AndrewSchork

Here's the URL: https://biopsyk.github.io/metadata/

You should see something like this:

[Screenshot: 2021-03-16-100205_1482x916_scrot]

If you don't, you need to clear the browser cache so that the latest files are downloaded.

AndrewSchork commented 3 years ago

Amazing! A few comments/suggestions:

Publication ID - can the pubmed ID be the default (first) tab?

Are the GWAS sumstats publicly shared?*
If restricted is selected, we need a place to report an owner/contact email

Genotyping Array* (not required)

I would like to edit/modify some of the ontologies - names in the drop down menus, descriptions, etc.

Is there a place I can do that? In the coding formats? Or is it best to put them in an intermediate place for you to update?

joejeroe commented 3 years ago

I have a list of adjustments:

  1. Publication PDF Name to the study PDF as referenced in the study_PMID field. This file must be located in the same directory as the metadata file.

- This is a bit confusing, so just give an example, e.g., Gadin_bioinf2018.pdf

  2. Publication title * Title of the PMID'd publication associated with the stats. Should be one line (no new line characters) and no tabs. All other characters are acceptable.

- Give an example: "A genome-wide association study of shared risk across psychiatric disorders implicates gene regulation during fetal neurodevelopment"

  3. URL for GWAS sumstats repository and URL for direct download

- Give example

  4. Genotyping Array

- This is a small number of options? Don't we have more in our inventory already?

  5. Contributing GWAS cohorts

- Why is this not a drop-down menu but something that needs to be included extra?

  6. Building on what Andrew said. After somebody said "restricted data" this can only be finalised when the study controller is provided as well. Else you risk having datasets that are reported restricted and nobody knows why when people move out of the IBP.

rzetterberg commented 3 years ago

Please see my answers below to your questions. All requests without questions have been added as tasks at the end of this comment.

Andrew: Publication ID - can the pubmed ID be the default (first) tab?

Yes, the tabs are rendered in the order that the options appear in the schema, so it's just a matter of moving the "PubMed ID" option first.
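
As a small illustration of that reordering, assuming the options are held as a list of dictionaries with a `key` entry (a hypothetical layout; the real schema is not shown in this thread):

```python
def move_option_first(options: list[dict], key: str) -> list[dict]:
    """Return a new list with the option whose 'key' matches placed first.

    The relative order of all other options is preserved.
    """
    first = [o for o in options if o.get("key") == key]
    rest = [o for o in options if o.get("key") != key]
    return first + rest


# Hypothetical option keys for the publication ID tabs.
options = [{"key": "doi"}, {"key": "pmid"}, {"key": "other"}]
reordered = move_option_first(options, "pmid")
```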

Andrew: If restricted is selected, we need a place to report an owner/contact email

Joeri: 6. Building on what Andrew said. After somebody said "restricted data" this can only be finalised when the study controller is provided as well. Else you risk having datasets that are reported restricted and nobody knows why when people move out of the IBP.

We could show additional fields when this option is selected. What additional fields would you like? Two new required fields, one for name and one for email?

Andrew: I would like to edit/modify some of the ontologies - names in the drop down menus, descriptions, etc. Is there a place I can do that? In the coding formats? Or is it best to put them in an intermediate place for you to update?

Yes, we can set up your local computer with a copy of the web form and the schema. Then you can edit the schema locally and preview the web form after each change. When you are done, you can commit the changed schema file to GitHub.

I'll reach out to you on Slack later today when I have prepared the web form for local use.

Joeri: 4. Genotyping Array This is a small number of options? Don't we have more in our inventory already?

I took these options from the Google Doc that was provided in the original metadata template file (https://docs.google.com/spreadsheets/d/1qghudJelGssaTbe8CDAOHOk7fhpyDAwEKGkOBMqGb3M/edit#gid=321787909)

Basically, all dropdown options you see in the form were taken from that document.

Joeri: 5. Contributing GWAS cohorts Why is this not a drop-down menu but something that needs to be included extra?

The way this field is supposed to work is that the user should be able to select one or many of these values:

- "none"
- "iPSYCH2012"
- "iPSYCH2015"
- "UKB"
- "GEMS"

But I haven't finished implementing drop downs with multiple selections yet, so it's using the standard way of showing array fields, which is a "New" button and a list of values.
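
Until the multi-select dropdown exists, the intended constraint on the array field can be sketched as a validation step. The function name `validate_cohorts` is hypothetical, and treating "none" as exclusive of the other values is an assumption rather than something stated above.

```python
# Allowed values as listed above.
ALLOWED_COHORTS = ["none", "iPSYCH2012", "iPSYCH2015", "UKB", "GEMS"]


def validate_cohorts(selected: list[str]) -> list[str]:
    """Return a list of problems with the selected cohorts (empty if valid)."""
    problems: list[str] = []
    if not selected:
        problems.append("select at least one value")
    unknown = [c for c in selected if c not in ALLOWED_COHORTS]
    if unknown:
        problems.append("unknown cohorts: " + ", ".join(unknown))
    if len(set(selected)) != len(selected):
        problems.append("duplicate values")
    # Assumption: "none" only makes sense on its own.
    if "none" in selected and len(selected) > 1:
        problems.append('"none" cannot be combined with other cohorts')
    return problems
```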

Todos