BioPsyk / cleansumstats

Convert GWAS sumstat files into a common format with a common reference for positions, rsids and effect alleles.
https://biopsyk.github.io/metadata/#!/form/cleansumstats

ontology: study_Ancestry #83

Open AndrewSchork opened 4 years ago

AndrewSchork commented 4 years ago

construct ontology for study_Ancestry

this one requires some thought. How many leaves of the tree do we want? How do we standardize? Are there published ontologies we could steal?

AndrewSchork commented 4 years ago

From Joeri:

I came across multiple other ancestry groups to include:

ASN: mixed Asian ancestry

BR: Brazilians --> This was mentioned as an ancestry in a meta-analysis. However, Brazilian is not an ancestry group. Brazil is highly diverse, with AMR, EUR, AFR, and EAS ancestry (with a clear geographical divide).

AS: Asian ancestry --> unspecified where they are from

I also have these 2, but potentially these can be referred to as AMR, at least when they are "non-white Hispanic": LAT: Latin American ancestry; HA: Hispanic ancestry.

What do people suggest with these populations? Should we include these as options?

AndrewSchork commented 4 years ago

https://docs.google.com/spreadsheets/d/1JowLmxixDu7oDYDtG984UZNU8HOxlFOZJq8bHD8jJ4E/edit?usp=sharing

see study_Ancestry tab.

AndrewSchork commented 3 years ago

We will use a standard code for the closest matching 1KGP population. This is because we can infer population data from there. I have updated the ontology. This should allow multiple ancestry codes, separated by commas or some other delimiter.

I updated the Google link to the ontology.

Both the broad and the narrower populations are there. We also need to update the metadata description for this variable to say:

study_Ancestry=
# It is important to note the genetic ancestry of the subjects in the study according to a structured code.
# You should choose the 1000 Genomes population that best represents the data. Usually this will be a super
# population: e.g., African (AFR), Native North or South American (AMR), East Asian (EAS), European (EUR),
# or South Asian (SAS). If a more specific population fits better, use that. If multiple populations are
# included, provide a list separated by commas.
# ontology: https://docs.google.com/spreadsheets/d/1qghudJelGssaTbe8CDAOHOk7fhpyDAwEKGkOBMqGb3M/
# external inventories: https://docs.google.com/spreadsheets/d/1NtSyTscFL6lI5gQ_00bm0reoT6yS2tDB3SHhgM7WwSE/
# options: AFR, AMR, EAS, EUR, SAS, <combinations of>, <character string>, missing
# example: study_Ancestry=EUR
# example: study_Ancestry=EUR,EAS
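
As a sketch, a validator for this field might split on the comma delimiter and check each code against the allowed set. This is hypothetical Python, not part of the pipeline; only the five super-population codes are listed here, but the set could be extended with the full ontology:

```python
# Hypothetical validator for the study_Ancestry metadata field.
# Assumes comma as the delimiter, per the description above.

SUPER_POPULATIONS = {"AFR", "AMR", "EAS", "EUR", "SAS"}

def parse_study_ancestry(value):
    """Split a study_Ancestry value and check each code.

    Returns the list of codes, or raises ValueError for codes
    outside the allowed set (extend SUPER_POPULATIONS with the
    1000 Genomes sub-population codes as needed).
    """
    if value in ("", "missing"):
        return []
    codes = [code.strip() for code in value.split(",")]
    unknown = [c for c in codes if c not in SUPER_POPULATIONS]
    if unknown:
        raise ValueError(f"unknown ancestry code(s): {unknown}")
    return codes

print(parse_study_ancestry("EUR,EAS"))  # ['EUR', 'EAS']
```
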
rzetterberg commented 3 years ago

We have a process called prepare_allele_frequency_stats which uses the value from study_Ancestry to get allele frequency of the given ancestry for each markername in the sumstats-file.

The AF-file looks something like this:

10:100000222    G       A       0       0       0.02    0.01    0
10:100000224    G       A       0       0       0       0       0
10:100000235    C       T       0.18    0.31    0.31    0.33    0.15

In this AF-file, each row represents the allele frequency of the available ancestries at a given markername. The header of this file would basically be:

chr:pos    a1    a2    EAS    EUR    AFR    AMR    SAS

So what the process prepare_allele_frequency_stats does, is that it takes the value in study_Ancestry and converts it to the column index of this file. It then retrieves the frequency for each markername according to this column index.
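
That lookup can be sketched as follows. This is illustrative Python, not the actual prepare_allele_frequency_stats implementation; the column order follows the example header above and the function name is an assumption:

```python
# Sketch: map an ancestry code to its column in the AF file and
# pull the frequency for each markername (chr:pos key).
# Column order follows the example header: chr:pos a1 a2 EAS EUR AFR AMR SAS

AF_COLUMNS = {"EAS": 3, "EUR": 4, "AFR": 5, "AMR": 6, "SAS": 7}

def allele_frequencies(af_lines, ancestry):
    """Return {markername: frequency} for the given ancestry."""
    col = AF_COLUMNS[ancestry]
    freqs = {}
    for line in af_lines:
        fields = line.split()
        freqs[fields[0]] = float(fields[col])
    return freqs

af_lines = [
    "10:100000222 G A 0 0 0.02 0.01 0",
    "10:100000235 C T 0.18 0.31 0.31 0.33 0.15",
]
print(allele_frequencies(af_lines, "AFR"))
# {'10:100000222': 0.02, '10:100000235': 0.31}
```
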

To implement the change described in this issue, how do we solve these scenarios:

  1. When multiple ancestries are given, which one should be used to retrieve the allele frequency from the AF-file?
  2. When a single ancestry is given that is not EAS, EUR, AFR, AMR or SAS, should allele frequency retrieval just be disabled? Or should we provide allele frequencies for all the new ancestries?

Feel free to correct me if my assumptions about how prepare_allele_frequency_stats works are wrong, I have derived my assumptions from reading the code, so I might be missing/misunderstanding something.

pappewaio commented 3 years ago

Good questions!

  1. If there is a mix, we should disable frequency retrieval. The reason is that the mix will make the frequency some kind of average over the population mix, which won't be reflected in the allele frequency reference.
  2. We should specify in the documentation which ancestries we can process right now; if the GWAS that is being cleaned has something else, then this addition of allele frequency is disabled. For our own analyses (and others') these ancestries are more than enough to do what we do, so there is no priority to include more.
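
The two rules above could be sketched as a single gate (hypothetical Python; the function name and set are illustrative, not pipeline code):

```python
# Hypothetical gate: allele-frequency retrieval is enabled only for
# a single, supported super-population code; mixes and unsupported
# codes disable it, per the two rules above.
SUPPORTED = {"EAS", "EUR", "AFR", "AMR", "SAS"}

def af_lookup_ancestry(study_ancestry):
    """Return the ancestry code to use for AF retrieval, or None to disable."""
    codes = [c.strip() for c in study_ancestry.split(",") if c.strip()]
    if len(codes) == 1 and codes[0] in SUPPORTED:
        return codes[0]
    return None
```
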
pappewaio commented 3 years ago

But I think we should add a lot of allowed ancestries for the meta file, as in the future it might be interesting for the cleaning, but it already might be interesting for downstream applications.

pappewaio commented 3 years ago

We could start with adding the whole 1000 genomes list of allowed ancestries

rzetterberg commented 3 years ago

In the document Andrew linked there are 32 codes; are there any other codes besides them?

pappewaio commented 3 years ago

Those are good, looks like 1000 genomes.

rzetterberg commented 3 years ago

  If there is a mix, we should disable frequency retrieval. The reason is that the mix will make the frequency some kind of average over the population mix, which won't be reflected in the allele frequency reference.

But in the example Andrew provided:

  You should choose the 1000 genomes population that best represents this data. After this will be a super
  population: e.g., African (AFR), Native North or South American (AMR), East asian (EAS), European (EUR),
  or South asian (SAS).

It seems that you are supposed to use multiple ancestries. First choose the 1000 genomes population and then choose a "super population".

For example, if I choose "Southern Han Chinese", which has the code CHS, I should also add the super population code "East Asian" (EAS).

So if that interpretation is correct, then most metadata files will have multiple ancestries, which means the allele frequency retrieval will be disabled in most cases.

pappewaio commented 3 years ago

Hmm, ok, right now there is no support for that, but you are right. This is where we can improve. I would like to add that as a feature for the next release (which likely would be ready before we publish the article). I could scan through the existing metafiles to see what we have, but I would guess 90% is "EUR" only.

rzetterberg commented 3 years ago

This is where we can improve. I would like to add that as a feature for next release (which likely would be ready before we are publishing the article).

Add what feature?

rzetterberg commented 3 years ago

Alright! But the description Andrew provided instructed the user to supply both the subgroup and EAS as a list. But you are talking about deriving EAS for the subgroups that are included under the EAS "super population".

So if it's unclear how we solve this and we want this to be in the next release, then I'll move it from the "beta" milestone.

AndrewSchork commented 3 years ago

it is typical to use the super populations (EUR, EAS, SAS, AMR, AFR) in place of their component populations (e.g., CHS). This is because most GWAS are conducted at this "continental" level of ancestry. I think for simplicity we should start with just these 5 populations as options, and maybe an NA (not available) and/or NM (no match) and/or TA (trans-ancestry). Using these 5 is a kind of "state-of-the-field" standard, although that is more due to intuition and convenience than any kind of process optimization, so I would go with that approach.

I think if we go to the sub-populations (the 32 codes), the allele frequencies may be computed from too few people to be useful.

The downstream sensitivity of this choice for other functions of the pipeline (inferring missing stats and QC'ing the sumstats) is an open scientific question, and we will have to do some analyses to better understand its importance. These could be part of a publication about the pipeline.

We could allow the user to enter, in the meta-file, any of the 32 sub-population codes or the 5 super-population codes, but translate to a super-population code behind the scenes for the actual calculations.
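
That behind-the-scenes translation could be a simple lookup from the 1000 Genomes sub-populations to their super populations. A sketch in Python, using the published Phase 3 grouping of the 26 populations (the function name is an assumption, not pipeline code):

```python
# Illustrative sub-population -> super-population translation,
# based on the 1000 Genomes Project Phase 3 grouping.
SUB_TO_SUPER = {
    **dict.fromkeys(["CEU", "TSI", "FIN", "GBR", "IBS"], "EUR"),
    **dict.fromkeys(["CHB", "JPT", "CHS", "CDX", "KHV"], "EAS"),
    **dict.fromkeys(["YRI", "LWK", "GWD", "MSL", "ESN", "ASW", "ACB"], "AFR"),
    **dict.fromkeys(["MXL", "PUR", "CLM", "PEL"], "AMR"),
    **dict.fromkeys(["GIH", "PJL", "BEB", "STU", "ITU"], "SAS"),
}
SUPER = set(SUB_TO_SUPER.values())

def to_super_population(code):
    """Accept a sub- or super-population code; return the super population
    used for the actual calculations."""
    if code in SUPER:
        return code
    return SUB_TO_SUPER[code]

print(to_super_population("CHS"))  # EAS
```
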

rzetterberg commented 3 years ago

Alright, thanks for the background info!

I don't see any technical problems with allowing the user to enter a mixture of sub- and super-population as a list in the study_Ancestry field and then figuring out how allele frequencies should be computed.

But I do think that the user could be confused easily, since they could enter the populations in two different ways and get the same result. For example, if their study used people from Finland and the UK, they could enter either the sub-population codes (FIN,GBR) or simply EUR into study_Ancestry and get the same result (allele frequency calculated from EUR).

Maybe it won't be an issue with the right instructions, but that's one thing that popped into my head spontaneously.

AndrewSchork commented 3 years ago

Oops, I had a response to this but I got distracted and it didn't send, apparently. This is a good point because the ancestry has two purposes:

1) annotating the file with useful information for downstream use cases (e.g., I want to make predictions in a specific population, so I want to select matching sumstats)
2) the pipeline needs to pull allele frequency information to complete the cleaned sumstats and (eventually) do some QC.

I think for (2) we should use super populations (an unproven intuition) because of the sample size issue. For (1), having the finest-grain information is best. This could be a pheno code / pheno description-like solution, where you pull down a menu and select a super population, but can free-text something more specific?

I think this would solve most issues, but adds an extra variable.

It's another issue where going fully dependent on free text won't let us do the auto QC, because we need an "entered population" -> "KGP super population" ontology to pull allele frequencies, and I do not trust user text to be easily parsable.

rzetterberg commented 3 years ago

This could be a pheno code / pheno description like solution, where you pull down a menu and select a super population, but can free text something more specific?

Sure, it could even be that you are only presented with the sub populations within the super populations you selected in the other field.

If we are using the web form we can polish the user experience as much as we want. The details and complexities of the actual metadata file can be hidden.

pappewaio commented 3 years ago

Also, in the description for that field we can add that, in the present version, only the super populations have support for extracting allele frequencies from 1KG.

AndrewSchork commented 3 years ago

I think if we split the information for the goals I described above into two variables, having the labeling one be free text might be optimal, since the compositions could get quite unique and definitely beyond the KGP sampling scheme. For fixing the stats, the KGP super populations (EUR, EAS, SAS, AFR, AMR) are probably sufficient.