AndrewSchork opened 4 years ago
From Joeri:
I came across multiple other ancestry groups to include:
ASN: mixed Asian ancestry
BR: Brazilians --> This was mentioned as an ancestry in a meta-analysis. However, Brazilian is not an ancestry group: Brazil is highly diverse, with AMR, EUR, AFR, and EAS ancestry (and a clear geographical divide).
AS: Asian ancestry --> unspecified where they are from
I also have these two, but potentially they can be referred to as AMR, at least when they are "non-white Hispanic": LAT: Latin American ancestry; HA: Hispanic ancestry.
What do people suggest with these populations? Should we include these as options?
We will use a standard code for the closest matching 1KGP project population, since we can infer population data from there. I have updated the ontology. It should allow multiple ancestry codes, separated by commas or some other delimiter.
I updated the Google link to the ontology.
Both the broad and the narrower populations are there. We also need to update the metadata description for this variable to say:
study_Ancestry=
# It is important to note the genetic ancestry of the subjects in the study according to a structured code.
# You should choose the 1000 Genomes population that best represents this data. After this will be a super
# population: e.g., African (AFR), Native North or South American (AMR), East Asian (EAS), European (EUR),
# or South Asian (SAS). If a more specific population fits better, use that. If multiple populations are
# included, provide a comma-separated list.
# ontology: https://docs.google.com/spreadsheets/d/1qghudJelGssaTbe8CDAOHOk7fhpyDAwEKGkOBMqGb3M/
# external inventories: https://docs.google.com/spreadsheets/d/1NtSyTscFL6lI5gQ_00bm0reoT6yS2tDB3SHhgM7WwSE/
# options: AFR, AMR, EAS, EUR, SAS, <combinations of>, <character string>, missing
# example: study_Ancestry=EUR
# example: study_Ancestry=EUR,EAS
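For illustration, the comma-separated format described above could be parsed with something like the sketch below. This is a hedged example, not the pipeline's actual parser; parse_study_ancestry is a hypothetical helper name.

```python
def parse_study_ancestry(line: str) -> list[str]:
    """Parse a 'study_Ancestry=EUR,EAS' style metadata line into a
    list of ancestry codes (hypothetical helper, not pipeline code)."""
    key, _, value = line.partition("=")
    if key.strip() != "study_Ancestry":
        raise ValueError(f"not a study_Ancestry line: {line!r}")
    # split on the comma delimiter and drop any stray whitespace
    return [code.strip() for code in value.split(",") if code.strip()]

print(parse_study_ancestry("study_Ancestry=EUR,EAS"))  # ['EUR', 'EAS']
```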
We have a process called prepare_allele_frequency_stats, which uses the value from study_Ancestry to get the allele frequency of the given ancestry for each markername in the sumstats file.
The AF-file looks something like this:
10:100000222 G A 0 0 0.02 0.01 0
10:100000224 G A 0 0 0 0 0
10:100000235 C T 0.18 0.31 0.31 0.33 0.15
In this AF-file, each row represents the allele frequency of the available ancestries at a given markername. The header of this file would basically be:
chr:pos a1 a2 EAS EUR AFR AMR SAS
So what the process prepare_allele_frequency_stats does is take the value in study_Ancestry and convert it to a column index of this file. It then retrieves the frequency for each markername according to that column index.
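Based on that description (my assumption about the behavior, not the pipeline's actual code), the lookup could be sketched like this, with the column order taken from the header shown above:

```python
# Columns of the AF file after the first three fields (chr:pos a1 a2),
# in the header order described above. Illustrative sketch only.
AF_COLUMNS = ["EAS", "EUR", "AFR", "AMR", "SAS"]

def lookup_allele_frequency(af_row: str, ancestry: str) -> float:
    """Return the allele frequency for `ancestry` from one AF-file row."""
    fields = af_row.split()
    # fields[0:3] are marker, a1, a2; the frequencies follow
    index = AF_COLUMNS.index(ancestry)
    return float(fields[3 + index])

row = "10:100000235 C T 0.18 0.31 0.31 0.33 0.15"
print(lookup_allele_frequency(row, "EUR"))  # 0.31
```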
To implement the change described in this issue, how do we solve the scenarios where study_Ancestry is something other than EAS, EUR, AFR, AMR, or SAS? Should allele frequency retrieval just be disabled, or should we provide allele frequencies for all the new ancestries? Feel free to correct me if my assumptions about how prepare_allele_frequency_stats works are wrong; I derived them from reading the code, so I might be missing or misunderstanding something.
Good questions!
But I think we should add a lot of allowed ancestries to the meta file, as it might be interesting for the cleaning in the future, and it might already be interesting for downstream applications.
We could start by adding the whole 1000 Genomes list of allowed ancestries.
In the document Andrew linked there are 32 codes; are there any other codes besides them?
Those are good; it looks like 1000 Genomes.
If there is a mix, we should disable frequency retrieval. The reason is that the mix will cause the frequency to be some kind of average over the population mix, which won't be reflected in the allele frequency reference.
But in the example Andrew provided:
You should choose the 1000 Genomes population that best represents this data. After this will be a super
population: e.g., African (AFR), Native North or South American (AMR), East Asian (EAS), European (EUR), or South Asian (SAS).
It seems that you are supposed to use multiple ancestries. First choose the 1000 genomes population and then choose a "super population".
For example, if I choose "Southern Han Chinese", which has the code CHS, I should also add the super-population code "East Asian", EAS.
So if that interpretation is correct, then most metadata files will have multiple ancestries. Which means the allele frequency retrieval will be disabled in most cases.
Hmm, OK, right now there is no support for that, but you are right. This is where we can improve. I would like to add that as a feature for the next release (which would likely be ready before we publish the article). I could scan through the existing metafiles to see what we have, but I would guess 90% are "EUR" only.
This is where we can improve. I would like to add that as a feature for next release (which likely would be ready before we are publishing the article).
Add what feature?
Alright! But the description Andrew provided instructed the user to supply both the subgroup and EAS as a list, whereas you are talking about deriving EAS for the subgroups that are included under the EAS "super population".
So if it's unclear how we solve this and we want this to be in the next release, then I'll move it from the "beta" milestone.
It is typical to use the super-populations (EUR, EAS, SAS, AMR, AFR) in place of their component populations (e.g., CHS). This is because most GWAS are conducted at this "continental" level of ancestry. For simplicity, I think we should start with just these 5 populations as options, plus maybe NA (not available) and/or NM (no match) and/or TA (trans-ancestry). Using these 5 is a kind of "state-of-the-field" standard, although that is more due to intuition and convenience than any kind of process optimization, so I would go with that approach.
I think if we go to the sub-populations (the 32 codes), the allele frequencies may be computed from too few people to be useful.
The downstream sensitivity of this choice for other functions of the pipeline (inferring missing stats and QC'ing the sumstats) is an open scientific question, and we will have to do some analyses to better understand its importance. These could be part of a publication about the pipeline.
We could allow the user to enter any of the 32 sub-population codes or the 5 super-population codes in the meta file, but translate to a super-population code behind the scenes for the actual calculations.
Alright, thanks for the background info!
I don't see any technical problems with allowing the user to enter a mixture of sub- and super-populations as a list in the study_Ancestry field and then figuring out how allele frequencies should be computed.
But I do think the user could easily be confused, since they could enter the populations in two different ways and get the same result. For example, if their study used people from Finland and the UK, they could enter either of the following values into study_Ancestry and get the same result (allele frequency calculated from EUR):
EUR, FIN, GBR (explicitly giving the super-population)
FIN, GBR (super-population derived from sub-populations)
Maybe it won't be an issue with the right instructions, but that's one thing that popped into my head spontaneously.
Oops, I had a response to this but I got distracted and it didn't send, apparently. This is a good point, because the ancestry has two purposes:
1) Annotating the file with useful information for downstream use cases (e.g., I want to make predictions in a specific population, so I want to select matching sumstats).
2) The pipeline needs to pull allele frequency information to complete the cleaned sumstats and (eventually) do some QC.
I think for (2) we should use super-populations (an unproven intuition) because of the sample-size issue. For (1), having the finest-grained information is best. This could be a pheno code / pheno description-like solution, where you pull down a menu and select a super-population, but can free-text something more specific?
I think this would solve most issues, but adds an extra variable.
The other issue is that going fully dependent on free text won't let us do the auto QC, because we need an "entered population" -> "KGP super population" ontology to pull allele frequencies, and I do not trust user text to be easily parsable.
This could be a pheno code / pheno description-like solution, where you pull down a menu and select a super-population, but can free-text something more specific?
Sure, it could even be that you are only presented with the sub-populations within the super-population you selected in the other field.
If we are using the web form we can polish the user experience as much as we want. The details and complexities of the actual metadata file can be hidden.
Also, in the description for that field we can note that, in the present version, only the super-populations support extracting allele frequencies from 1KG.
I think if we split the information for the goals I described above into two variables, having the labeling one be free text might be optimal, since the compositions could get quite unique and definitely go beyond the KGP sampling schema. For fixing the stats, the KGP super-populations (EUR, EAS, SAS, AFR, AMR) are probably sufficient.
construct ontology for study_Ancestry
This one requires some thought. How many leaves of the tree do we want? How do we standardize? Are there published ontologies we could steal?