Closed bussec closed 1 year ago
Providing such information was already requested in #91; however, it was decided that this is not truly minimal information (e.g. it is rather pointless to provide this for lab strains of mice). Given the idea to move it into a human-specific extension, this decision has now been revisited.
I had a look for ontologies that could provide the country codes:
country node has pan-240 leaves, cross-referencing to GAZ
The Ensembl database uses a simplified way of dealing with population genetics, for example when searching for the frequency of a particular SNP. It uses the 1000 Genomes categories of 'Population' and 'Sub-population'.
'Population' is, essentially, a continental or regional designation, with five categories - AFR, AMR, EAS, EUR and SAS (African, American, East Asian, European and South Asian).
Within each of these are various sub-populations - for example, Finnish from the European set or YRI from the African.
There is an advantage to using these classifications as many within the human populations field will already be familiar with this system. It also avoids loaded terms like ethnicity.
I'm not certain that country of origin of the sample is going to be useful in all circumstances - many population studies are done on groups that have ancestries distinct from the country where they currently live (for example the Gujarati in Houston population set of the 1000 genomes study).
Having such a Population and Sub-population classification may also help to keep the samples as anonymous as possible.
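The Population / Sub-population grouping described above can be sketched as a simple two-level lookup. This is only an illustration: the table holds just the three sub-populations named in this thread (FIN, YRI, and GIH for the Gujarati in Houston set), and the names `SUB_TO_POPULATION` and `classify` are hypothetical, not part of any AIRR or Ensembl API.

```python
# The five 1000 Genomes 'Population' categories listed above.
POPULATIONS = {
    "AFR": "African",
    "AMR": "American",
    "EAS": "East Asian",
    "EUR": "European",
    "SAS": "South Asian",
}

# Illustrative subset only: the sub-populations mentioned in this thread.
SUB_TO_POPULATION = {
    "FIN": "EUR",  # Finnish
    "YRI": "AFR",  # Yoruba
    "GIH": "SAS",  # Gujarati in Houston
}


def classify(sub_population):
    """Return (population code, population label) for a sub-population code."""
    code = sub_population.strip().upper()
    if code not in SUB_TO_POPULATION:
        raise KeyError(f"unknown sub-population: {sub_population!r}")
    population = SUB_TO_POPULATION[code]
    return population, POPULATIONS[population]
```

Note that the lookup records ancestry grouping only, so a GIH sample maps to SAS regardless of the country in which it was collected.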
One additional thing that could be useful for population studies is an indication of familial relationships. It would be good to have a means to track this (for example indicating one case is the child or sibling of other cases in the same study).
Brief comment on the last point: MiAIRR already includes the linked_subjects and link_type fields on the subject level. We discussed the creation of a more complex structure on the study level a couple of months ago (#308), but did not see any pressing need for it at this stage.
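As a rough sketch of how linked_subjects and link_type could express the child or sibling relationships asked about above: the subject IDs and the link_type value "child" below are made up for illustration, and the helper `relatives_of` is hypothetical rather than part of any AIRR tooling.

```python
# Hypothetical subject records; only the three field names come from the schema.
subjects = [
    {"subject_id": "S1", "linked_subjects": None, "link_type": None},
    {"subject_id": "S2", "linked_subjects": "S1", "link_type": "child"},
    {"subject_id": "S3", "linked_subjects": "S1", "link_type": "child"},
]


def relatives_of(subject_id, records):
    """Return (subject_id, link_type) pairs for records that link to subject_id."""
    return [
        (record["subject_id"], record["link_type"])
        for record in records
        if record.get("linked_subjects") == subject_id
    ]
```

Because each record carries a single link, anything beyond simple pairwise relationships would need the more complex study-level structure discussed in #308.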
Forwarding from one of @lgcowell's collaborators:
EuPathDB has been using the OBI country name to denote some geographic location (NOTE_1). When needed we've used GAZ terms as instances [...] However, VectorBase [has its] own representation of geographical locations called VBGEO. See here (NOTE_2) and choose "GADM/VBGEO PlaceNames". [The] goal is to refactor VBGEO into something OBO Foundry compliant, [thinking about] starting with GAZ but open to other ways.
NOTE_1: OBI_0001627 is just a field, it does not directly link to an ontology.
NOTE_2: Link currently down (failed DNS lookup), URL confirmed
Thanks Christian. I am confused by Note_1, because OBI is an ontology (Ontology of Biomedical Investigation).
The below-mentioned exchange was quite a long time ago. We could circle back and see if they have made any progress …
Sorry, @lgcowell, I realized that my previous comment was not precise: Yes, OBI is of course an ontology, however country name (aka OBI_0001627) is only a leaf node within OBI. Therefore it does not provide anything we could use as a controlled vocabulary.
I see. Thank you for clarifying.
For country of origin, ISO 3166 seems the obvious choice. It is consistently maintained and unlikely to fall into disuse. Updates are tracked, which could simplify database maintenance over time.
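A minimal sketch of what normalising a country field against ISO 3166 might look like. The thread does not say which representation (alpha-2, alpha-3 or name) would be stored, so returning alpha-3 here is an assumption, and the three-entry table and the function name `to_alpha3` are purely illustrative; a real implementation would load the full, maintained code list.

```python
# Hand-picked ISO 3166-1 entries (alpha-2, alpha-3, short name) for illustration.
ISO_3166_SAMPLE = [
    ("FI", "FIN", "Finland"),
    ("NG", "NGA", "Nigeria"),
    ("US", "USA", "United States of America"),
]

# Map every accepted spelling (case-insensitive) to the alpha-3 code.
_LOOKUP = {}
for alpha2, alpha3, name in ISO_3166_SAMPLE:
    for key in (alpha2, alpha3, name):
        _LOOKUP[key.casefold()] = alpha3


def to_alpha3(value):
    """Normalise a country code or name to its ISO 3166-1 alpha-3 code."""
    key = value.strip().casefold()
    if key not in _LOOKUP:
        raise ValueError(f"not a known ISO 3166-1 entry: {value!r}")
    return _LOOKUP[key]
```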
I think it's important to add a population (ethnicity) classification as @MartinMatthewC mentions. Some studies will focus on specific groups and may have a very precise classification that could be difficult to derive from a published ontology. Many other studies may only be able to provide a broad classification along the lines of HANCESTRO's Ancestry Category. One approach could be to have a free text field to support the first use case, and a restricted vocabulary to address the second.
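The two-field approach suggested above (one restricted vocabulary, one free-text description) could look roughly like this. Everything here is hypothetical: the class and attribute names are invented, and the allowed set reuses the five Population codes from earlier in this thread as a stand-in, since the actual vocabulary (e.g. HANCESTRO terms) has not been decided.

```python
from dataclasses import dataclass
from typing import Optional

# Stand-in vocabulary (assumption): the five Population codes mentioned above.
ALLOWED_CATEGORIES = frozenset({"AFR", "AMR", "EAS", "EUR", "SAS"})


@dataclass
class PopulationAnnotation:
    category: Optional[str] = None     # restricted vocabulary, broad classification
    description: Optional[str] = None  # free text, study-specific classification

    def __post_init__(self):
        # Only the restricted field is validated; free text is stored as given.
        if self.category is not None and self.category not in ALLOWED_CATEGORIES:
            raise ValueError(f"category not in vocabulary: {self.category!r}")
```

A study with a precise classification would fill in both fields, while one that can only give a broad grouping would leave the description empty.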
@williamdlees @MartinMatthewC
MiniStd accepted the fields today during our call. This is the first MiAIRR Extension we are implementing, so I have now created PR #318 as a first draft, as I expect some more discussion on the schema side that we can resolve in parallel.
The commit right now only contains the fields discussed in #264, but we can introduce an additional population field using the 1kGP terms. However, I am still looking for an ontology of those; so far the best source I have found is at Ensembl.
Furthermore, I assigned GAZ as draft ontology to the fields. I don't want to forestall anything with this, I would just like to mention that GAZ does contain/map to ISO 3166 codes. How complete they are is IMO something for OntoVoc.
Background: Geographical information about the origin of a subject is important for human population genetics of Ig/TCR loci. However, to be broadly applicable, we should capture information that is frequently available (GPS coordinates are not), allows unambiguous encoding and does not present major challenges for privacy.
Proposal: Add the following two fields into the AIRR Schema, however do not include them in MiAIRR (see Note 2 below):
country_birth on the subject level
collection_country on the sample level
Notes: