airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

Add country information to subject and sample #265

Closed bussec closed 1 year ago

bussec commented 5 years ago

Background: Geographical information about the origin of a subject is important for human population genetics of Ig/TCR loci. However, to be broadly applicable, we should capture information that is frequently available (GPS coordinates are not), allows unambiguous encoding, and does not present major privacy challenges.

Proposal: Add the following two fields to the AIRR Schema, but do not include them in MiAIRR (see Note 2 below):

Notes:

  1. Choice of encoding:
    • ISO 3166-1 also contains historical entities and allows for subdivision of federal states via ISO 3166-2.
    • Alpha-2 coding is well known, as it is also used for internet top-level domains.
    • Alternatives:
      a. ISO 3166-1 Alpha-3: Less well known; the increase in human readability is marginal.
      b. ISO 3166-1 English short name (ALL CAPS): Manual encoding is error-prone; requires implementation as a controlled vocabulary.
      c. UN Stats Division code (currently M49): Numerical code, not human-readable.
  2. Non-MiAIRR status: The two fields are expected to be moved (along with a set of other human-specific fields) into the proposed Human Population Genetics Extension (#264), which would remove them (again) from the MiAIRR core specification. Hence it seems appropriate not to include them at this point.
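
As a rough illustration of the alpha-2 encoding discussed in Note 1, the sketch below checks a value's format and then looks it up in a code list. The function name `is_valid_country_code` and the three-entry `KNOWN_CODES` subset are assumptions made for this example and are not part of the AIRR Schema; a real implementation would load the full ISO 3166-1 list.

```python
import re

# Illustrative subset only (assumption); a real validator would load
# the complete ISO 3166-1 alpha-2 list.
KNOWN_CODES = {"DE", "US", "GB"}

# Alpha-2 codes are exactly two uppercase ASCII letters.
ALPHA2_PATTERN = re.compile(r"[A-Z]{2}")


def is_valid_country_code(value):
    """Return True if value is a well-formed alpha-2 code found in KNOWN_CODES."""
    if not isinstance(value, str):
        return False
    if ALPHA2_PATTERN.fullmatch(value) is None:
        return False
    return value in KNOWN_CODES
```

Separating the format check from the list lookup makes it possible to report "malformed" and "unknown code" as different errors if that turns out to be useful.
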
bussec commented 5 years ago

Providing such information was already requested in #91; however, it was decided that this is not truly minimal information (e.g., it is rather pointless to provide this for lab strains of mice). Given the idea to move it into a human-specific extension, this decision has now been revisited.

bussec commented 5 years ago

I had a look for ontologies that could provide the country codes:

MartinMatthewC commented 4 years ago

The Ensembl database uses a simplified way of dealing with population genetics, for example when searching for the frequency of a particular SNP. They use the 1000 Genomes categories of 'Population' and 'Sub-population'. 'Population' is, essentially, a continental or regional designation, with five categories: AFR, AMR, EAS, EUR and SAS (African, American, East Asian, European and South Asian). Within each of these are various sub-populations, for example Finnish from the European set or YRI from the African.

There is an advantage to using these classifications, as many within the human populations field will already be familiar with this system. It also avoids loaded terms like ethnicity. I'm not certain that the country of origin of the sample is going to be useful in all circumstances: many population studies are done on groups that have ancestries distinct from the country where they currently live (for example the Gujarati in Houston population set of the 1000 Genomes study). Having such a Population and Sub-population classification may also help to keep the samples as anonymous as possible.

One additional thing that could be useful for population studies is an indication of familial relationships. It would be good to have a means to track this (for example, indicating that one case is the child or sibling of other cases in the same study).
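
The Population / Sub-population classification described in this comment could be represented as a simple two-level lookup. This is only a sketch: the names `SUPER_POPULATIONS`, `SUB_POPULATIONS` and `super_population_of` are hypothetical, and the sub-population table contains just the three groups mentioned in the thread (using their 1000 Genomes codes FIN, YRI and GIH), not the full list.

```python
# The five super-population codes listed in the comment above.
SUPER_POPULATIONS = {
    "AFR": "African",
    "AMR": "American",
    "EAS": "East Asian",
    "EUR": "European",
    "SAS": "South Asian",
}

# Illustrative subset (assumption): only the sub-populations mentioned
# in this thread, mapped to their super-population code.
SUB_POPULATIONS = {
    "FIN": "EUR",  # Finnish
    "YRI": "AFR",  # Yoruba
    "GIH": "SAS",  # Gujarati in Houston
}


def super_population_of(sub_code):
    """Return the super-population code for a sub-population, or None if unknown."""
    return SUB_POPULATIONS.get(sub_code)
```
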

bussec commented 4 years ago

Brief comment on the last point: MiAIRR already includes the linked_subjects and link_type fields on the subject level. We discussed the creation of a more complex structure on the study level a couple of months ago (#308), but did not see any pressing need for it at this stage.
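
To make the subject-level linking concrete, here is a minimal sketch of how `linked_subjects` and `link_type` might be used to express a familial relationship. The record values, the `relatives` helper, and the assumption that `linked_subjects` holds a single subject ID string are illustrative only; consult the schema for the authoritative definitions.

```python
# Hypothetical subject records; only the three keys shown are used here.
subjects = [
    {"subject_id": "S1", "linked_subjects": None, "link_type": None},
    {"subject_id": "S2", "linked_subjects": "S1", "link_type": "child of"},
]


def relatives(subject_id, records):
    """Return (other_subject_id, link_type) pairs for records linking to subject_id."""
    return [
        (record["subject_id"], record["link_type"])
        for record in records
        if record["linked_subjects"] == subject_id
    ]
```
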

bussec commented 4 years ago

Forwarding from one of @lgcowell's collaborators:

EuPathDB has been using the OBI country name to denote some geographic location NOTE_1. When needed we've used GAZ terms as instances [...] However, VectorBase [has an] own representation of geographical locations called VBGEO. See here NOTE_2 and choose "GADM/VBGEO PlaceNames". [The] goal is to refactor VBGEO into something OBO Foundry compliant, [thinking about] starting with GAZ but open to other ways.

NOTE_1: OBI_0001627 is just a field, it does not directly link to an ontology.
NOTE_2: Link currently down (failed DNS lookup), URL confirmed.

bussec commented 4 years ago

Gazetteer seems worth looking at...

lgcowell commented 4 years ago

Thanks Christian. I am confused by Note_1, because OBI is an ontology (Ontology for Biomedical Investigations).

The below-mentioned exchange was quite a long time ago. We could circle back and see if they have made any progress …

bussec commented 4 years ago

Sorry @lgcowell, I realized that my previous comment was not precise: Yes, OBI is of course an ontology; however, country name (aka OBI_0001627) is only a leaf node within OBI. Therefore it does not provide anything we could use as a controlled vocabulary.

lgcowell commented 4 years ago

I see. Thank you for clarifying.

williamdlees commented 4 years ago

For country of origin, ISO 3166 seems the obvious choice. It is consistently maintained and unlikely to fall into disuse. Updates are tracked, which could simplify database maintenance over time.

I think it's important to add a population (ethnicity) classification as @MartinMatthewC mentions. Some studies will focus on specific groups and may have a very precise classification that could be difficult to derive from a published ontology. Many other studies may only be able to provide a broad classification along the lines of HANCESTRO's Ancestry Category. One approach could be to have a free text field to support the first use case, and a restricted vocabulary to address the second.
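
The two-field approach suggested here (a restricted vocabulary for broad classification plus free text for precise, study-specific descriptions) might look roughly like the sketch below. All names (`PopulationInfo`, `broad_category`, `detailed_description`) are hypothetical, and `BROAD_CATEGORIES` holds placeholder values, not actual HANCESTRO terms.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder vocabulary for illustration (assumption); a real
# implementation would use terms from the chosen ontology.
BROAD_CATEGORIES = {"category_a", "category_b"}


@dataclass
class PopulationInfo:
    # Restricted vocabulary: must be one of BROAD_CATEGORIES if given.
    broad_category: Optional[str] = None
    # Free text: accepts any study-specific description.
    detailed_description: Optional[str] = None

    def __post_init__(self):
        if (
            self.broad_category is not None
            and self.broad_category not in BROAD_CATEGORIES
        ):
            raise ValueError(f"unknown broad category: {self.broad_category!r}")
```

Both fields are optional in this sketch, so a study can supply whichever level of detail it has.
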

bussec commented 4 years ago

@williamdlees @MartinMatthewC

MiniStd accepted the fields today during our call. This is the first MiAIRR Extension we are implementing, so I have created PR #318 as a first draft, since I expect further discussion on the schema side that we can resolve in parallel.

The commit currently contains only the fields discussed in #264, but we can introduce an additional population field using the 1kGP terms. However, I am still looking for an ontology of those; so far the best source I have found is at Ensembl.

Furthermore, I assigned GAZ as the draft ontology for the fields. I don't want to preempt anything with this; I would just like to mention that GAZ does contain/map to ISO 3166 codes. How complete they are is IMO something for OntoVoc.