HumanCellAtlas / metadata-schema

This repo is for the metadata schemas associated with the HCA
Apache License 2.0

Determine guidelines for harmonised representation of organs and systems across datasets. #1133

Closed zperova closed 4 years ago

zperova commented 4 years ago

Wranglers need to determine guidelines for choosing the organ and organ part, as well as assignment to a system if needed.

Currently, there are inconsistencies in ingested datasets for the organ and organ part fields. For example:

In the Census of Immune Cells: https://data.humancellatlas.org/explore/projects/cc95ff89-2e68-4a08-a234-480eca21ce79
Organ: immune system
Organ part: bone marrow

In Bone marrow plasma cells from hip replacement surgeries: https://data.humancellatlas.org/explore/projects/a29952d9-925e-40f4-8a1c-274f118f1f51
Organ: hematopoietic system
Organ part: bone marrow
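
Discrepancies like the two above could be found mechanically. Here is a minimal sketch, assuming project metadata is available as simple (project, organ, organ_part) records. The record layout and the function name are hypothetical and not part of any HCA tooling:

```python
from collections import defaultdict

# Values taken from the two projects quoted above; the tuple layout is an assumption.
RECORDS = [
    ("Census of Immune Cells", "immune system", "bone marrow"),
    ("Bone marrow plasma cells from hip replacement surgeries",
     "hematopoietic system", "bone marrow"),
]


def conflicting_organ_parts(records):
    """Return {organ_part: sorted organs} for parts filed under more than one organ."""
    organs_by_part = defaultdict(set)
    for _project, organ, organ_part in records:
        organs_by_part[organ_part].add(organ)
    return {
        part: sorted(organs)
        for part, organs in organs_by_part.items()
        if len(organs) > 1
    }


print(conflicting_organ_parts(RECORDS))
# {'bone marrow': ['hematopoietic system', 'immune system']}
```

Grouping by organ part rather than by project keeps the report short: each entry is one term that wranglers need to agree on.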

zperova commented 4 years ago

An overview of organs/organ parts can be found in the following ticket: https://github.com/HumanCellAtlas/hca-data-wrangling/issues/319#issuecomment-535486528

mshadbolt commented 4 years ago

In Tabula Muris, Organ is currently bladder organ. This is a general term that could refer to either a gall bladder or a urinary bladder, so I believe it should be corrected to either gall bladder - UBERON:0002110 or urinary bladder - UBERON:0001255.

I have issues with different levels of anatomy being used within the same field. For example, both colon and large intestine are listed; if we consider large intestine to be the organ, then I would argue colon should be used as the organ_part instead.

A similar thing applies to blood: if we are using hematopoietic system for bone marrow, then shouldn't we also use it as the organ for blood?

There's also an issue with lymph node and mediastinal lymph node. I think we should be consistent about which top-level organ we want to use for this. In HumanTissueTcellActivation, mediastinal lymph node has been used for both organ and organ_part.

I think we need to address the mouse melanoma dataset that does not have an organ listed due to being a weird xenograft tumor experiment

I don't know whether the best way forward is to ensure we as wranglers are being consistent, or whether there is a way to use the ontology hierarchy somehow, where we give it an ontology term and it returns the high-level 'organ' for that term, or whether this is something we could create ourselves.

jahilton commented 4 years ago

I just did some digging on these yesterday so this is perfect (and saves folks from many many many questions I had)! Glad to see the focus on aligning these properties.

One route you could go: the first step is to decide on a list of acceptable 'organ' options. Then annotate each sample's organ_part with the most specific ontology term possible. Then use the ontological relationships to see which organs from your list are parent terms of the specific organ_part. This ensures consistency across submissions, reduces the number of times you annotate a sample to one, and uses the ontologies to make the links for you. Also, organ would need to be a list, as the bone marrow example above is a child term of both systems, so neither is wrong.
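
The lookup described above could be sketched roughly as follows. This is an illustration only: the PARENTS table is a toy hierarchy typed in by hand (not read from or checked against UBERON), APPROVED_ORGANS is a hypothetical curated list, and organs_for is a made-up name.

```python
# Toy hierarchy: child label -> parent labels. Illustrative only; a real
# implementation would read these relationships from the ontology.
PARENTS = {
    "bone marrow": ["hematopoietic system", "immune system"],
    "colon": ["large intestine"],
    "mediastinal lymph node": ["lymph node"],
    "lymph node": ["immune system"],
}

# Hypothetical curated list of acceptable 'organ' values.
APPROVED_ORGANS = {"hematopoietic system", "immune system", "large intestine"}


def organs_for(organ_part):
    """Return every approved organ that is the term itself or one of its ancestors."""
    found = set()
    seen = set()
    stack = [organ_part]
    while stack:
        term = stack.pop()
        if term in seen:
            continue  # already visited via another path
        seen.add(term)
        if term in APPROVED_ORGANS:
            found.add(term)
        stack.extend(PARENTS.get(term, []))
    return sorted(found)


print(organs_for("bone marrow"))             # ['hematopoietic system', 'immune system']
print(organs_for("mediastinal lymph node"))  # ['immune system']
```

Returning a list rather than a single value reflects the point that a term such as bone marrow can sit under more than one approved parent.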

mshadbolt commented 4 years ago

Yes I was thinking similarly about perhaps curating a list of top level organs and using the ontology hierarchy to fill in all the terms between that organ and the most specific part term we can get from the contributor. I am also consulting with Zoe here to see if she has any ideas on whether there is an easy way of getting an organ from a child term. Wranglers have also proposed having a session about implementing these kinds of cool ontology things, particularly in the browser for searching, in other components so if you want to come on board as a supporter please do!

You are also right about the bone marrow example. The immune vs hematopoietic system choice was decided a while ago, during an effort to harmonise metadata (before I arrived), but I don't think it was really documented anywhere. So it's totally understandable that we have both terms at the moment, since, like you said, they are both correct.

lauraclarke commented 4 years ago

@simonjupp Is this something that having a Zooma curated list would help with? Or do you have any other tools that might help here?

jahilton commented 4 years ago

ENCODE uses a Python script that I could easily and quickly strip down so that a wrangler can feed in an UBERON/CL/EFO term and get back the organ parent terms. Ideally, this could be calculated at submission, but in the meantime, wranglers could run that Python script to help fill in spreadsheets.

mshadbolt commented 4 years ago

In my conversation with @zoependlington, she also suggested that if we go with the curated list approach, we could add a property to the hcao indicating that a term is on the 'approved' list, e.g. HCA_organ, which could then presumably be incorporated as a validation check.
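
A validation check of that kind might look something like the sketch below. Everything here is an assumption for illustration: the APPROVED table stands in for whatever terms end up tagged in the hcao, the row layout is invented, and validate_organs is not an existing validator function. The two UBERON IDs are the ones quoted earlier in this thread; EXAMPLE:0000001 is a deliberately fake ID.

```python
# Hypothetical stand-in for the set of terms tagged as approved in the ontology.
APPROVED = {
    "UBERON:0001255": "urinary bladder",
    "UBERON:0002110": "gall bladder",
}


def validate_organs(rows):
    """Return a list of (row_number, message) for rows whose organ is not approved."""
    problems = []
    for number, row in enumerate(rows, start=1):
        term_id = row.get("organ")
        if term_id not in APPROVED:
            problems.append((number, f"organ {term_id!r} is not on the approved list"))
    return problems


rows = [
    {"organ": "UBERON:0001255"},
    {"organ": "EXAMPLE:0000001"},  # fake ID, used only to trigger a failure
]
print(validate_organs(rows))
# [(2, "organ 'EXAMPLE:0000001' is not on the approved list")]
```

Collecting all problems instead of stopping at the first one lets a wrangler fix a whole spreadsheet in one pass.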

lauraclarke commented 4 years ago

When implementing such a system, it would be good to consider how to avoid failing valid organ systems, and how long an update would take if we missed something and needed to add it later.

Otherwise, sounds like a good plan

zperova commented 4 years ago

Labs working on the same organ view it as a part of a different system depending on what type of questions they are studying. This is one of the reasons for the "discrepancy" in the Data Browser. In Immune Cell Atlas bone marrow is listed under immune system and in Human Hematopoietic Profiling bone marrow is under haematopoietic system (not surprisingly). We definitely need a discussion of how to harmonize it but without the implementation of ontologies in the Browser, this might not solve the problem.

zperova commented 4 years ago

Another thing I envision we need to have is a way of monitoring and updating datasets upon introduction of new ontologies - and not just for the organs but in general.

simonjupp commented 4 years ago

Uberon contains a major_organ subset. I've just extracted it below. How suitable is this?

bone element
brain
esophagus
eye
gonad
heart
kidney
large intestine
liver
lung
mouth
nose
pancreas
prostate gland
skin of body
small intestine
spinal cord
stomach
thymus
trachea
ureter
urethra
urinary bladder

You can easily get this out of the OLS API with:

curl -X GET 'https://www.ebi.ac.uk/ols/api/search?q=*&ontology=uberon&queryFields={label}&fieldList=label&slim=major_organ&rows=250' | jq -r '.response.docs[].label'

There's also an organ_slim that contains a lot more:
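
For wranglers who prefer a script to a shell pipeline, the same request could be sketched in Python. This is a rough equivalent under some assumptions: the URL and parameters are copied from the curl command above, the braces around label are dropped on the assumption that they were only curl syntax, and the response shape (response.docs[].label) is taken from the jq filter. The function names are made up, and whether the endpoint still responds this way is not checked here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint copied from the curl command in this thread.
OLS_SEARCH = "https://www.ebi.ac.uk/ols/api/search"


def slim_query_url(slim, ontology="uberon", rows=250):
    """Build the search URL for all terms in the given subset (slim)."""
    params = {
        "q": "*",
        "ontology": ontology,
        "queryFields": "label",
        "fieldList": "label",
        "slim": slim,
        "rows": rows,
    }
    return OLS_SEARCH + "?" + urlencode(params)


def labels_from_response(payload):
    """Pull the labels out of a parsed response, mirroring the jq filter."""
    return [doc["label"] for doc in payload["response"]["docs"]]


def fetch_slim_labels(slim):
    """Fetch and return the labels for a subset; needs network access."""
    with urlopen(slim_query_url(slim)) as resp:
        return labels_from_response(json.load(resp))
```

Calling fetch_slim_labels("major_organ") would then be the counterpart of the curl and jq pipeline; the two helper functions are kept separate so the parsing can be exercised without a network connection.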

Harderian gland
Hatschek's nephridium
abdominal external oblique muscle
abdominal internal oblique muscle
abomasum
adrenal gland
adrenal/interrenal gland
amphid sensory organ
ampulla of Lorenzini
ampullary organ
apocrine gland
apocrine sweat gland
arthropod neurohemal organ
articular capsule
biliary tree
bone element
brain meninx
bronchus
bulbo-urethral gland
bursa of Fabricius
cardiac gastric gland
cardiac stomach
carotid body
cartilage element
cavitated compound organ
cervical thymus
chest organ
chordotonal organ
clitoris
coccygeus muscle
coccyx
cochlear modiolus
compound eye
compound organ
coronal organ
cranial salt gland
crista ampullaris
crypt of Lieberkuhn
diaphragm
digestive system gland
dorsal pancreas
duodenal gland
dura mater
ear
eccrine sweat gland
endocrine gland
endocrine pancreas
endometrial gland
esophagus
exocrine gland
exocrine pancreas
external female genitalia
external genitalia
external male genitalia
eye
eye gland
eye sebaceous gland
eye skin gland
eyelid tarsus
female reproductive gland
female reproductive organ
fibrous pericardium
gall bladder
gastric gland
gizzard
gland of anal canal
gonad
hair follicle
head kidney
heart
hemipenis
hemolymphoid system gland
hemopoietic organ
immune organ
indifferent external genitalia
indifferent gonad
inferior parathyroid gland
internal female genitalia
internal genitalia
internal male genitalia
interrenal gland
intestinal gland
intromittent organ
ischiocavernosus muscle
kidney
lacrimal gland
large intestine
larynx submucosa gland
least splanchnic nerve
leptomeninx
liver
lobar bronchus
longus colli muscle
lower back skin
lung
main bronchus
major salivary gland
major vestibular gland
male reproductive gland
male reproductive organ
mammary gland
membranous labyrinth
meninx
mesobronchus
mesonephros
metanephros
minor salivary gland
minor vestibular gland
mucous gland
muscle organ
nasopharyngeal gland
olfactory gland
orbital septum
oropharyngeal gland
osphradium
otolith organ
ovary
pancreas
paraaortic body
parathyroid gland
pectoralis major
pectoralis minor
pelvic region element
penis
peptonephridium
pharyngeal gland
pharyngeal slit
pharyngotympanic tube
pineal body
pituitary gland
placenta
placenta metrial gland
preputial gland
pronephros
prostate gland
proventriculus
pubic symphysis
pygostyle
pyloric stomach
quadratus lumborum
rectal salt gland
rectus abdominis muscle
rectus capitis lateralis muscle
reproductive gland
reproductive organ
respiration organ
ruminant forestomach
saccule of membranous labyrinth
saliva-secreting gland
salt gland
scrotal sweat gland
sebaceous gland
segmental bronchus
seminal fluid secreting gland
seminal vesicle
serous gland
skeletal element
skeletal joint
skin gland
skin mucous gland
small intestine
solid compound organ
spermaceti organ
spinal cord
spinal cord arachnoid mater
spinal cord pia mater
spinal dura mater
spleen
statocyst
stomach
stomatodeum gland
superior parathyroid gland
supraneural body
sweat gland
swim bladder
synovial bursa
syrinx organ
tarsal gland
terminal bronchus
testis
thoracic thymus
thymoid
thymus
thyroid gland
tongue
trachea gland
tracheobronchial tree
transversus abdominis muscle
transversus thoracis
trunk region element
tympanic membrane
ureter
urethra
urethral gland
urinary bladder
uropygial gland
uterus
utricle of membranous labyrinth
vagina sebaceous gland
ventral pancreas
ventrobronchus
vermiform appendix
viscus
vomeronasal organ

mshadbolt commented 4 years ago

I think the first list doesn't have enough organs, and the second is far too permissive, i.e. there are organs that aren't in humans...

lauraclarke commented 4 years ago

So would it be possible to curate a list that is an extension of the first list? Then we can ask folks like Krishnaa from CBTM or Marc Haluska to review it and see if it looks right.

zperova commented 4 years ago

@lauraclarke this is a very good way to go forward - ask our contributors whether this fits their needs. We will have to curate and update the list as we go forward with new organs, and especially with spatial work. Another thing to consider/discuss is asking contributors, at the time of the questionnaire, where their organ/organ part falls in the HCA curated list of organs/systems. Since it takes some time to add a missing term, this will save us time and could be done while we are working on the rest of the metadata.

zperova commented 4 years ago

Since there is interest from the Seed Network participants on the sample metadata, I also think it would make sense to send them info on what kind of metadata we expect and also the list of organs to review. Some prep work beforehand will make the process easier in the long run.

lauraclarke commented 4 years ago

Adding it to the questionnaire is a great plan. It would be good to work with @gabsie and @morrisonnorman to figure out what the timeline will be for moving from Google Forms to the questionnaire being integrated in the UI.

jahilton commented 4 years ago

What is the value of having both organ and organ_part - for the contributor and/or consumer?

mshadbolt commented 4 years ago

I see it as very valuable for many contributors and consumers to know both a broader organ and a specific organ part. Some scientists are interested in very specific parts of an organ and would want to subset data to that part. Having a broader organ gives an easy way to subset at that level. We are an atlas where we are trying to give consumers the ability to 'zoom in' and 'zoom out' to the level that they are interested in, and if utilised correctly, ontologies give us the ability to provide that.

jahilton commented 4 years ago

we are trying to give consumers the ability to 'zoom in' and 'zoom out' to the level that they are interested in

Love this user story (more compelling than a contributor-focused user story to view their data on multiple levels, I think). Currently, if a user wanted to 'zoom in' on everything, they'd need to see if there is an organ_part specified, and, if not, look at the organ. Does it make sense to always fill in both properties, even if they are the same?
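
The fallback a consumer currently has to do by hand can be written down in a few lines. This is a minimal sketch assuming each sample is a plain dictionary with optional organ and organ_part keys; the layout and the function name are hypothetical.

```python
def most_specific_term(sample):
    """Return organ_part when it is filled in, otherwise fall back to organ."""
    return sample.get("organ_part") or sample.get("organ")


samples = [
    {"organ": "immune system", "organ_part": "bone marrow"},
    {"organ": "lung"},  # no organ_part recorded
]
print([most_specific_term(s) for s in samples])
# ['bone marrow', 'lung']
```

If both properties were always filled in, consumers could read organ_part directly and this fallback would not be needed.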

diekhans commented 4 years ago

It seems odd that these are two fields. Isn't the zoom in/out exactly what an ontology is supposed to do?

The documentation in the schema isn't helpful in explaining the difference. So if this does get figured out, it would be really useful to update the schemas.

zperova commented 4 years ago

@diekhans yes, this issue is a workaround because currently there is no ontology implementation in the Data Browser.

diekhans commented 4 years ago

@zperova, has the Data Browser team been approached about this issue? It seems a bit painful to create a field in the metadata to deal with this. Perhaps we can create an intermediate hack?

Let me know how I can help.

mshadbolt commented 4 years ago

We are planning on having a session at the upcoming F2F about encouraging implementation of ontologies in smarter ways by other components so hopefully we get some things sorted out at that session

zperova commented 4 years ago

@diekhans yes, the Data Browser team is aware of the ontology expansion (to my knowledge), but its implementation has been stalled due to other priorities. The F2F session is a step towards it and your input will be very helpful if you decide to attend :) We are not proposing to create a new metadata field but to have an agreement among wranglers on which terms to use in organ and organ_part until the ontology expansion has been implemented. @mshadbolt was pulling some statistics of what we have in the Browser and the discrepancies make us laugh and cry at the same time.

diekhans commented 4 years ago

@zperova I will do my best to be there. Just let me know how I can assist.

zperova commented 4 years ago

@diekhans the F2F session has been cancelled in favour of solving the versioning and AUDR issues. The ontology expansion conversation will resume in 2020, when it is on the DCP-wide Roadmap.

mshadbolt commented 4 years ago

I think that we have in general agreed that having a subset of organs within the hcao will work as a solution. The next step will be assembling the list, which will require wider community input.

Once we have the list we can ask Zoe to make the changes to hcao and then look at implementation of the requirement that an organ must be part of this subset within the spreadsheet validator.

The side issue that has come up is what to do about specimens from tissues that aren't technically organs.

But this is out of scope for this ticket, so I will close it in favour of creating a ticket to start assembling a set of allowed organ ontology terms.