HumanCellAtlas / metadata-schema

This repo is for the metadata schemas associated with the HCA
Apache License 2.0

Added disease_adjacent to specimen_from_organism. Fixes #1512. #1514

Closed idazucchi closed 1 year ago

idazucchi commented 1 year ago

Release notes

For specimen_from_organism.json schema:

Why are these changes needed?

This field is useful for distinguishing 'normal' healthy tissue from 'disease-adjacent' tissue, which is often affected by its proximity to the pathology area. It has been requested by the Skin Bionetworks, as it is of value to their metadata and analysis.

Reviews requested

This is a minor schema change; all DCP2 reviewers need to review.

ESapenaVentura commented 1 year ago

LGTM. Just a bio question: do we have any guidelines on what adjacency is? Let's say we find a paper that states the tissue was taken 5 cm away from the affected tissue. Is that adjacent? Have you found or heard anything about that from the bionetworks?

NoopDog commented 1 year ago

Yes, how would you represent the disease the specimen was adjacent to? If the donor had multiple diseases how would you encode what the adjacent disease was?

It looks to me like we may want to:

  1. Record the adjacent disease, if any, as a member of the disease ontology (array) rather than as a boolean.
  2. If there is an "adjacent disease" to the sample, make sure that the disease is also recorded in the donor's diseases.

Otherwise, even if you use donorOrganism.KnownDiseases to record the "adjacent to" disease, you can't tell which of the donor's possibly multiple diseases were adjacent to the sample.
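
The second point above could be checked mechanically. Below is a minimal sketch, assuming diseases are recorded as lists of ontology ID strings; the function name and the `EXAMPLE:` IDs are hypothetical placeholders and not part of the HCA schema or tooling.

```python
def missing_donor_diseases(donor_diseases, adjacent_diseases):
    """Return the adjacent diseases that are not also recorded on the donor.

    Both arguments are lists of ontology ID strings (an assumption made
    for this sketch, not the actual HCA metadata layout).
    """
    recorded = set(donor_diseases)
    return [d for d in adjacent_diseases if d not in recorded]


# Hypothetical placeholder IDs, used only for illustration.
donor = ["EXAMPLE:0001", "EXAMPLE:0002"]

# Consistent: the adjacent disease is also listed on the donor.
print(missing_donor_diseases(donor, ["EXAMPLE:0002"]))

# Inconsistent: EXAMPLE:0003 is adjacent to the sample but missing on the donor.
print(missing_donor_diseases(donor, ["EXAMPLE:0003"]))
```

An empty result means every adjacent disease is also present at the donor level; anything returned would be a candidate for a validation warning.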

idazucchi commented 1 year ago

Thanks for the suggestion @NoopDog ! I've changed the disease_adjacent to associated_diseases which imports the disease module to describe the adjacent disease. We already record the adjacent disease at the donor level as a part of our best practices, but it's useful to be able to record it at the specimen level more clearly.

@ESapenaVentura we don't have any precise guidelines from the bionetworks for what qualifies as disease-adjacent. I would rely on what is stated in the paper to determine whether a tissue is disease-adjacent, but we can also reach out to the bionetworks and see if there's a consensus on distance.

hannes-ucsc commented 1 year ago

Having both diseases and adjacent_diseases introduces the possibility of both being populated. Is that intended and does it make sense?

If we go this route and don't change anything in Azul, adjacent_diseases would not be visible in the Data Browser or usable for filtering or sorting. Is it more likely that people who use the Data Browser to filter by a particular disease would expect the search results to include specimens from tissue adjacent to tissue affected by that disease, even if the tissue the specimen was collected from doesn't have that disease?

NoopDog commented 1 year ago

> Having both diseases and adjacent_diseases introduces the possibility of both being populated. Is that intended and does it make sense?

It makes sense to me that diseases would always be populated with adjacent_diseases. These are two different facts.

  1. The donor has a disease.
  2. Diseased tissue is adjacent to the sample site.

Current practice is:

> We already record the adjacent disease at the donor level as a part of our best practices, but it's useful to be able to record it at the specimen level more clearly.

Of course, 1 can be inferred from 2 above, but making it explicit in the database relieves all clients of having to make this inference, and to me this seems desirable.

idazucchi commented 1 year ago

Hi @hannes-ucsc

> Having both diseases and adjacent_diseases introduces the possibility of both being populated. Is that intended and does it make sense?

Let's say a donor has a kidney tumor and they undergo nephrectomy to get rid of the tumor. This donor donates two tissue samples:

A. one from the tumor site
B. one 3cm away from the tumor.

We want to describe the specimens in this way:

A. disease: tumor
B. disease: normal, adjacent_diseases: tumor

This is to highlight that although the tissue is believed to be healthy it was taken from a site in proximity of a disease and could be affected by it. You can see how in this scenario we expect that disease and adjacent_diseases might be both filled at the same time and it makes sense to do so.

We don't need adjacent_diseases to be indexed, anyone who is interested in a specific disease can filter for it at the donor level and the adjacent_diseases specimens, although healthy, will be selected
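
For illustration, the two specimens in this scenario could be written out as plain records. This is only a sketch: the dictionary layout, the free-text disease labels and the helper `is_disease_adjacent` are assumptions made for the example, not the actual specimen_from_organism schema.

```python
# Specimen A: taken from the tumor site itself.
specimen_a = {
    "id": "A",
    "diseases": ["kidney tumor"],
    "adjacent_diseases": [],
}

# Specimen B: taken 3 cm away; believed healthy but close to the tumor.
specimen_b = {
    "id": "B",
    "diseases": ["normal"],
    "adjacent_diseases": ["kidney tumor"],
}


def is_disease_adjacent(specimen):
    """True when the specimen is recorded as normal but lists an adjacent disease."""
    return specimen["diseases"] == ["normal"] and bool(specimen["adjacent_diseases"])


for specimen in (specimen_a, specimen_b):
    print(specimen["id"], is_disease_adjacent(specimen))
```

In this layout both fields are filled for specimen B at the same time, which is the case discussed above.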

hannes-ucsc commented 1 year ago

> You can see how in this scenario we expect that disease and adjacent_diseases might be both filled at the same time and it makes sense to do so.

Got it. Thanks.

> We don't need adjacent_diseases to be indexed, anyone who is interested in a specific disease can filter for it at the donor level and the adjacent_diseases specimens, although healthy, will be selected

I see. Given that, Azul will ignore the adjacent_diseases field. If a donor has multiple different tumors, and specimens were collected from tissue adjacent to only one of these tumors, all of the donor's specimens (and the files derived from them) will still match a filter that specifies any one of the tumor diseases. I think that's an acceptable conflation.
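
The conflation can be seen in a toy model. The sketch below is not Azul's implementation; the record layout, the disease labels and both function names are hypothetical and only contrast a donor-level filter with a filter on adjacent_diseases.

```python
# A donor with two different tumors (labels are placeholders).
donor = {"diseases": ["tumor X", "tumor Y"]}

# Only S1 was collected next to a tumor, and only next to tumor X.
specimens = [
    {"id": "S1", "diseases": ["normal"], "adjacent_diseases": ["tumor X"]},
    {"id": "S2", "diseases": ["normal"], "adjacent_diseases": []},
]


def match_by_donor_disease(donor, specimens, disease):
    """Donor-level filter: if the donor has the disease, every specimen matches."""
    if disease in donor["diseases"]:
        return [s["id"] for s in specimens]
    return []


def match_by_adjacent_disease(specimens, disease):
    """Specimen-level filter on adjacent_diseases (not indexed in this proposal)."""
    return [s["id"] for s in specimens if disease in s["adjacent_diseases"]]


print(match_by_donor_disease(donor, specimens, "tumor Y"))
print(match_by_adjacent_disease(specimens, "tumor Y"))
```

Filtering on "tumor Y" at the donor level returns both specimens, although neither was collected near that tumor; a filter on adjacent_diseases would return none.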