Big-Life-Lab / PHES-ODM

The Public Health Environmental Surveillance Open Data Model (PHES-ODM, or ODM). A data model, dictionary and support tools for environmental surveillance.
Creative Commons Attribution Share Alike 4.0 International

`pooled` and recording pooled samples through the Parent-Child relationship #236

Closed mathew-thomson closed 2 years ago

mathew-thomson commented 2 years ago

Upon reviewing our current structure, it seems that we lost sight of how we were going to account for pooled samples in the ODM.

Rather than trying to blow up the samples table to figure out a way of capturing this information, it was proposed that we could use the boolean pooled variable as a switch for the parent-child relationships.

So for a parent sample, pooled may or may not = 1 (TRUE), but the child samples will have pooled = 0 (FALSE). This shows the direction of the parent-child relationship.

For a pooled sample, the multiple "parents" being pooled together will have pooled = 1 (TRUE), and the single sample created from pooling them will also have pooled = 1 (TRUE). However, the single "child" here will still be recorded as the parSampleID to the "parents" (recorded in the ERD as child samples).

This, while somewhat confusing, has the pooled field acting as a sort of "switch" on the directionality of parent-child relationships in the samples table.

A pooled sample can still have actual child samples as well, but these child samples would have pooled = 0 (FALSE).

This was discussed in a meeting with @jeandavidt @sorinsion @il43 and @DougManuel , but we're happy to hear any feedback or questions on this proposed structure as well.

DougManuel commented 2 years ago

The approach is robust, but technical. Good documentation and examples will help.

jeandavidt commented 2 years ago

I am wary of having an attribute (parent) mean two different things based on the value of another attribute (pooled).

To me, that jeopardizes one of the model's important features, which is the ability to be mapped onto an ontology: because then we are no longer only mapping between terms that have precise definitions, but are also trying to parse business logic (instead of simple x=>y associations, we end up with branching paths: if(x, y=>u, y=>v)). That could quickly make the model difficult to parse.
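To make the concern concrete, here is a minimal Python sketch of the branching a consumer would need under the proposed "switch". The function name `interpret_link` and the `(source, derived)` tuple convention are assumptions made for illustration; they are not part of the ODM.

```python
from typing import Optional, Tuple


def interpret_link(
    sample_id: str, par_sample_id: Optional[str], pooled: bool
) -> Optional[Tuple[str, str]]:
    """Return (source, derived) for one samples row under the 'switch' rule.

    Hypothetical helper: the meaning of par_sample_id depends on pooled.
    """
    if par_sample_id is None:
        return None  # no link recorded on this row
    if pooled:
        # pooled = TRUE: this row is a constituent and par_sample_id
        # points at the pooled sample made from it.
        return (sample_id, par_sample_id)
    # pooled = FALSE: ordinary child; par_sample_id is the sample it came from.
    return (par_sample_id, sample_id)
```

The same column yields opposite directions depending on `pooled`, which is the conditional mapping described above.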

Here is an example setup:

flowchart TD
    A --> V;
    B --> V;
    V --> X;
    V --> Y;
    V --> Z;
    E[Samples] --- F;
    F[Pooled] --- G[Subsamples];

What you are suggesting, I think, is:

| sampleID | parentID | pooled |
| --- | --- | --- |
| A | V | T |
| B | V | T |
| V | null | T |
| X | V | F |
| Y | V | F |
| Z | V | F |

Which merges the sample -> pooled and the pooled -> subsample links. I would instead suggest splitting up concerns like this using a new attribute ("pooledID"):

| sampleID | parentID | pooledID | pooled |
| --- | --- | --- | --- |
| A | null | V | false |
| B | null | V | false |
| V | null | null | true |
| X | V | null | false |
| Y | V | null | false |
| Z | V | null | false |

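As a rough sketch of how the split columns could be read, the second table can be loaded as plain Python dictionaries and queried with one lookup per concern. The helper names `constituents_of` and `subsamples_of` are hypothetical; only the column names and rows come from the table.

```python
# Rows copied from the pooledID table; None stands in for null.
samples = [
    {"sampleID": "A", "parentID": None, "pooledID": "V", "pooled": False},
    {"sampleID": "B", "parentID": None, "pooledID": "V", "pooled": False},
    {"sampleID": "V", "parentID": None, "pooledID": None, "pooled": True},
    {"sampleID": "X", "parentID": "V", "pooledID": None, "pooled": False},
    {"sampleID": "Y", "parentID": "V", "pooledID": None, "pooled": False},
    {"sampleID": "Z", "parentID": "V", "pooledID": None, "pooled": False},
]


def constituents_of(pooled_sample_id, rows):
    """Samples that were pooled into pooled_sample_id (hypothetical helper)."""
    return sorted(r["sampleID"] for r in rows if r["pooledID"] == pooled_sample_id)


def subsamples_of(parent_id, rows):
    """Samples split off from parent_id (hypothetical helper)."""
    return sorted(r["sampleID"] for r in rows if r["parentID"] == parent_id)
```

Neither lookup needs to check `pooled` to decide which way a link points; each column keeps a single meaning.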
sorinsion commented 2 years ago

I thought that the actual representation of your diagram would be:

| sampleID | parentID | pooled |
| --- | --- | --- |
| A |   | F |
| B |   | F |
| V | A | T |
| V | B | T |
| X | V | F |
| Y | V | F |
| Z | V | F |

Isn't this the case?

jeandavidt commented 2 years ago

It's one way to represent it, yes. But we can't use that representation for the ODM because sampleID is a primary key, so it has to be unique (you have V in two rows there).

sorinsion commented 2 years ago

In this case I think we should make a sampleID+parentID composite key and make this the PK of the table. This way you remove any ambiguity, for example, about which measurements, let's say temperature, refer to the initial sample vs its children
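A small sketch with Python's built-in `sqlite3` module shows the difference between the two key choices for the two V rows. The table names `samples_single` and `samples_composite` and the helper `try_insert` are made up for this illustration, and the schema is reduced to three columns; it is not the ODM samples table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# sampleID alone as the primary key.
conn.execute(
    "CREATE TABLE samples_single ("
    "sampleID TEXT PRIMARY KEY, parentID TEXT, pooled INTEGER)"
)
# sampleID + parentID as a composite primary key.
conn.execute(
    "CREATE TABLE samples_composite ("
    "sampleID TEXT NOT NULL, parentID TEXT NOT NULL, pooled INTEGER, "
    "PRIMARY KEY (sampleID, parentID))"
)

# The two rows that describe pooled sample V and its two sources.
rows = [("V", "A", 1), ("V", "B", 1)]


def try_insert(table, rows):
    """Insert rows one at a time and return how many were accepted."""
    accepted = 0
    for row in rows:
        try:
            conn.execute(f"INSERT INTO {table} VALUES (?, ?, ?)", row)
            accepted += 1
        except sqlite3.IntegrityError:
            pass  # rejected by the primary key constraint
    return accepted


single_accepted = try_insert("samples_single", rows)
composite_accepted = try_insert("samples_composite", rows)
```

With the single-column key the second V row is rejected; with the composite key both rows are stored. Any table that references a sample would then need both columns to identify one row.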

DougManuel commented 2 years ago

@jeandavidt good points. The ODM is complex, and we should opt for clarity. People will be more likely to understand the sample flow if there is both parentID and pooledID.

One aspect I find challenging is that people could be confused when samples are pooled and split at a later date. The sample data and lastUpdated will be important attributes. I suggest that we have a tutorial and explanations that include these scenarios.

DougManuel commented 2 years ago

@sorinsion I didn't quite follow your composite key suggestion. Would you mind making a database example using @jeandavidt graph example?

jeandavidt commented 2 years ago

@DougManuel About the point mentioned in #235, notice that in the last table of my example, only V (the actual pooled sample) has pooled=True. The constituent samples (A and B) are not affected.

DougManuel commented 2 years ago

I can come up with situations where our approaches break down. We'll need to be realistic about what we should represent in V2.0 (without a table of many-to-many relationships).

The end goal in a later ODM version is probably to have a data graph. At least, that is what I see described in provenance and lineage discussions. What we'll need to do is focus on the following: 1) the most common current use cases and try to represent that clearly. 2) a solution that can be refactored into a many-to-many table or relationship or graph.

[Screenshot attached]

jeandavidt commented 2 years ago

There is also the issue that one could potentially make different pooled samples from different combinations of source samples. Consider this:

flowchart TD
    A --> V;
    B --> V;
    B --> U;
    C --> U;
    V --> X;
    V --> Y;
    U --> Z;

In a case like this, try as we may, I don't think we can mash all the links in the samples table without having lists in some of the fields.

I hate to bring this up because I know you don't like those, but a lookup table like the one shown here could let us create the relationships between samples and add new ones as they are created.
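As a minimal sketch of that idea, the flowchart above can be stored with one row per link and indexed in both directions. The `(source, derived)` column order and the variable names are assumptions for illustration, not the layout of the linked example.

```python
from collections import defaultdict

# One row per arrow in the flowchart: (source sample, derived sample).
edges = [
    ("A", "V"),
    ("B", "V"),
    ("B", "U"),
    ("C", "U"),
    ("V", "X"),
    ("V", "Y"),
    ("U", "Z"),
]

parents = defaultdict(list)   # derived sample -> samples it was made from
children = defaultdict(list)  # source sample -> samples made from it
for source, derived in edges:
    parents[derived].append(source)
    children[source].append(derived)
```

Sample B feeds two different pooled samples here, and no field has to hold a list: each extra link is just another row.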

sorinsion commented 2 years ago

> @sorinsion I didn't quite follow your composite key suggestion. Would you mind making a database example using @jeandavidt graph example?

| samplePK | sampleID | parentID | pooled |
| --- | --- | --- | --- |
| A | A |   | F |
| B | B |   | F |
| VA | V | A | T |
| VB | V | B | T |
| WV | W | V | F |
| XWV | X | WV | F |
| YWV | Y | WV | F |
| ZWV | Z | WV | F |

I introduced a W sampleID which has V as a parent, where V is A+B. Is this comprehensible?

sorinsion commented 2 years ago

[image]

should translate into something like this:

| samplePK | sampleID | parentID | pooled |
| --- | --- | --- | --- |
| A | A |   | F |
| B | B |   | F |
| wA | w | A | T |
| wB | w | B | T |
| Vw | V | w | F |
| XV | X | V | F |
| YV | Y | V | F |
| sV | s | V | T |
| sC | s | C | T |
| Zs | Z | s | F |

Here I introduced synthetic "w" and "s" pooled sampleIDs that are referred to by the physical samples.

DougManuel commented 2 years ago

@jeandavidt A look-up table or relationship table is probably the way to go. What you've outlined in your above link is quite easy to follow, IMO, and robust. Maybe it is the best solution for version 2?

Does the relationship table that you created work for a graph database? Subject, Object and direction? I like the naming of the relationship -- that adds a lot of value.

I can envision a relationship table that could work for measures, methods, samples, and sites. subjectID, objectID. subjectType, objectType.

Pasting your link in case it gets deleted in the future.

[Two screenshots attached]

sorinsion commented 2 years ago

and this: [image]

should translate into this:

| samplePK | sampleID | parentID | pooled |
| --- | --- | --- | --- |
| A | A |   | F |
| B | B |   | F |
| wA | w | A | T |
| wB | w | B | T |
| Vw | V | w | F |
| XV | X | V | F |
| YV | Y | V | F |
| C | C |   | F |
| sB | s | B | T |
| sC | s | C | T |
| Zs | Z | s | F |

sorinsion commented 2 years ago

However, in my opinion the original sample, as an entity, is effectively destroyed when subsampling: it actually splits into a set of children, which should get their own IDs.

DougManuel commented 2 years ago

@sorinsion A common scenario for subsamples is that subsamples of the collected sample are sent to different labs, biobanks, etc., while the original sample is stored.

DougManuel commented 2 years ago

This discussion thread has clarified that the best solution is a 'relationship' table, as described by @jeandavidt. We've been trying to have a simpler solution, but we haven't found any. It feels like we've been trying to fit a square peg (a graph with many-to-many relationships and branches) into a round hole (a list or registry of samples).

Having a relationship table is the simplest, clearest approach.

However, regardless of the approach, we may want to add a 'purpose' field to the sample. The purpose field would follow the same concept as the purpose field of the measures table. I have been reviewing the approach to provenance and lineage, and that review has reinforced to me that we should have the who, what, where and why of sampling and measures. 'purpose' is the 'why'.

@jeandavidt could define the relationship table. What he has looks good to me, but we'd want to double-check that it conforms (generally) to a graph database.

I suggested that a relationship table could work for other similar relationships in other tables (measures and methods). Although true, I don't think we should refactor those tables, at least at this time. Rather, we should have relationships for just samples, although we may want to expand the use of a relationships table beyond samples in a later ODM version.

mathew-thomson commented 2 years ago

After group discussion, it was decided that the best way to deal with this issue is to break the parent-child relationships out into a separate relationships table.

So parSampleID is removed from the samples table.

There is a new table, sampleRelationships with three main parts:

  1. sampleIDSubject
  2. relationshipID
  3. sampleIDObject

Each row can be read as a sentence of the form:

[sampleIDSubject] is a [relationshipID] of [sampleIDObject]

The acceptable values for relationshipID are:

This allows for multiple children and/or multiple parents to be specified, and expresses a unidirectional lineage for a sample. For pooled samples and replicates, the repType attribute still specifies what kind of replicate it is, and the pooled attribute still is used as a boolean indicator for pooled samples.
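A minimal Python sketch of how rows in sampleRelationships might be read, assuming nothing beyond the three columns named above. The list of acceptable relationshipID values is not reproduced in this thread, so the value `"child"` below is a stand-in chosen for illustration, and the helpers `as_sentence` and `objects_of` are hypothetical.

```python
from typing import List, NamedTuple


class SampleRelationship(NamedTuple):
    sampleIDSubject: str
    relationshipID: str
    sampleIDObject: str


# Hypothetical relationshipID value; not taken from the ODM dictionary.
CHILD = "child"

relationships = [
    SampleRelationship("V", CHILD, "A"),  # V was pooled from A ...
    SampleRelationship("V", CHILD, "B"),  # ... and from B
    SampleRelationship("X", CHILD, "V"),  # X is a subsample of V
    SampleRelationship("Y", CHILD, "V"),
]


def as_sentence(rel: SampleRelationship) -> str:
    """Render one row in the '[subject] is a [relationship] of [object]' form."""
    return f"{rel.sampleIDSubject} is a {rel.relationshipID} of {rel.sampleIDObject}"


def objects_of(
    subject: str, relationship_id: str, rows: List[SampleRelationship]
) -> List[str]:
    """All objects linked to subject by relationship_id, in row order."""
    return [
        r.sampleIDObject
        for r in rows
        if r.sampleIDSubject == subject and r.relationshipID == relationship_id
    ]
```

A sample with several parents simply appears as the subject of several rows, so sampleID can stay unique in the samples table.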

DougManuel commented 2 years ago

The approach seems to follow the proposed Graph Query Language.

MATCH (:sampleIDSubject)-[:relationshipID]->(:sampleIDObject)
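Treating each row as a directed edge, the same table can be walked like a graph. The sketch below is plain Python rather than a graph query language, and it assumes rows are `(sampleIDSubject, relationshipID, sampleIDObject)` tuples with a hypothetical `"child"` relationship value; `ancestors` is an illustrative helper, not an ODM function.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (sampleIDSubject, relationshipID, sampleIDObject)

rows: List[Triple] = [
    ("V", "child", "A"),
    ("V", "child", "B"),
    ("X", "child", "V"),
    ("Y", "child", "V"),
]


def ancestors(sample_id: str, rows: List[Triple], relationship_id: str = "child") -> Set[str]:
    """Follow subject -> object edges repeatedly to collect the full lineage."""
    found: Set[str] = set()
    stack = [sample_id]
    while stack:
        current = stack.pop()
        for subject, rel, obj in rows:
            if subject == current and rel == relationship_id and obj not in found:
                found.add(obj)
                stack.append(obj)
    return found
```

For subsample X this walks through the pooled sample V back to the original samples A and B.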