`pooled` and recording pooled samples through the Parent-Child relationship

mathew-thomson commented 2 years ago

Upon reviewing our current structure, it seems that we lost sight of how we were going to account for pooled samples in the ODM.

Rather than trying to blow up the samples table to figure out a way of capturing this information, it was proposed that we could use the boolean pooled variable as a switch for the parent-child relationships.

So for a parent sample, pooled may or may not be 1 (TRUE), but the child samples will have pooled = 0 (FALSE). This shows the direction of the parent-child relationship.

For a pooled sample, the multiple "parents" being pooled together will have pooled = 1 (TRUE), and the single sample created from pooling them will also have pooled = 1 (TRUE). However, the single "child" here will still be recorded as the parSampleID to the "parents" (recorded in the ERD as child samples).

This, while somewhat confusing, has the pooled field acting as a sort of "switch" on the directionality of parent-child relationships in the samples table.

A pooled sample can still have actual child samples as well, but these child samples would have pooled = 0 (FALSE).
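To make the "switch" concrete, here is a minimal Python sketch of how someone reading the samples table would have to interpret each row under this proposal. Only `pooled` and `parSampleID` come from the proposal; the helper name `lineage_edge`, the tuple layout, and the sample IDs (A, B, V, X) are hypothetical and only for illustration.

```python
def lineage_edge(sample_id, par_sample_id, pooled):
    """Return (source, derived) for one samples row, or None if the row has no link.

    Hypothetical helper: the meaning of parSampleID flips with the pooled flag.
    """
    if par_sample_id is None:
        return None
    if pooled:
        # Constituent of a pool: material flows from this row into parSampleID.
        return (sample_id, par_sample_id)
    # Ordinary subsample: material flows from parSampleID into this row.
    return (par_sample_id, sample_id)


# (sampleID, parSampleID, pooled) -- made-up rows: A and B are pooled into V,
# and X is a subsample of V.
rows = [
    ("A", "V", True),
    ("B", "V", True),
    ("V", None, True),
    ("X", "V", False),
]

edges = [e for e in (lineage_edge(*row) for row in rows) if e is not None]
print(edges)
```

The branch on `pooled` inside `lineage_edge` is the part that needs good documentation: the same column is read in two directions.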

This was discussed in a meeting with @jeandavidt @sorinsion @il43 and @DougManuel , but we're happy to hear any feedback or questions on this proposed structure as well.

DougManuel commented 2 years ago

The approach is robust, but technical. Good documentation and examples will help.

jeandavidt commented 2 years ago

I am wary of having an attribute (parent) mean two different things based on the value of another attribute (pooled).

To me, that jeopardizes one of the model's important features, which is the ability to be mapped onto an ontology: we are then no longer only mapping between terms that have precise definitions, but are also trying to parse business logic (instead of simple x=>y associations, we end up with branching paths such as if(x, y=>u, y=>v)). That could quickly make the model difficult to parse.

Here is an example setup:

flowchart TD
    A --> V;
    B --> V;
    V --> X;
    V --> Y;
    V --> Z;
    E[Samples] --- F;
    F[Pooled] --- G[Subsamples];

What you are suggesting, I think, is:

sampleID | parentID | pooled
-- | -- | --
A | V | T
B | V | T
V | null | T
X | V | F
Y | V | F
Z | V | F

Which merges the sample -> pooled and the pooled -> subsample links. I would instead suggest splitting up concerns like this using a new attribute ("pooledID"):

sampleID | parentID | pooledID | pooled
-- | -- | -- | --
A | null | V | false
B | null | V | false
V | null | null | true
X | V | null | false
Y | V | null | false
Z | V | null | false
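As a rough illustration of what the split buys us, the Python sketch below derives the links from the table above without ever looking at the `pooled` flag: each column has exactly one meaning. The function name `derive_links` and the link labels are made up for this example.

```python
# (sampleID, parentID, pooledID, pooled) rows from the table above.
rows = [
    ("A", None, "V", False),
    ("B", None, "V", False),
    ("V", None, None, True),
    ("X", "V", None, False),
    ("Y", "V", None, False),
    ("Z", "V", None, False),
]


def derive_links(rows):
    """Hypothetical helper: turn the two ID columns into (source, derived, kind) links."""
    links = []
    for sample_id, parent_id, pooled_id, _pooled in rows:
        if parent_id is not None:
            # parentID always means "this sample was split off from parentID".
            links.append((parent_id, sample_id, "subsample"))
        if pooled_id is not None:
            # pooledID always means "this sample was pooled into pooledID".
            links.append((sample_id, pooled_id, "pooled_into"))
    return links


links = derive_links(rows)
for link in links:
    print(link)
```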

sorinsion commented 2 years ago

I thought that the actual representation of your diagram would be:

sampleID | parentID | pooled
-- | -- | --
A | | F
B | | F
V | A | T
V | B | T
X | V | F
Y | V | F
Z | V | F

Isn't this the case?

jeandavidt commented 2 years ago

It's one way to represent it, yes. But we can't use that representation for the ODM because sampleID is a primary key, so it has to be unique (you have V in two rows there).
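A quick way to see the problem is with an in-memory SQLite table. This is only a sketch: the column types are assumptions, and the real ODM samples table has many more fields.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified, assumed schema: sampleID is the primary key, as in the ODM.
conn.execute(
    "CREATE TABLE samples ("
    " sampleID TEXT PRIMARY KEY,"
    " parentID TEXT,"
    " pooled INTEGER)"
)

conn.execute("INSERT INTO samples VALUES ('V', 'A', 1)")
try:
    # Second row for V, needed to record its second parent.
    conn.execute("INSERT INTO samples VALUES ('V', 'B', 1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    # The primary key constraint refuses a second row with sampleID = 'V'.
    duplicate_rejected = True

print(duplicate_rejected)
```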

sorinsion commented 2 years ago

In this case, I think we should make sampleID + parentID a composite key and use it as the table's primary key. This removes ambiguity about, for example, whether a measurement such as temperature refers to the initial sample or to one of its children.
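A minimal sketch of the composite key idea, using SQLite from Python. The column types are assumptions, and so is the NOT NULL on parentID (a sample without a parent would then need a placeholder such as an empty string).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed, simplified schema: the key is the (sampleID, parentID) pair.
conn.execute(
    "CREATE TABLE samples ("
    " sampleID TEXT NOT NULL,"
    " parentID TEXT NOT NULL,"
    " pooled INTEGER,"
    " PRIMARY KEY (sampleID, parentID))"
)

# V can now appear once per parent.
conn.execute("INSERT INTO samples VALUES ('V', 'A', 1)")
conn.execute("INSERT INTO samples VALUES ('V', 'B', 1)")

try:
    # The same (sampleID, parentID) pair is still rejected.
    conn.execute("INSERT INTO samples VALUES ('V', 'A', 1)")
    repeat_rejected = False
except sqlite3.IntegrityError:
    repeat_rejected = True

parents_of_v = [
    row[0]
    for row in conn.execute(
        "SELECT parentID FROM samples WHERE sampleID = 'V' ORDER BY parentID"
    )
]
print(parents_of_v, repeat_rejected)
```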

DougManuel commented 2 years ago

@jeandavidt good points. The ODM is complex, and we should opt for clarity. People will be more likely to understand the sample flow if there is both parentID and pooledID.

One aspect I find challenging is that people could be confused when samples are pooled and split at a later date. The sample data and lastUpdated will be important attributes. I suggest that we have a tutorial and explanations that include these scenarios.

DougManuel commented 2 years ago

@sorinsion I didn't quite follow your composite key suggestion. Would you mind making a database example using @jeandavidt graph example?

jeandavidt commented 2 years ago

@DougManuel About the point mentioned in #235, notice that in the last table of my example, only V (the actual pooled sample) has pooled=True. The constituent samples (A and B) are not affected.

Edit: Oh, ok, I see 🤪 The pooledID values of A and B do change, though. I can see how that could be problematic.

DougManuel commented 2 years ago

I can come up with situations where our approaches break down. We'll need to be realistic about what we should represent in V2.0 (without a table of many-to-many relationships).

The end goal in a later ODM version is probably to have a data graph. At least, that is what I see described in provenance and lineage discussions. What we'll need to do is focus on the following: 1) the most common current use cases and try to represent that clearly. 2) a solution that can be refactored into a many-to-many table or relationship or graph.

jeandavidt commented 2 years ago

There is also the issue that one could potentially make different pooled samples from different combinations of source samples. Consider this:

flowchart TD
    A --> V;
    B --> V;
    B --> U;
    C --> U;
    V --> X;
    V --> Y;
    U --> Z;

In a case like this, try as we may, I don't think we can mash all the links in the samples table without having lists in some of the fields.
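Here is a small Python sketch of why: it walks the edges of the flowchart above and finds the samples that feed more than one pooled sample. The variable names are made up for the example.

```python
from collections import defaultdict

# Edges copied from the flowchart above: (source sample, derived sample).
edges = [
    ("A", "V"), ("B", "V"),
    ("B", "U"), ("C", "U"),
    ("V", "X"), ("V", "Y"),
    ("U", "Z"),
]

sources = defaultdict(list)  # derived sample -> samples it was made from
targets = defaultdict(list)  # source sample -> samples made from it
for src, dst in edges:
    sources[dst].append(src)
    targets[src].append(dst)

# A sample built from more than one source is a pooled sample.
pooled = {sample for sample, srcs in sources.items() if len(srcs) > 1}

# Samples that feed more than one pool would need a list in a single
# pooledID column.
multi_pool = {}
for src, dsts in targets.items():
    pools = [d for d in dsts if d in pooled]
    if len(pools) > 1:
        multi_pool[src] = pools

print(sorted(pooled))
print(multi_pool)
```

B ends up pointing at both V and U, which a single-valued field in B's row cannot hold.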

I hate to bring this up because I know you don't like those, but a lookup table like the one shown here could let us create the relationships between samples and add new ones as they are created.

sorinsion commented 2 years ago

> @sorinsion I didn't quite follow your composite key suggestion. Would you mind making a database example using @jeandavidt graph example?

samplePK | sampleID | parentID | pooled
-- | -- | -- | --
A | A | | F
B | B | | F
VA | V | A | T
VB | V | B | T
WV | W | V | F
XWV | X | WV | F
YWV | Y | WV | F
ZWV | Z | WV | F

I introduced a W sampleID which has V as a parent, where V is A+B. Is this comprehensible?

sorinsion commented 2 years ago

should translate into something like this:

samplePK | sampleID | parentID | pooled
-- | -- | -- | --
A | A | | F
B | B | | F
wA | w | A | T
wB | w | B | T
Vw | V | w | F
XV | X | V | F
YV | Y | V | F
sV | s | V | T
sC | s | C | T
Zs | Z | s | F

Here I introduced a synthetic "w" and "s" pooled sampleIDs that are referred to by the physical samples.

DougManuel commented 2 years ago

@jeandavidt A look-up table or relationship table is probably the way to go. What you've outlined in your above link is quite easy to follow, IMO, and robust. Maybe it is the best solution for version 2?

Does the relationship table that you created work for a graph database? Subject, Object and direction? I like the naming of the relationship -- that adds a lot of value.

I can envision a relationship table that could work for measures, methods, samples, and sites. subjectID, objectID. subjectType, objectType.

Pasting your link in case it gets deleted in the future.

sorinsion commented 2 years ago

and this:

should translate into this:

samplePK | sampleID | parentID | pooled
-- | -- | -- | --
A | A | | F
B | B | | F
wA | w | A | T
wB | w | B | T
Vw | V | w | F
XV | X | V | F
YV | Y | V | F
C | C | | F
sB | s | B | T
sC | s | C | T
Zs | Z | s | F

sorinsion commented 2 years ago

However, in my opinion, the original sample, as an entity, is effectively destroyed when subsampling: it actually splits into a set of children, which should get their own IDs.

DougManuel commented 2 years ago

@sorinsion A common scenario for subsamples is that subsamples of a collected sample are sent to different labs, biobanks, etc., while the original sample is stored.

DougManuel commented 2 years ago

This discussion thread has clarified that the best solution is a 'relationship' table, as described by @jeandavidt. We've been trying to have a simpler solution, but we haven't found any. It feels like we've been trying to fit a square peg (a graph with many-to-many relationships and branches) into a round hole (a list or registry of samples).

Having a relationship table is the simplest, clearest approach.

If you look at the above relationship table, it is the most understandable approach. I expect that without documentation, people have a good chance of being able to draw the flow of samples, whereas this is not the case for other discussed solutions.
A downside is the generation of another table. People need to know where to look for split and pooled samples.
However, remember that we currently have only a small proportion of users that use this ODM feature. So, moving to a new table can remove fields from the main samples table.
Therefore, removing fields such as parentID and pooledID gives the samples table a clearer purpose and makes it easier for most users to understand. The sample entry should never need to be updated if there is pooling or sampling. The pooled or split sample becomes a new sample with a new entry.
The relationship table allows a greater and more explicit list of relationships: 'child', 'field duplicate', 'lab control', 'biobank'.

However, regardless of the approach, we may want to add a 'purpose' field to the sample. The purpose field would follow the same concept as the purpose field of the measures table. I have been reviewing the approach to provenance and lineage, and that review has reinforced to me that we should have the who, what, where and why of sampling and measures. 'purpose' is the 'why'.

@jeandavidt could define the relationship table. What he has looks good to me, but we'd want to double-check that it conforms (generally) to a graph database.

I suggested that a relationship table could work for other similar relationships in other tables (measures and methods). Although that is true, I don't think we should refactor those tables, at least at this time. Rather, we should have relationships for just samples, although we may want to expand the use of a relationships table beyond samples in a later ODM version.

mathew-thomson commented 2 years ago

After group discussion, it was decided that the best way to deal with this issue is to break the parent-child relationships out into a separate relationships table.

So parSampleID is removed from the samples table.

There is a new table, sampleRelationships with three main parts:

sampleIDSubject
relationshipID
sampleIDObject

Each row can be read as a sentence of the form:

[sampleIDSubject] is a [relationshipID] of [sampleIDObject]

The acceptable values for relationshipID are:

child
replicate

This allows multiple children and/or multiple parents to be specified, and expresses a unidirectional lineage for a sample. For pooled samples and replicates, the repType attribute still specifies what kind of replicate it is, and the pooled attribute is still used as a boolean indicator for pooled samples.
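As a hedged sketch of how this could look in practice, the Python snippet below builds a sampleRelationships table in an in-memory SQLite database and loads the two-pool example from earlier in the thread (A and B pooled into V, B and C pooled into U). The column types, the CHECK constraint, and the helper names `parents_of` and `children_of` are assumptions for illustration, not part of the ODM definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed schema: the three fields listed above, with relationshipID
# restricted to the two accepted values.
conn.execute(
    "CREATE TABLE sampleRelationships ("
    " sampleIDSubject TEXT NOT NULL,"
    " relationshipID TEXT NOT NULL"
    "   CHECK (relationshipID IN ('child', 'replicate')),"
    " sampleIDObject TEXT NOT NULL)"
)

# Each row reads "[subject] is a [relationship] of [object]".
rows = [
    ("V", "child", "A"), ("V", "child", "B"),
    ("U", "child", "B"), ("U", "child", "C"),
    ("X", "child", "V"), ("Y", "child", "V"),
    ("Z", "child", "U"),
]
conn.executemany("INSERT INTO sampleRelationships VALUES (?, ?, ?)", rows)


def parents_of(sample_id):
    """Hypothetical helper: samples that sample_id is a child of."""
    cur = conn.execute(
        "SELECT sampleIDObject FROM sampleRelationships"
        " WHERE sampleIDSubject = ? AND relationshipID = 'child'"
        " ORDER BY sampleIDObject",
        (sample_id,),
    )
    return [row[0] for row in cur]


def children_of(sample_id):
    """Hypothetical helper: samples that are children of sample_id."""
    cur = conn.execute(
        "SELECT sampleIDSubject FROM sampleRelationships"
        " WHERE sampleIDObject = ? AND relationshipID = 'child'"
        " ORDER BY sampleIDSubject",
        (sample_id,),
    )
    return [row[0] for row in cur]


sentences = [f"{s} is a {r} of {o}" for s, r, o in rows]
print(sentences[0])
print(parents_of("V"), children_of("B"))
```

Nothing in the samples table has to change when V or U is created: the pooled sample gets its own samples row, and the links live only in this table.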

DougManuel commented 2 years ago

The approach seems to follow the proposed Graph Query Language.

MATCH (:sampleIDSubject)-[:relationshipID]->(:sampleIDObject)

Big-Life-Lab / PHES-ODM

`pooled` and recording pooled samples through the Parent-Child relationship #236