Big-Life-Lab / PHES-ODM

The Public Health Environmental Surveillance Open Data Model (PHES-ODM, or ODM). A data model, dictionary and support tools for environmental surveillance.
Creative Commons Attribution Share Alike 4.0 International
54 stars 18 forks source link

`hasChild` Boolean Field for Parent-Child variables? #235

Closed mathew-thomson closed 1 year ago

mathew-thomson commented 2 years ago

I was in a meeting with @Bahamyirou and Nikho earlier today, and while discussing the parent-child relationships present in the datasets table and the sites table, it was mentioned that it would be useful to know if a site, sample, dataset, etc. was a parent, so that it could be queried in the parentID column. This would make it easier to find children, etc.

What I suggested as a solution to this was that there be an additional field: hasChild added to all tables with parent-child IDs, which would be a boolean indicator of:

example from the site table (note: not using real IDs here):

parSiteID SiteID dataID hasChild
dt tmu-dt tmu 0
dt uot-dt uot 0
-blank- ropec uott 0
-blank- dt -blank- 1

Curious to hear if @jeandavidt @DougManuel or @sorinsion have any qualms or issues with this suggestion. If everyone is on board, I'll go ahead with adding it to the ERD and the Parts list.

DougManuel commented 2 years ago

This makes sense to me. Is it possible to create an arrow diagram to support the discussion? It may also help to have a more concrete example like EU DEEP and country data.

jeandavidt commented 2 years ago

The worry I have with a scheme like this is that there is no enforcement mechanism. If someone takes a sample and makes a subsample out of it, will they ever think of not only adding their subsample in the DB, but also go back and edit the parent sample's data?

The functionality they are looking for could be achieved by querying the table for rows where the sampleID they are looking for matches the parent field - which they will end up having to do anyway if they aren't 100% sure that the hasChild property has been filled in correctly for 100% of samples

If there is a way to ensure data integrity then I see no problem. But relying on documentation is, in my view, unlikely to work

Bahamyirou commented 2 years ago

@jeandavidt by matches the parent field - do you mean sampleID that matches with the parent Site ID?

sorinsion commented 2 years ago

A parent/child set of columns should be sufficient to build the full "family tree" (or forest in our case) if you parse the full set - which one should do in any case to fully restore the relationships between elements. Adding extra columns like "has child" or "pooled" is not really necessary and, although it makes the table easier to read/filter by humans, it actually adds complexity both when implementing the model and by enforcing rules to be followed by those entering the data (rules that might be not always be followed, as @jeandavidt mentioned). However, considering a minimal, clear set of rules (like "all new, pooled or derived samples should be labeled with a unique sampleID" AND all samples should include their parent(s)ID, if they have one(s) - at least in the database if not on the physical sample as well, where a simple "pooled" flag might suffice), one could fully restore the "history" of a sample.

jeandavidt commented 2 years ago

@jeandavidt by matches the parent field - do you mean sampleID that matches with the parent Site ID?

I mean, if someone is doing a query for a certain set of samples matching condition A and they want to include the children of any sample that matches A as well, they can do something along the lines of (this is only pseudocode to get the point across):


WITH result AS ( SELECT * from samples WHERE (A))
SELECT * FROM result
UNION 
SELECT * from samples WHERE samples.parSampleID = result.sampleID
DougManuel commented 2 years ago

@jeandavidt I have the same concern with having to update previous records. It seems like a problem that will happen with your proposed approach in https://github.com/Big-Life-Lab/PHES-ODM/issues/236#issuecomment-1289398192 when someone pools a sample at a later date.

Bahamyirou commented 2 years ago

@jeandavidt there is no parentID in the ERD. can you clarify

jeandavidt commented 2 years ago

I edited my post.

parentID -> parSampleID

mathew-thomson commented 2 years ago

I take everyone's point here, and agree with @sorinsion that a parent and child column would allow for the reconstruction of the family tree. But the trouble becomes that there are often multiple children, and in the case of pooled samples there are multiple parents. This isn't necessarily a deal breaker, but we're tried to avoid have lists and arrays as inputs.

I also agree with @jeandavidt that its difficult to enforce these rules and that they may err on the side of being too complicated. However, we do have a validation schema and library that may be able to catch this. I also kind of think there it's possible to break the rules anywhere in the ODM if people really want to.

As far as difficulties with updating historic data if samples are pooled later one, those entries will need to be updated anyway (with the child column, or other implementations) so the hasChild field doesn't seem like a bad option to me still.

DougManuel commented 1 year ago

We decided to create a new relationships table. This approach should be robust and flexible.