handling a diamond relationship in the FHIR standard when generating features using Featuretools

kveerama commented 6 years ago

The FHIR structure has a diamond relationship. How should we handle this when generating features using DFS (Deep Feature Synthesis) as implemented in Featuretools?

kveerama commented 6 years ago

@sarahmish perhaps add your suggestions here?

sarahmish commented 6 years ago

Initially, all resources point to Reference to create relationships between "parent" entities. This is the primary reason for the diamond graph, as well as for other complications. I propose the following solutions for tackling this problem:

Resolving Reference: resolve each Reference link into a direct connection between resources. In this process, only the Reference and Identity resources are dropped.

[Figure: breaking links]

However, there are still problems with the diamond relationship (seen in red & orange).

Pros:

Preserves the overall structure of the FHIR standard.

Cons:

Nothing determines which of the orange relationships should be resolved, because each one contributes differently in its relation.
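A minimal sketch of what resolving a Reference link could look like with pandas. The tables, column names (`reference_id`, `subject`, `patient_id`), and the helper `resolve_reference` are hypothetical and only illustrate the idea; they are not Cardea's actual schema or code.

```python
import pandas as pd

# Hypothetical toy tables; all names are assumptions for illustration only.
reference = pd.DataFrame({
    "reference_id": ["r1", "r2", "r3"],
    "patient_id": ["p1", "p2", "p1"],  # the resource each Reference row points to
})
observation = pd.DataFrame({
    "observation_id": ["o1", "o2", "o3"],
    "subject": ["r1", "r2", "r3"],     # indirect link: Observation -> Reference
    "value": [7.1, 6.4, 8.0],
})


def resolve_reference(child, reference, link_col, target_col):
    """Replace child -> Reference -> parent with a direct child -> parent key."""
    lookup = reference.set_index("reference_id")[target_col]
    resolved = child.copy()
    resolved[link_col] = resolved[link_col].map(lookup)
    return resolved


observation_direct = resolve_reference(observation, reference, "subject", "patient_id")
print(observation_direct)
```

After this step, `subject` holds patient keys directly, so the Reference table is no longer needed to connect the two resources.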

Partial Denormalization: merge the child entities with their parents to simplify the standard and obtain a smaller set of relationships.

[Figure: denormalize]

Again, the red relationship still creates a diamond. Since the sets of subjects are essentially the same, breaking either of the links should not cause loss of data. I suggest one of the following:

Keep the relationship that is a property of the target_entity and drop the others.
Drop the relationships that are computationally expensive and keep the least costly one.

Pros:

Only parent resources are handled, making the feature engineering part easier.
Resolves most diamond relationships; those that remain can in most cases be solved with one of the two options above.

Cons:

Breaks the FHIR structure.
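As a rough illustration of partial denormalization, the sketch below folds a child table into its parent with pandas. The `patient` and `address` tables, their columns, and the helper `merge_child_into_parent` are hypothetical, and a one-to-one link between parent and child is assumed.

```python
import pandas as pd

# Hypothetical parent and child tables; names are assumptions for illustration.
patient = pd.DataFrame({
    "patient_id": ["p1", "p2"],
    "address": ["a1", "a2"],  # key into the child table
})
address = pd.DataFrame({
    "address_id": ["a1", "a2"],
    "city": ["Boston", "Cambridge"],
})


def merge_child_into_parent(parent, child, parent_key, child_key):
    """Copy the child's columns into the parent and drop the linking keys."""
    # Prefix child columns so their origin stays visible after the merge.
    renamed = child.add_prefix(parent_key + ".")
    right_key = parent_key + "." + child_key
    merged = parent.merge(renamed, how="left", left_on=parent_key, right_on=right_key)
    return merged.drop(columns=[parent_key, right_key])


patient_flat = merge_child_into_parent(patient, address, "address", "address_id")
print(patient_flat)
```

The child table can then be left out of the entity set, which removes its relationship from the graph.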

Denormalization: merge all of the tables together into a single table.

Pros:

Resolves all relationship problems.

Cons:

No clear answer on the order in which the tables should be merged.
Diminishes the FHIR structure.
Possibility of data loss.
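One way the merge order can matter, and one way rows can be lost, is shown in the toy example below. The tables are hypothetical, and this is only one possible failure mode, not necessarily the one observed on the FHIR data.

```python
import pandas as pd

# Hypothetical tables: patient p3 has no observations.
patient = pd.DataFrame({"patient_id": ["p1", "p2", "p3"]})
observation = pd.DataFrame({
    "observation_id": ["o1", "o2", "o3"],
    "patient_id": ["p1", "p1", "p2"],
})

# Starting from Patient keeps p3 (with a missing observation) and repeats p1.
from_patient = patient.merge(observation, how="left", on="patient_id")

# Starting from Observation silently drops p3.
from_observation = observation.merge(patient, how="left", on="patient_id")

print(len(from_patient), len(from_observation))  # 4 3
```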

sarahmish commented 6 years ago

Empirical results comparing the time taken to calculate the feature matrix:

Dataset	Method	Problem	Average Execution Time	Columns
Data A	Partial Denormalization	1	11.818	24
Data A	Resolving Reference	1	8.026	27
Data A	Partial Denormalization	2	14.866	18
Data A	Resolving Reference	2	9.652	16
Data B	Partial Denormalization	1	65.606	56
Data B	Resolving Reference	1	31.967	22
Data B	Partial Denormalization	3	46.068	81
Data B	Resolving Reference	3	93.43	115

As the table shows, neither method consistently retains the most extracted information. Implementing the "Resolving References" method to solve the diamond graph is therefore the better approach, because it preserves the FHIR structure.

sarahmish commented 6 years ago

Comparison of "Resolving References" using two methods:

MST: break the cycle using the minimum spanning tree of the graph.
Minimum Merge Cost (MMC): break the cycle by choosing the edges that have the least cost in terms of merging the dataframes.

Both options move the information of the child table to the parent table before breaking the link.
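The cycle-breaking idea can be sketched with a plain-Python version of Kruskal's algorithm. The entity names and edge weights below are made up; under MMC the weights would presumably come from the cost of merging the two dataframes, which this thread does not specify. The sketch only picks which links to break and does not move child information to the parent first.

```python
def minimum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm: return (kept, dropped) lists of (u, v, weight) edges."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    kept, dropped = [], []
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2]):
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            dropped.append((u, v, w))  # this edge would close a cycle
        else:
            parent[root_u] = root_v
            kept.append((u, v, w))
    return kept, dropped


# Hypothetical entities and relationship weights, for illustration only.
entities = ["Patient", "Encounter", "Observation", "Condition"]
relationships = [
    ("Encounter", "Patient", 1.0),
    ("Observation", "Encounter", 2.0),
    ("Observation", "Patient", 3.0),  # closes the cycle with the two edges above
    ("Condition", "Patient", 1.5),
]

kept, dropped = minimum_spanning_tree(entities, relationships)
print(dropped)  # [('Observation', 'Patient', 3.0)]
```

The relationships in `kept` form a tree, so no two entities are connected by more than one path and the diamond is gone.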

Dataset	Method	Problem	Average Execution Time	Columns
Data A	MMC	1	10.526	35
Data A	MST	1	10.228	35
Data A	MMC	2	9.314	16
Data A	MST	2	9.298	16
Data B	MMC	1	30.8	22
Data B	MST	1	50.006	49
Data B	MMC	3	85.56	116
Data B	MST	3	88.888	118

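Data A doesn't have remaining cycles after resolving the "Reference" table, so both procedures perform nearly identically. Data B performs better with the implementation of MST: it yields more columns (49 vs. 22 for problem 1, 118 vs. 116 for problem 3), although it takes longer to run.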

sarahmish commented 5 years ago

Problem resolved using MST.

MLBazaar / Cardea

handling a diamond relationship in the FHIR standard when generating features using Featuretools #6