Closed kveerama closed 5 years ago
@sarahmish perhaps add your suggestions here?
Initially, all resources point to Reference to create relationships between "parent" entities. This is the primary reason for the diamond graph, as well as other complications. I propose 2 solutions for tackling this problem:
Reference links to be a direct connection.
In this process, only the Reference and Identity resources will be dropped. However, there are still problems with the diamond relationship (seen in red & orange).
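To make "direct connection" concrete, here is a minimal row-level sketch. The column names (`id`, `entity`, `entity_id`, `subject`) and the sample rows are hypothetical and are not taken from the actual datasets. It only illustrates the idea of replacing a pointer into the Reference table with a pointer to the target row itself.

```python
def resolve_references(rows, reference_rows, field):
    """Return copies of `rows` whose `field` points at the target row directly.

    Assumes (hypothetically) that each Reference row stores the entity it
    points to and the id of the target row inside that entity.
    """
    lookup = {ref["id"]: (ref["entity"], ref["entity_id"]) for ref in reference_rows}
    resolved = []
    for row in rows:
        entity, entity_id = lookup[row[field]]
        new_row = dict(row)                  # leave the input rows untouched
        new_row[field] = entity_id           # direct link to the parent row
        new_row[field + "_entity"] = entity  # remember which entity it targets
        resolved.append(new_row)
    return resolved


# Hypothetical sample data: two references that both target Patient rows.
reference_rows = [
    {"id": "ref-1", "entity": "Patient", "entity_id": "pat-1"},
    {"id": "ref-2", "entity": "Patient", "entity_id": "pat-2"},
]
encounter_rows = [
    {"id": "enc-1", "subject": "ref-1"},
    {"id": "enc-2", "subject": "ref-2"},
    {"id": "enc-3", "subject": "ref-1"},
]

direct = resolve_references(encounter_rows, reference_rows, "subject")
```

Once every resource is rewritten this way, nothing points at the Reference table any more, so it can be dropped without losing the links.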
Pros:
Cons:
Again, the red relationship is still creating a diamond relationship.
Since the set of subjects is essentially the same, breaking either of the links should not cause loss of data. I suggest either of the following:
target_entity and drop the others.
Pros:
Cons:
Pros:
Cons:
Empirical results when comparing the computational time taken to calculate the feature matrix.
Dataset | Method | Problem | Average Execution Time | Columns |
---|---|---|---|---|
Data A | Partial Denormalization | 1 | 11.818 | 24 |
Data A | Resolving Reference | 1 | 8.026 | 27 |
Data A | Partial Denormalization | 2 | 14.866 | 18 |
Data A | Resolving Reference | 2 | 9.652 | 16 |
Data B | Partial Denormalization | 1 | 65.606 | 56 |
Data B | Resolving Reference | 1 | 31.967 | 22 |
Data B | Partial Denormalization | 3 | 46.068 | 81 |
Data B | Resolving Reference | 3 | 93.43 | 115 |
As the table shows, there is no clear answer as to which method retains the most extracted information. Implementing the "Resolving References" method to solve the diamond graph is therefore the better approach, because it preserves the FHIR structure.
Comparison of "Resolving References" using two methods, MMC and MST.
Both options move the information of the child table to the parent table before breaking the link.
Dataset | Method | Problem | Average Execution Time | Columns |
---|---|---|---|---|
Data A | MMC | 1 | 10.526 | 35 |
Data A | MST | 1 | 10.228 | 35 |
Data A | MMC | 2 | 9.314 | 16 |
Data A | MST | 2 | 9.298 | 16 |
Data B | MMC | 1 | 30.8 | 22 |
Data B | MST | 1 | 50.006 | 49 |
Data B | MMC | 3 | 85.56 | 116 |
Data B | MST | 3 | 88.888 | 118 |
Data A has no remaining cycles after resolving the "Reference" table, so both procedures perform the same. Data B performs better with MST, which retains more columns (49 vs. 22 on problem 1, 118 vs. 116 on problem 3), although its execution time is longer.
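For readers unfamiliar with the idea, the sketch below shows how a spanning tree removes the remaining cycles. It uses Kruskal's algorithm with a union-find structure. The thread does not say which algorithm or edge weights were actually used, so both are assumptions here, as are the entity names.

```python
def spanning_tree(edges):
    """Kruskal's algorithm over (weight, a, b) edges.

    Returns (kept, dropped): `kept` forms a spanning tree/forest and
    `dropped` holds every edge that would have closed a cycle.
    """
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    kept, dropped = [], []
    for weight, a, b in sorted(edges):
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            dropped.append((weight, a, b))  # a and b are already connected
        else:
            parent[root_a] = root_b
            kept.append((weight, a, b))
    return kept, dropped


# Hypothetical relationship graph; the weights are illustrative only.
edges = [
    (1, "Patient", "Encounter"),
    (1, "Encounter", "Observation"),
    (1, "Patient", "Condition"),
    (2, "Patient", "Observation"),  # closes a cycle with the edges above
]

kept, dropped = spanning_tree(edges)
```

In this example the heavier `Patient`/`Observation` link is the one that gets broken. Per the comment above, the child table's information would be moved to the parent table before that link is dropped.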
Problem resolved using MST.
The FHIR structure has a diamond relationship. How should this be handled when generating features using DFS as implemented in Featuretools?
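As a rough way to check whether an entity graph contains a diamond, the sketch below counts the distinct paths between every pair of entities and reports the pairs connected by more than one. It is plain Python and does not use the Featuretools API. The entity names are hypothetical, and the graph is assumed to be acyclic.

```python
from functools import lru_cache


def find_diamonds(children):
    """Return (ancestor, descendant) pairs joined by more than one path.

    `children` maps each parent entity to a list of its child entities.
    The graph must be acyclic, otherwise the recursion never terminates.
    """
    nodes = set(children) | {c for kids in children.values() for c in kids}

    @lru_cache(maxsize=None)
    def path_count(src, dst):
        if src == dst:
            return 1
        return sum(path_count(child, dst) for child in children.get(src, ()))

    return {
        (a, b)
        for a in nodes
        for b in nodes
        if a != b and path_count(a, b) > 1
    }


# Hypothetical diamond: Observation is reachable from Patient in two ways.
relationships = {
    "Patient": ["Encounter", "Condition"],
    "Encounter": ["Observation"],
    "Condition": ["Observation"],
}

diamonds = find_diamonds(relationships)
```

An empty result means every entity is reachable along at most one path, which is the state the solutions in this thread aim for.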