MLBazaar / Cardea

An open source automl library for using machine learning in healthcare.
https://mlbazaar.github.io/Cardea/
MIT License

handling a diamond relationship in the FHIR standard when generating features using Featuretools #6

Closed kveerama closed 5 years ago

kveerama commented 6 years ago

The FHIR structure has a diamond relationship. How should this be handled when generating features using DFS (Deep Feature Synthesis) as implemented in Featuretools?

[image]
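To make the problem concrete, here is a minimal sketch of how a diamond could be detected in a set of child-to-parent relationships: a diamond exists when one entity reaches another by more than one path. The entity names and the `RELATIONSHIPS` list are assumptions made for illustration, not Cardea's actual schema, and the helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical child -> parent relationships, loosely modelled on FHIR resources.
RELATIONSHIPS = [
    ("Observation", "Encounter"),
    ("Observation", "Patient"),
    ("Encounter", "Patient"),
    ("Encounter", "Practitioner"),
]


def count_paths(edges, source, target):
    """Count distinct directed paths from source to target (graph must be acyclic)."""
    parents = defaultdict(list)
    for child, parent in edges:
        parents[child].append(parent)

    def walk(node):
        if node == target:
            return 1
        return sum(walk(p) for p in parents[node])

    return walk(source)


def find_diamonds(edges):
    """Return (descendant, ancestor) pairs connected by more than one path."""
    nodes = sorted({name for edge in edges for name in edge})
    return [
        (a, b)
        for a in nodes
        for b in nodes
        if a != b and count_paths(edges, a, b) > 1
    ]


print(find_diamonds(RELATIONSHIPS))
```

In this toy graph, Observation reaches Patient both directly and through Encounter, which is the kind of ambiguity under discussion.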

kveerama commented 6 years ago

@sarahmish perhaps add your suggestions here?

sarahmish commented 6 years ago

Initially, all resources point to Reference to create relationships between "parent" entities. This is the primary reason for the diamond graph, as well as for other complications. I propose the following solutions for tackling this problem:

  1. Resolving Reference: turn each link through Reference into a direct connection. In this process, only the Reference and Identity resources are dropped.

[image: breaking links]

However, there are still problems with the diamond relationship (seen in red & orange).

Pros:

Cons:

  2. Partial Denormalization: merge the child entities with their parents to simplify the standard and obtain a smaller set of relationships.

[image: denormalize]

Again, the red relationship still creates a diamond. Since the sets of subjects are essentially the same, breaking either of the links should not cause loss of data. I suggest either of the following:

Pros:

Cons:

  3. Denormalization: merge all of the tables together into a single table.

Pros:

Cons:
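The first two options above can be sketched on toy data. This is only an illustration under stated assumptions: the tables (`reference`, `period`, `encounter`), their column names, and the helpers `resolve_references` and `merge_child` are all hypothetical, and the child table is assumed to hold exactly one row per parent row.

```python
# Hypothetical miniature tables; all table and column names are illustrative.
reference = [
    {"id": "ref-1", "target_id": "pat-1"},
    {"id": "ref-2", "target_id": "pat-2"},
]
period = [
    {"id": "per-1", "start": "2018-01-01", "end": "2018-01-03"},
    {"id": "per-2", "start": "2018-02-10", "end": "2018-02-11"},
]
encounter = [
    {"id": "enc-1", "subject": "ref-1", "period": "per-1"},
    {"id": "enc-2", "subject": "ref-2", "period": "per-2"},
]


def resolve_references(rows, column, reference_rows):
    """Option 1: replace an indirect Reference id with the id it points to."""
    lookup = {ref["id"]: ref["target_id"] for ref in reference_rows}
    return [{**row, column: lookup.get(row[column], row[column])} for row in rows]


def merge_child(parent_rows, column, child_rows, prefix):
    """Option 2: fold a one-to-one child table into its parent table."""
    children = {child["id"]: child for child in child_rows}
    merged = []
    for row in parent_rows:
        child = children.get(row[column], {})
        # Drop the foreign-key column and copy the child's fields across.
        new_row = {key: value for key, value in row.items() if key != column}
        for key, value in child.items():
            if key != "id":
                new_row[f"{prefix}_{key}"] = value
        merged.append(new_row)
    return merged


resolved = resolve_references(encounter, "subject", reference)
flattened = merge_child(resolved, "period", period, prefix="period")
print(flattened[0])
```

After both steps the encounter rows point straight at a patient id and carry the period fields themselves, so neither the `reference` nor the `period` table is needed as a separate entity.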

sarahmish commented 6 years ago

Empirical results comparing the computation time taken to calculate the feature matrix:

| Dataset | Method | Problem | Average Execution Time | Columns |
|---|---|---|---|---|
| Data A | Partial Denormalization | 1 | 11.818 | 24 |
| Data A | Resolving Reference | 1 | 8.026 | 27 |
| Data A | Partial Denormalization | 2 | 14.866 | 18 |
| Data A | Resolving Reference | 2 | 9.652 | 16 |
| Data B | Partial Denormalization | 1 | 65.606 | 56 |
| Data B | Resolving Reference | 1 | 31.967 | 22 |
| Data B | Partial Denormalization | 3 | 46.068 | 81 |
| Data B | Resolving Reference | 3 | 93.43 | 115 |

As the table shows, there is no clear answer as to which method retains the most extracted information. Implementing the "Resolving References" method to solve the diamond graph is therefore the better approach, because it preserves the FHIR structure.

sarahmish commented 6 years ago

Comparison of "Resolving References" using two methods, MMC and MST.

Both options move the information of the child table to the parent table before breaking the link.

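| Dataset | Method | Problem | Average Execution Time | Columns |
|---|---|---|---|---|
| Data A | MMC | 1 | 10.526 | 35 |
| Data A | MST | 1 | 10.228 | 35 |
| Data A | MMC | 2 | 9.314 | 16 |
| Data A | MST | 2 | 9.298 | 16 |
| Data B | MMC | 1 | 30.8 | 22 |
| Data B | MST | 1 | 50.006 | 49 |
| Data B | MMC | 3 | 85.56 | 116 |
| Data B | MST | 3 | 88.888 | 118 |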

Data A has no remaining cycles after resolving the "Reference" table, so both procedures perform the same. Data B performs better with MST in terms of columns retained (49 vs. 22 for Problem 1, 118 vs. 116 for Problem 3), although its execution time is longer.
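For readers unfamiliar with the idea, here is a rough sketch of breaking cycles with a spanning tree. It assumes that "MST" refers to a minimum spanning tree over the entity relationship graph, which the thread does not spell out. The edge weights, the entity names and the `spanning_tree` helper are made up for illustration and are not Cardea's implementation.

```python
def spanning_tree(edges):
    """Kruskal's algorithm: split undirected weighted edges into (kept, dropped)."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    kept, dropped = [], []
    for a, b, weight in sorted(edges, key=lambda edge: edge[2]):
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            # Both ends are already connected, so this edge would close a cycle.
            dropped.append((a, b, weight))
        else:
            parent[root_a] = root_b
            kept.append((a, b, weight))
    return kept, dropped


# Hypothetical relationships; the weights are arbitrary and only set the priority.
EDGES = [
    ("Observation", "Encounter", 1),
    ("Encounter", "Patient", 1),
    ("Observation", "Patient", 2),
    ("Encounter", "Practitioner", 1),
]

kept, dropped = spanning_tree(EDGES)
print("links to break:", dropped)
```

The dropped edges are the links that would be broken; per the comment above, the child table's information would be moved to the parent before doing so.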

sarahmish commented 5 years ago

Problem resolved using MST.