COOL-cohort / COOL

the source code of the COOL system
https://www.comp.nus.edu.sg/~dbsystem/cool/
Apache License 2.0
45 stars 16 forks source link

Missing entrires in results of query1-0.json #24

Closed liuchangshiye closed 2 years ago

liuchangshiye commented 2 years ago

After executing the command with query1-0.json to obtain the specific users from the health dataset, COOL shows 8513 users. But the correct number of users should be 8592. The user ids that start with "P-19709" are all missed. The missed user ids are as follows.

 P-19709, P-19712, P-19713, P-19718, P-19720, P-19721, P-19732, P-19749, P-19762, P-19767,
 P-19772, P-19773, P-19776, P-19777, P-19778, P-19780, P-19782, P-19789, P-19790, P-19791,
 P-19797, P-19798, P-19799, P-19801, P-19803, P-19805, P-19806, P-19809, P-19812, P-19813,
 P-19820, P-19830, P-19834, P-19841, P-19842, P-19843, P-19844, P-19846, P-19860, P-19861,
 P-19864, P-19872, P-19874, P-19879, P-19882, P-19887, P-19890, P-19893, P-19901, P-19902,
 P-19907, P-19909, P-19910, P-19911, P-19915, P-19919, P-19923, P-19929, P-19931, P-19941,
 P-19944, P-19948, P-19950, P-19952, P-19954, P-19959, P-19962, P-19964, P-19972, P-19973,
 P-19975, P-19978, P-19983, P-19985, P-19989, P-19991, P-19992, P-19994, P-19998
hugy718 commented 2 years ago

Found the bug. It is due to incorrect dealing with time window, when birth time is determined in getUserBirthTime in the ExtendedCohortSelection class. maxDate of this class is the max date in dataset. The date range for locating birth event in the original code needs to fall in the date of the user's first record and the max date - window size. This creates a problem for tail users that have their records very close to the max date. So their ranges to find birth event are empty ranges. Now I set the range to look for birth event for one user to be from the date of its first record to the date of its last record. It can output all the users.

KimballCai commented 2 years ago

I agree with the reason that causes this issue, but I do not agree to change the code because it is not the problem of the core logic. Instead, we need to add introductions about how to utilize the time window attribute.

KimballCai commented 2 years ago

If we want to use the time window, it means that we only want to select users who have records for at least x days where x is defined in the time window.

KimballCai commented 2 years ago

But we also need to consider the users whose last record date is earlier than the max data in the data but do not have records for continuous x days.