Revisit the data structures used to represent a cohort

hugy718 commented 2 years ago

After merging PR #72, we should revisit how a cohort is stored. There may be a better approach than passing a list of global ids to subsequent processing.

Supporting append relaxes the assumption that user keys increase monotonically within a cube, which breaks every analysis that takes a cohort as input. The problem is that, during processing, the cohort's input vector is scanned only once, from the smallest to the largest user key id, with no backtracking.
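To make the failure concrete, here is a minimal sketch of a forward-only scan, assuming the cohort is a sorted list of ids and a block is a list of user ids in storage order. The function name `single_pass_match` and the sample data are hypothetical and do not come from the COOL code base.

```python
def single_pass_match(block_user_ids, cohort_ids):
    """Forward-only scan: the cohort cursor never moves backwards."""
    matched = []
    cursor = 0
    for uid in block_user_ids:
        # Skip cohort ids smaller than the current user id.
        while cursor < len(cohort_ids) and cohort_ids[cursor] < uid:
            cursor += 1
        if cursor < len(cohort_ids) and cohort_ids[cursor] == uid:
            matched.append(uid)
    return matched


cohort = [2, 5, 9]
sorted_block = [1, 2, 3, 5, 9]   # monotonic user ids: every member is found
appended_block = [1, 5, 9, 2]    # id 2 arrives late, as after an append

print(single_pass_match(sorted_block, cohort))    # [2, 5, 9]
print(single_pass_match(appended_block, cohort))  # [5, 9], user 2 is missed
```

Once the cursor has moved past id 2, a later occurrence of that user in an appended block can no longer match.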

The cohort representation is also related to PR #75. We no longer plan to maintain a consistent global id across cublets, so a list of ids from a cube can no longer represent a cohort.

One possible direction is to store the user key values directly and translate them through the metachunk of each cublet before processing. The same would apply to appended blocks.

Zrealshadow commented 2 years ago

Actually, the new cohort processing engine currently cannot process queries that take a cohort result as input. Maybe we should discuss how to implement this at the next meeting and set a reasonable standard given the changes to global ids.

hugy718 commented 2 years ago

Update: the cohort will be persisted in a new format that stores the result cublet-wise, since all processing is done cublet by cublet. This format will use cublet-level ids for the user keys.
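A rough sketch of what a cublet-wise layout could look like, assuming JSON as the container and one sorted id list per cublet. The field names, the `cube_version` tag, and both helper functions are illustrative assumptions, not the format actually implemented.

```python
import json


def persist_cohort(cohort_name, cube_version, per_cublet_ids):
    """Serialize a cohort as one sorted list of cublet-level ids per cublet."""
    record = {
        "cohort": cohort_name,
        "cube_version": cube_version,  # assumed field: ties the cohort to a cube version
        "cublets": {
            name: sorted(set(ids)) for name, ids in per_cublet_ids.items()
        },
    }
    return json.dumps(record, sort_keys=True)


def ids_for_cublet(serialized, cublet_name):
    """Return the cublet-level ids for one cublet, or [] if it has no members."""
    record = json.loads(serialized)
    return record["cublets"].get(cublet_name, [])


blob = persist_cohort("active_users", 3, {"cublet_0": [4, 1, 4], "cublet_1": [0]})
print(ids_for_cublet(blob, "cublet_0"))  # [1, 4]
```

With this shape, processing a cublet only needs that cublet's id list, and ids are never compared across cublets.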

Dataset updates will not be addressed by the cohort representation. Each cohort is tied to a specific version of a cube. Hence, the problem should be addressed in processing: consume the cohort of the old version together with the actual data, identify the diff against the new version, and generate a new (updated) cohort.
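One way the diff-based update could work, sketched under the assumption that each cube version can be viewed as a mapping from user key to that user's records and that cohort membership is decided by a predicate over those records. `update_cohort`, `qualifies`, and the sample data are hypothetical.

```python
def update_cohort(old_cohort, old_data, new_data, qualifies):
    """Derive the cohort for the new cube version from the old cohort plus a diff."""
    # Users whose records are new or differ between the two versions.
    changed = {u for u in new_data if old_data.get(u) != new_data[u]}
    # Users that no longer exist in the new version.
    removed = set(old_data) - set(new_data)
    # Unchanged members keep their membership without re-evaluation.
    kept = set(old_cohort) - changed - removed
    # Only changed users are re-evaluated against the cohort condition.
    re_evaluated = {u for u in changed if qualifies(new_data[u])}
    return kept | re_evaluated


old_data = {"a": [2], "b": [1], "c": [3]}
new_data = {"a": [2], "b": [1, 1], "d": [0]}
old_cohort = {"a", "c"}  # users whose records sum to at least 2 in the old version

updated = update_cohort(old_cohort, old_data, new_data, lambda ev: sum(ev) >= 2)
print(sorted(updated))  # ['a', 'b']
```

Only users touched by the update are re-evaluated, so the cost scales with the size of the diff rather than with the whole cube.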

KimballCai commented 2 years ago

Update: we finally decided to store cohorts as strings of user names. When loading a cohort, we translate the strings into global ids, which may differ across cublets. Fixed in PR https://github.com/COOL-cohort/COOL/pull/105
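For illustration only, the loading step could look like the sketch below. It assumes each cublet exposes a dictionary from user name to its own id; `load_cohort` and the dictionaries are made up for the example and do not mirror the code in the PR.

```python
def load_cohort(user_names, cublet_dictionaries):
    """Translate stored user names into ids, separately for every cublet."""
    resolved = {}
    for cublet_name, dictionary in cublet_dictionaries.items():
        # Users absent from a cublet are skipped for that cublet.
        ids = sorted(dictionary[name] for name in user_names if name in dictionary)
        resolved[cublet_name] = ids
    return resolved


dictionaries = {
    "cublet_0": {"alice": 0, "bob": 1},
    "cublet_1": {"bob": 0, "carol": 1},
}
print(load_cohort(["bob", "carol"], dictionaries))
# {'cublet_0': [1], 'cublet_1': [0, 1]}
```

Here "bob" resolves to id 1 in one cublet and id 0 in the other, which is why the persisted form keeps the names rather than the ids.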

Revisit the data structures used to represent a cohort #76