COOL-cohort / COOL

the source code of the COOL system
https://www.comp.nus.edu.sg/~dbsystem/cool/
Apache License 2.0

Rebuild cohort processing #72

Closed · KimballCai closed this 2 years ago

KimballCai commented 2 years ago

Please update the details here when reviewing the code. (Issues to be addressed.)

Create an issue and submit a corresponding PR for each one.

  1. Add logic to check the meta chunk to speed up query processing (see the range-check sketch after this list).
  2. Remove the aggregation logic from ValueSelection.
    • ValueSelection only handles filtering (it can be merged with EventSelection).
    • RetUnit will encapsulate the logic of updating statistics (discussed below).
  3. Add a factory class to handle FieldRS creation for better encapsulation (see the factory sketch after this list).
  4. Rework the ProjectedTuple (see the immutable-tuple sketch after this list).
    • The existing implementation mimics the old logic, which introduces additional indirection.
    • Keep it simple: let its producer decide which indexed value to retrieve from it.
    • Make it immutable; avoid loadAttr calls that mutate internal data.
    • Handle the special fields (user id and action time) separately.
  5. Let MetaFieldRS create a value converter/translator for extensibility and polymorphism (see the converter sketch after this list).
    • We currently have only two field types, but we can expect more.
    • The MetaFieldRS translates the token stored in the data chunk (currently the gid for strings) into the actual value; this is a no-op for range fields.
  6. Augment RetUnit (see the aggregator sketch after this list).
    • Perhaps we should rename this class.
    • It will contain additional variables for max, min, etc. (currently only counts).
    • One solution is to keep a list of variables together with a list of functions (aggregators), each taking in a new value and updating its corresponding variable.
  7. Add documentation: DataHashFieldRS needs descriptions of its assumptions, namely that all input vectors used there have efficient get-by-index implementations (Zint, BitVector, ZintBitInput).
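
A possible shape for the meta-chunk check in point 1, assuming the meta chunk records a min/max value range per field; all names below are illustrative, not the current code:

```java
// Hypothetical pruning check: skip a whole data chunk when its meta
// range cannot overlap the query predicate.
public final class ChunkPruneSketch {

  /** True if a chunk whose values lie in [chunkMin, chunkMax] may match [lo, hi]. */
  static boolean mayContain(int chunkMin, int chunkMax, int lo, int hi) {
    return chunkMax >= lo && chunkMin <= hi;  // simple interval-overlap test
  }

  public static void main(String[] args) {
    System.out.println(mayContain(10, 20, 25, 30));  // false -> chunk can be skipped
    System.out.println(mayContain(10, 20, 15, 30));  // true  -> chunk must be scanned
  }
}
```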
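
For point 3, a minimal factory sketch; the FieldType values and the concrete FieldRS subclasses here are placeholders, not the existing classes:

```java
// Hypothetical factory centralizing FieldRS creation so callers never
// instantiate concrete classes directly.
public final class FieldRSFactory {

  enum FieldType { HASH, RANGE }

  interface FieldRS { /* read-side field accessors */ }
  static final class HashFieldRS implements FieldRS {}
  static final class RangeFieldRS implements FieldRS {}

  private FieldRSFactory() {}  // static factory only

  static FieldRS create(FieldType type) {
    switch (type) {
      case HASH:  return new HashFieldRS();
      case RANGE: return new RangeFieldRS();
      default:    throw new IllegalArgumentException("unknown type: " + type);
    }
  }
}
```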
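
For point 4, one way the immutable ProjectedTuple could look; the field layout and accessor names are assumptions:

```java
// Sketch of an immutable ProjectedTuple: all values are fixed at
// construction, the producer decides which index holds which field, and
// the special fields (user id, action time) get dedicated accessors.
public final class ProjectedTuple {

  private final String userId;     // special field: user id
  private final long actionTime;   // special field: action time
  private final Object[] values;   // index-addressed, producer-defined layout

  public ProjectedTuple(String userId, long actionTime, Object[] values) {
    this.userId = userId;
    this.actionTime = actionTime;
    this.values = values.clone();  // defensive copy: no later mutation
  }

  public String userId() { return userId; }
  public long actionTime() { return actionTime; }
  public Object value(int i) { return values[i]; }  // no loadAttr-style state change
}
```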
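
For point 5, a sketch of the converter idea: the MetaFieldRS would hand out a converter, so adding a field type only means adding a converter class. Interface and class names are made up:

```java
// Hypothetical value converter: hash fields translate a gid through the
// meta chunk's dictionary; range fields pass the token through unchanged.
public interface ValueConverter {
  Object convert(int token);
}

final class HashValueConverter implements ValueConverter {
  private final String[] gidToValue;  // gid-indexed dictionary from the meta chunk

  HashValueConverter(String[] gidToValue) { this.gidToValue = gidToValue; }

  @Override
  public Object convert(int token) { return gidToValue[token]; }
}

final class RangeValueConverter implements ValueConverter {
  @Override
  public Object convert(int token) { return token; }  // no-op for range fields
}
```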
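
And for point 6, a sketch of the variables-plus-aggregators idea; the constructor and update contract are assumptions:

```java
import java.util.List;
import java.util.function.BiFunction;

// Hypothetical augmented RetUnit: one statistic per slot, each paired
// with an aggregator that folds a new value into that slot.
public final class RetUnit {

  private final float[] stats;                               // e.g. [count, max]
  private final List<BiFunction<Float, Float, Float>> aggs;  // one aggregator per slot

  public RetUnit(float[] initial, List<BiFunction<Float, Float, Float>> aggs) {
    this.stats = initial.clone();
    this.aggs = aggs;
  }

  public void update(float newValue) {
    for (int i = 0; i < stats.length; i++) {
      stats[i] = aggs.get(i).apply(stats[i], newValue);      // fold value into slot i
    }
  }

  public float get(int i) { return stats[i]; }

  public static void main(String[] args) {
    RetUnit unit = new RetUnit(
        new float[] {0f, Float.NEGATIVE_INFINITY},
        List.of((acc, v) -> acc + 1, Math::max));            // count and max
    unit.update(3f);
    unit.update(7f);
    System.out.println(unit.get(0) + " " + unit.get(1));     // 2.0 7.0
  }
}
```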

Zrealshadow commented 2 years ago

The program now runs, but there are a few problems to unify and discuss.

For the health data, the input in CohortProcessorTest is the raw dataset, which has about 270,000 records; the other unit tests take a sample of the full dataset as input. To avoid ambiguity, I alias the raw dataset as health_raw.

For the ecommerce data in CohortProcessorTest, I have to pre-process the raw data before the cohort engine can analyze it. I store the processed dataset as ecommerce_query.

There are now six datasets under the datasets directory. Maybe we should merge ecommerce with ecommerce_query, and health with health_raw.

This problem relates to the earlier issue https://github.com/COOL-cohort/COOL/issues/41#issue-1222199462: we should unify the format of the data involved and set a standard.

hugy718 commented 2 years ago

BTW, please remove commented-out code (including // TODO Auto-generated method stub), improper spacing, and redundant imports.

NLGithubWP commented 2 years ago

This PR makes the following changes:

  1. For tests and datasets:
    • Rename datasetSource to CubeRepo.
  2. For the schema:
    • Add the Set data type as an indicator.
  3. For readStore:
    • Add a new method to HashMetaFieldRS that supports retrieving all values of a field.
    • Abstract the read-field logic into the FieldRS interface as a static class.
  4. Cohort processing logic: the new flow follows these steps (see the loop sketch after this list):
    • Traverse the query.json file and collect all required fields.
    • Retrieve all values of those fields into memory.
    • Pre-define a type-agnostic list, projectTuple, which stores one record's information.
    • For each row:
      • Update the row's information into projectTuple; all subsequent logic takes it as input.
      • Check whether the user meets the birthSelection requirements by updating and checking the context.
      • If the user is birthed, calculate the record's age, determine which cohort the record belongs to, and finally update the query result according to the function defined in valueSelector.
  5. Besides, the PR also redefines the aggregator, filter, and selector based on the new input projection, projectTuple (see the hierarchy sketch after this list).
    • For filters, the PR provides basic filter logic for set, range, and type.
    • For cohort selectors, the PR provides selectors for both set and range, each extending the corresponding filter while implementing the cohort selector interface.
  6. Finally, all intermediate data structures used in the above processing logic are defined in the storage folder, and some time-related functions are defined in utils.
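
A self-contained toy version of the per-row flow in point 4. The real classes in this PR differ; this only shows the control flow (birth check, age, cohort, aggregate) with made-up names and a SUM value selector:

```java
import java.util.HashMap;
import java.util.Map;

// Toy per-row cohort loop: "birth" a user on the first event, compute the
// record's age in days relative to birth, and aggregate a metric per
// (cohort, age) cell. Everything here is illustrative.
public final class CohortLoopSketch {

  record Row(String userId, long actionTime, float metric, String cohortKey) {}

  public static void main(String[] args) {
    Row[] rows = {
        new Row("u1", 0L, 5f, "A"),
        new Row("u1", 86_400L, 7f, "A"),   // one day after u1's birth event
        new Row("u2", 0L, 3f, "B"),
    };

    Map<String, Long> birthTime = new HashMap<>();           // user -> birth time
    Map<String, Map<Long, Float>> result = new HashMap<>();  // cohort -> age -> sum

    for (Row r : rows) {
      birthTime.putIfAbsent(r.userId(), r.actionTime());     // birthSelection stand-in
      long age = (r.actionTime() - birthTime.get(r.userId())) / 86_400L;
      result.computeIfAbsent(r.cohortKey(), k -> new HashMap<>())
            .merge(age, r.metric(), Float::sum);             // valueSelector stand-in
    }
    System.out.println(result);  // e.g. {A={0=5.0, 1=7.0}, B={0=3.0}}
  }
}
```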
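
And a possible shape for the filter/selector hierarchy in point 5, with invented names:

```java
import java.util.Set;

// Hypothetical hierarchy: a cohort selector reuses a filter's matching
// logic while also implementing the cohort selector interface.
public final class SelectorSketch {

  interface Filter { boolean accept(String value); }

  interface CohortSelector { String selectCohort(String value); }

  static class SetFilter implements Filter {
    private final Set<String> accepted;
    SetFilter(Set<String> accepted) { this.accepted = accepted; }
    @Override public boolean accept(String value) { return accepted.contains(value); }
  }

  static final class SetCohortSelector extends SetFilter implements CohortSelector {
    SetCohortSelector(Set<String> accepted) { super(accepted); }
    @Override public String selectCohort(String value) {
      return accept(value) ? value : null;  // matched value names the cohort, else none
    }
  }

  public static void main(String[] args) {
    CohortSelector s = new SetCohortSelector(Set.of("US", "SG"));
    System.out.println(s.selectCohort("SG"));  // SG
    System.out.println(s.selectCohort("FR"));  // null
  }
}
```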

Overall, the structure and logic are clear.

Currently, for each query, the system loads all related fields' values into memory, and they are not released after the query finishes.

Although this can speed up subsequent queries, we cannot predict the access frequency of each field, since there is no workload information.

If the system runs as a service, all fields will eventually be loaded into memory. This may be inefficient.

Would it be better to release all cached data after a query finishes? One possible shape is sketched below.
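
A toy sketch of that per-query release policy, with made-up names (the real FieldRS loading path is more involved):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy per-query cache: field values live only for the duration of one
// query and are dropped in finally. loadField/runQuery are placeholders.
public final class PerQueryCacheSketch {

  static int[] loadField(String field) {  // stand-in for reading a FieldRS
    return new int[] {1, 2, 3};
  }

  static long runQuery(List<String> requiredFields) {
    Map<String, int[]> cache = new HashMap<>();
    try {
      for (String f : requiredFields) {
        cache.put(f, loadField(f));       // load only what this query needs
      }
      long sum = 0;
      for (int[] values : cache.values()) {
        for (int v : values) sum += v;    // placeholder "query" work
      }
      return sum;
    } finally {
      cache.clear();                      // release all cached data afterwards
    }
  }

  public static void main(String[] args) {
    System.out.println(runQuery(List.of("country", "action")));  // 12
  }
}
```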

Zrealshadow commented 2 years ago

Thanks for the quick action. The decoupling makes it much cleaner; I have no further comments. Please rebase against dev to exclude the OLAP-related commits from this branch. That makes it easier for others to review and better for future reference.

My suggestion is to merge directly and resolve these conflicts (which are not part of the main logic of this PR), since there are so many commits that errors may occur while rebasing all of them.

KimballCai commented 2 years ago

I have checked the code and can run it successfully.

Zrealshadow commented 2 years ago

Moved this remaining follow-up work to issue #83.