Closed KimballCai closed 2 years ago
Please update the details here when reviewing the code.
The program now runs. However, there are some problems to unify and discuss.
For the health data, the input in CohortProcessorTest is the raw data, which has about 270,000 (27w) data items. In the other unit tests, the input dataset is a sample of the full dataset. To avoid ambiguity, I aliased the raw dataset as health_raw.
For the ecommerce data, in CohortProcessorTest I have to pre-process the raw data before the cohort engine can analyze it. I store the processed dataset in ecommerce_query.
There are now six datasets under the datasets directory. Maybe we should merge ecommerce with ecommerce_query, and health with health_raw.
This problem relates to the previous issue https://github.com/COOL-cohort/COOL/issues/41#issue-1222199462. We should unify the format of the data involved and set a standard.
BTW, please remove commented-out code (including // TODO Auto-generated method stub), improper spacing, and redundant imports.
This PR has made the following changes:
datasetSource to CubeRepo
Set data type as an indicator.
HashMetaFieldRS supporting retrieving all values for each field.
Readfield logic into FieldRs interface as a static class.
projectTuple, it stores a record's information.
projectTuple. All following logic accepts it as input.
aggregator, filter, and selector based on the new input projection, projectTuple
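To make the idea of a shared input type concrete, here is a minimal sketch of how a filter and an aggregator might both consume one tuple type. Every name in it (ProjectedTuple, TupleFilter, CountAggregator, TuplePipeline) is hypothetical and chosen for illustration only; the actual projectTuple, filter, aggregator, and selector classes in this PR may look quite different.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for projectTuple: holds one record's projected field values.
final class ProjectedTuple {
    private final Map<String, Integer> schema = new HashMap<>(); // field name -> position
    private final Object[] values;

    ProjectedTuple(String[] fields, Object[] values) {
        if (fields.length != values.length) {
            throw new IllegalArgumentException("fields and values must have the same length");
        }
        for (int i = 0; i < fields.length; i++) {
            schema.put(fields[i], i);
        }
        this.values = values.clone();
    }

    // Look up a value by field name; unknown fields are rejected.
    Object get(String field) {
        Integer idx = schema.get(field);
        if (idx == null) {
            throw new IllegalArgumentException("unknown field: " + field);
        }
        return values[idx];
    }
}

// Hypothetical filter contract: decides per tuple whether it passes.
interface TupleFilter {
    boolean accept(ProjectedTuple tuple);
}

// Hypothetical aggregator: counts the tuples it is given.
final class CountAggregator {
    private long count;

    void add(ProjectedTuple tuple) {
        count++;
    }

    long result() {
        return count;
    }
}

// Wires filter and aggregator together; both only ever see ProjectedTuple.
final class TuplePipeline {
    static long countMatching(List<ProjectedTuple> tuples, TupleFilter filter) {
        CountAggregator aggregator = new CountAggregator();
        for (ProjectedTuple tuple : tuples) {
            if (filter.accept(tuple)) {
                aggregator.add(tuple);
            }
        }
        return aggregator.result();
    }
}
```

With a single input type, each stage can be tested in isolation by constructing tuples by hand, without going through the storage readers.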
Overall, the structure and logic are clear.
Currently, for each query, the system loads the values of all related fields into memory and does not release them after the query finishes.
Although this can speed up later queries, we cannot predict how often each field will be accessed, since there is no workload.
If the system runs as a service, all fields will eventually be loaded into memory. This may be inefficient.
Is it better to delete all cached data after finishing a query?
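One way to picture the "delete after each query" option is a cache whose lifetime is tied to a single query. The sketch below is only an illustration under assumptions: QueryFieldCache, its loader function, and the int[] value type are all hypothetical and are not part of the COOL codebase.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical query-scoped cache: field values are kept only while one query runs.
final class QueryFieldCache implements AutoCloseable {
    private final Map<String, int[]> loaded = new HashMap<>();
    private final Function<String, int[]> loader; // assumed to read a field's values from storage

    QueryFieldCache(Function<String, int[]> loader) {
        this.loader = loader;
    }

    // Load a field on first access, then reuse it for the rest of the query.
    int[] get(String field) {
        return loaded.computeIfAbsent(field, loader);
    }

    int size() {
        return loaded.size();
    }

    // Drop every cached field so the memory can be reclaimed once the query ends.
    @Override
    public void close() {
        loaded.clear();
    }
}
```

Used in a try-with-resources block, the cache is emptied as soon as the query finishes, even if the query throws. A middle ground between this and keeping everything would be a bounded cache that evicts the least recently used fields, which keeps hot fields in memory without letting all fields accumulate.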
Thanks for the quick action. The decoupling made it much cleaner. I have no further comments. Please rebase against dev. That would exclude the OLAP-related commits from this branch, making it easier for others to view and for future reference.
My suggestion is to merge directly and resolve these conflicts (which are not part of the main logic of this PR). Since there are so many commits, errors may occur while rebasing all of them.
I have checked the code and can run it successfully.
Move this follow-up work to issue #83.
Please update the details here when reviewing the code. (issues to be addressed)
Create an issue and submit a corresponding PR for each one.
loadattr that mutates internal data.