4paradigm / OpenMLDB

OpenMLDB is an open-source machine learning database that provides a feature platform computing consistent features for training and inference.
https://openmldb.ai
Apache License 2.0

segment gc when ts_cnt > 1 #3935

Open vagetablechicken opened 6 months ago

vagetablechicken commented 6 months ago

https://github.com/4paradigm/OpenMLDB/blob/21184d56251cd96088d787dfdb32527c84c78467/src/storage/segment.cc#L437-L458

Ref: https://utqcxc5xn1.feishu.cn/docx/FTbtdV25eoZDkjxODpCc44qhnlc . Suppose we have a table with indexes on the same key but with different ts columns, e.g.

CREATE TABLE talkingdata(
    ip int,app int,device int,os int,channel int,click_time timestamp,attributed_time timestamp,is_attributed int,
    index(key=(ip), ts=click_time, ttl=1s, ttl_type=absolute),
    index(key=(ip), ts=attributed_time),
    index(key=(app,os), ts=click_time)
);

index0 and index1 will be in the same segment with ts_cnt_==2, so segment gc will trigger GcAllType, which uses the wrong expire time.

When ts_cnt_<=1, ExecuteGc calculates the expire time before doing gc: https://github.com/4paradigm/OpenMLDB/blob/21184d56251cd96088d787dfdb32527c84c78467/src/storage/segment.cc#L398-L408

But GcAllType doesn't. It does gc with a very small time: the raw ttl value rather than the expire time (e.g. with ttl=1m, the time value falls on 1970-01-01). Normally every row has ts > this small time, so no row is ever gc'ed and the data never expires. You can check this with show table status.
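
To make the difference concrete, here is a minimal Python sketch of the two thresholds. It is not the code in segment.cc: the function names, the fixed "now", and the assumption that a row expires when its ts is strictly below the gc time are all hypothetical, chosen only to illustrate the behavior described above for an absolute ttl.

```python
from datetime import datetime, timezone

# Hypothetical helpers for illustration only; not OpenMLDB's actual API.

def expire_time_ms(now_ms: int, ttl_ms: int) -> int:
    """Expected behavior: turn a ttl into an absolute expire time."""
    return now_ms - ttl_ms

def is_expired(row_ts_ms: int, gc_time_ms: int) -> bool:
    """Assumption: a row is collected when its ts is below the gc time."""
    return row_ts_ms < gc_time_ms

# Fixed "now" so the example is deterministic.
NOW_MS = int(datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)
TTL_MS = 60 * 1000                     # ttl=1m
ROW_TS_MS = NOW_MS - 60 * 60 * 1000    # a row written one hour ago

# Single-ts style path: gc against now - ttl, so the old row is collected.
correct_gc_time = expire_time_ms(NOW_MS, TTL_MS)
collected_with_expire_time = is_expired(ROW_TS_MS, correct_gc_time)

# Multi-ts style path as described in this issue: the raw ttl is used
# as if it were a timestamp, which lands on 1970-01-01.
wrong_gc_time = TTL_MS
wrong_gc_date = datetime.fromtimestamp(wrong_gc_time / 1000, tz=timezone.utc).date()
collected_with_raw_ttl = is_expired(ROW_TS_MS, wrong_gc_time)

print(collected_with_expire_time, collected_with_raw_ttl, wrong_gc_date)
```

With any realistic row ts, the second comparison never marks the row as expired, which matches the "data never expires" symptom.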