morningman opened 4 years ago
According to:

> We can use cumulative_compaction_skip_window_seconds to configure this threshold. That is, when the threshold time is exceeded, latest_read_version will be ignored and the version merge will start.

it sounds like the compaction behavior is the same even after adding latest_read_version, e.g.:

- previously, if cumulative_compaction_skip_window_seconds = 30, compaction is done every 30s
- now, with cumulative_compaction_skip_window_seconds = 30 plus latest_read_version, the system will still do compaction every 30s and latest_read_version will be ignored

right?
+1. I have the same question.
> now, with cumulative_compaction_skip_window_seconds = 30 plus latest_read_version, the system will still do compaction every 30s and latest_read_version will be ignored

No. With latest_read_version, the compaction may be more frequent. For example: in the old way, version 12 can only be merged after 10:00:34, but in the new way, version 12 can already be merged after 10:00:05.
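A small worked version of this example (a sketch only; the creation time of 10:00:04 and the read at 10:00:05 are inferred from the timestamps above, assuming the default 30-second skip window):

```cpp
#include <cstdint>
#include <iostream>

int main() {
    // All times are seconds after 10:00:00, purely illustrative.
    const int64_t creation_time = 4;         // assumed: rowset of version 12 written at 10:00:04
    const int64_t skip_window_seconds = 30;  // cumulative_compaction_skip_window_seconds (default)
    const int64_t latest_read_version = 12;  // assumed: a query read up to version 12 at 10:00:05
    const int64_t end_version = 12;

    // Old rule: a rowset only becomes mergeable once it has existed for the whole skip window.
    std::cout << "old way: mergeable at 10:00:" << creation_time + skip_window_seconds << "\n";

    // New rule: a rowset whose end_version has already been read can be merged right away,
    // regardless of the skip window.
    if (end_version <= latest_read_version) {
        std::cout << "new way: mergeable right after the read at 10:00:05\n";
    }
    return 0;
}
```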
@morningman I see.
```cpp
for (auto rowset : _input_rowsets) {
    if (rowset->end_version() <= _tablet->latest_read_version()
        || rowset->creation_time() + config::cumulative_compaction_skip_window_seconds < now) {
        transient_rowsets.push_back(rowset);
    } else {
        break;
    }
}
```
So, by your design, does config::cumulative_compaction_skip_window_seconds need to be very large? If config::cumulative_compaction_skip_window_seconds is 30s and the query takes 60s, do we still get the -230 error?
No need. After the upgrade, if you leave cumulative_compaction_skip_window_seconds unchanged (default 30 sec), the behavior of cumulative compaction is just the same as before (and even better, because it tries to avoid merging not-read-yet versions). And if you set cumulative_compaction_skip_window_seconds to a large value (e.g., 5 minutes), the behavior depends on your publish situation:

In fact, this parameter is mainly meant to solve the "publish timeout" problem. If the publish time is consistent across all replicas, this config is actually meaningless. If publish timeouts occur frequently, for example with delays of up to 5 minutes, then it is recommended to set this value to more than 5 minutes to solve the problem.
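For example, to cover publish delays of up to 5 minutes, the BE config could be set like this (a sketch; the key is the one discussed in this PR, and 300 seconds is just the 5-minute example above):

```
# be.conf (illustrative)
# Do not let cumulative compaction merge versions younger than 5 minutes,
# so that replicas whose publish is delayed by up to 5 minutes can still read them.
cumulative_compaction_skip_window_seconds = 300
```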
1. For Query with Join, I think we could capture all tablets when olap_scan_node prepares.
2. For publish timeout, I think we should improve publish version performance.
3. I think we could add an MVCC GC mechanism like other database systems (see the sketch after this list). Then we would not need latest_read_version or cumulative_compaction_skip_window_seconds at all, and the logic for picking compaction rowsets would be simple; we would only need one config, unused_rowset_gc_window_seconds.
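A rough sketch of what such a GC could look like (everything here is hypothetical, including the struct layout and the proposed unused_rowset_gc_window_seconds config; the point is only that compaction no longer needs to care about readers, and superseded rowsets are reclaimed after a fixed window):

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical sketch, not Doris code: with an MVCC-style GC, compaction can merge any
// published rowsets immediately, because readers keep shared ownership of the rowsets
// they opened. A superseded ("stale") rowset is only reclaimed after a fixed window.
struct Rowset {
    int64_t end_version = 0;
    int64_t stale_since = -1;  // time the rowset was replaced by a compacted one; -1 = still visible
};

// Single hypothetical knob replacing latest_read_version + cumulative_compaction_skip_window_seconds.
static const int64_t unused_rowset_gc_window_seconds = 1800;

void gc_unused_rowsets(std::vector<std::shared_ptr<Rowset>>* stale_rowsets, int64_t now) {
    auto it = stale_rowsets->begin();
    while (it != stale_rowsets->end()) {
        // Old enough: no new reader should be planned against this rowset any more, so drop it.
        // In-flight readers still hold their own shared_ptr, so they can finish safely.
        if ((*it)->stale_since >= 0 &&
            (*it)->stale_since + unused_rowset_gc_window_seconds < now) {
            it = stale_rowsets->erase(it);
        } else {
            ++it;
        }
    }
}
```

Under this scheme a query planned against an old version keeps reading the stale rowsets it captured, as long as they are younger than the GC window, so the compaction picking logic itself needs no read-awareness at all.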
> 2. For publish timeout, I think we should improve publish version performance.

Yes, publish timeout is the root cause of this problem. This part needs continuous optimization.
> 3. I think we could add an MVCC GC mechanism like other database systems.

If there is still a serious timeout problem, the MVCC mechanism cannot fundamentally solve it, because we still face the problem of how to set an appropriate unused_rowset_gc_window_seconds value. If publish is delayed by 10 minutes, then unused_rowset_gc_window_seconds must be set to more than 10 minutes. However, MVCC can solve the problem of poor read performance caused by the accumulation of data versions during the delay.
In summary, latest_read_version is just a temporary solution that can alleviate the problem, and in most cases there should be no problems. In fact, @chaoyli has written a function that supports multi-path lookup of the version graph; maybe we can implement complete MVCC logic on this basis.
@morningman Yes, I agree with you.
But this problem has existed for a very long time (from 0.9 to 0.12), and we have always used temporary solutions to alleviate it. I think this time we should find a solution that completely solves the problem.
OK, I can leave this PR as an alternative that users can cherry-pick to solve their problem in the short term.
Merging this PR to master is OK. What I want to emphasize is that we should at least discuss and conclude a complete solution this time.
OK, I will create another issue to track this problem.
The problem can also occur when a follower's metadata lags behind the master by a substantial amount of time. We encountered one such case recently and I'd like to share the details.

First of all, the master FE finishes txns quickly and normally, every 5 seconds.
2020-06-17 14:14:08,315 INFO 32 [GlobalTransactionMgr.updateCatalogAfterVisible():1086] set partition 264562638's version to [10210] and version hash to [3776798707510922300]
2020-06-17 14:14:13,256 INFO 32 [GlobalTransactionMgr.updateCatalogAfterVisible():1086] set partition 264562638's version to [10211] and version hash to [6996331758766475117]
2020-06-17 14:14:18,300 INFO 32 [GlobalTransactionMgr.updateCatalogAfterVisible():1086] set partition 264562638's version to [10212] and version hash to [1122327101224378913]
However, one of the followers encountered a transient issue and its metadata lagged behind the master during 14:14~14:16. This caused some queries to pick the obsolete version 10210 before 14:15:47, which then failed with the -230 error code.
2020-06-17 14:14:52,896 INFO 81 [GlobalTransactionMgr.updateCatalogAfterVisible():1086] set partition 264562638's version to [10210] and version hash to [3776798707510922300]
2020-06-17 14:15:47,789 INFO 81 [GlobalTransactionMgr.updateCatalogAfterVisible():1086] set partition 264562638's version to [10211] and version hash to [6996331758766475117]
2020-06-17 14:16:36,344 INFO 81 [GlobalTransactionMgr.updateCatalogAfterVisible():1086] set partition 264562638's version to [10212] and version hash to [1122327101224378913]
2020-06-17 14:16:36,407 INFO 81 [GlobalTransactionMgr.updateCatalogAfterVisible():1086] set partition 264562638's version to [10213] and version hash to [4855950519590045391]
The proposed latest_read_version can't help in this case. I think MVCC is the way to go at the end of the day.
> In fact, @chaoyli has written a function that supports multi-path lookup of the version graph; maybe we can implement complete MVCC logic on this basis.
+1. What's the progress of this work? I'd like to help if possible.
Just started; one of our team members is working on this. We can have a discussion after the investigation.
Describe the bug
The specific meaning of this error can be found in #3270. In PR #3271, we tried to alleviate this problem by introducing a new BE config cumulative_compaction_skip_window_seconds. However, this config does not completely solve the problem, because in the following two cases we cannot accurately set the config value.

1. In a query with a Join operation, the query engine first performs a Build operation on the right table, and only Opens the left table for data reading after the right table has been built. The operation of obtaining the rowset reader according to the version only starts at Open time. If the build phase is very time-consuming, the Open of the left table will be executed long after the query starts, and the version that should be read may already have been merged. The duration of the build phase cannot be estimated, so it is difficult to set a reasonable value for the config cumulative_compaction_skip_window_seconds.

2. Each BE merges its own tablets separately, and only published versions are merged. The Publish here only refers to the publish of the BE's own tablet, not the successful publish of the corresponding load transaction. This causes problems, for example, when the publish of a load transaction is significantly delayed on some replicas.

The above two situations can be alleviated by increasing the cumulative_compaction_skip_window_seconds configuration, but this cannot completely solve the problem.

Resolution
The essence of the problem is that the specified version has already been merged before we read it. We can control the timing of compaction by recording the current largest read version in the tablet.
1. In the tablet, the current latest read version is recorded in the field latest_read_version. This version is set every time the tablet is read, and only the largest version read so far is kept.
2. When compaction selects versions for merging, only data whose version is less than or equal to latest_read_version is selected. This ensures that if a version (or any version larger than it) has not been read yet, it will not be compacted.
3. Of course, under this strategy, compaction would never be triggered if there were no read requests. So we need a threshold to prevent the data from never being merged. We can use cumulative_compaction_skip_window_seconds to configure this threshold. That is, when the threshold time is exceeded, latest_read_version will be ignored and the version merge will start (see the sketch below).

This strategy may have problems in the following scenario: the query frequency is low, e.g., once every 10 minutes, but the load frequency is high, e.g., 10 times per minute. In this case, if cumulative_compaction_skip_window_seconds is set large, a large number of versions may accumulate, e.g., 100 versions; if it is set small, the -230 error may still occur. However, based on actual operation and maintenance experience, setting cumulative_compaction_skip_window_seconds to 5 minutes can solve most problems.
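A minimal sketch of the rules above (only latest_read_version, end_version, creation_time, and cumulative_compaction_skip_window_seconds appear in this issue; the surrounding struct and function names are illustrative):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative stand-ins for a rowset and a tablet; not the real Doris classes.
struct RowsetStub {
    int64_t end_version;
    int64_t creation_time;
};

struct TabletStub {
    int64_t latest_read_version = 0;

    // Read path: every read records the largest version read so far.
    void on_read(int64_t read_version) {
        latest_read_version = std::max(latest_read_version, read_version);
    }

    // Compaction path: pick rowsets that either have already been read
    // (end_version <= latest_read_version) or have waited longer than the
    // skip window, and stop at the first rowset that satisfies neither.
    std::vector<RowsetStub> pick_input_rowsets(const std::vector<RowsetStub>& candidates,
                                               int64_t now,
                                               int64_t skip_window_seconds) const {
        std::vector<RowsetStub> picked;
        for (const auto& rs : candidates) {
            if (rs.end_version <= latest_read_version ||
                rs.creation_time + skip_window_seconds < now) {
                picked.push_back(rs);
            } else {
                break;
            }
        }
        return picked;
    }
};
```

Here on_read corresponds to "this version is set every time the tablet is read", and pick_input_rowsets mirrors the selection loop quoted earlier in this thread.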