apache / doris

Apache Doris is an easy-to-use, high performance and unified analytics database.
https://doris.apache.org
Apache License 2.0

[Proposal] Resolve error "fail to init reader.res=-230" #3859

Open morningman opened 4 years ago

morningman commented 4 years ago

Describe the bug

The specific meaning of this error can be found in #3270. In PR #3271, we tried to alleviate this problem by introducing a new BE config cumulative_compaction_skip_window_seconds.

However, this config does not completely solve the problem, because in the following two cases we cannot accurately set its value.

  1. Query with Join

In a query with a Join operation, the query engine first performs a Build operation on the right table, and only Opens the left table for data reading after the right table has been built. The rowset reader for the specified version is only obtained at Open time. If the build phase is very time-consuming, the Open of the left table will be executed long after the query starts, and the version that should be read may have already been merged by then. Since the duration of the build phase cannot be estimated, it is difficult to set a reasonable value for the config cumulative_compaction_skip_window_seconds.

  2. Transaction publish delay

Each BE compacts its own tablets independently, and only published versions are merged. The publish here refers only to the publish of the BE's own tablet replica, not to the successful publish of the corresponding load transaction. This causes problems such as the following:

1. Assume that the version corresponding to a load transaction is 10.
2. The tablet involved has three replicas. One replica is published successfully at 10:00, and the other two are published successfully at 10:10, so the entire load transaction finishes at 10:10. In other words, users can query version 10 of the data starting from 10:10.
3. Because one replica was already published at 10:00, it may receive the publish task for version 11 at 10:02, and at 10:05 a compaction may merge the rowsets into the [0-11] version.
4. When the user queries version 10 at 10:10, that replica reports the "-230" error.

The above two situations can be alleviated by increasing the cumulative_compaction_skip_window_seconds configuration, but:

1. A larger value may cause versions to back up, so compaction is not timely.
2. It is just an empirical value and cannot completely solve the problem.

Resolution

The essence of the problem is that the specified version has already been merged before we get to read it. We can control the timing of compaction by recording the current largest read version in the tablet.

The tablet records the current latest read version in a field latest_read_version. This field is updated every time the tablet is read, and only the largest version read so far is kept.
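
A minimal sketch of how a tablet could maintain this value, assuming illustrative class and member names (not the actual Doris BE API):

    #include <atomic>
    #include <cstdint>

    // Illustrative sketch only: the tablet keeps the largest version any reader
    // has requested so far. Names are assumptions, not the real Doris BE code.
    class Tablet {
    public:
        // Called on every read, e.g. when a rowset reader is obtained for `version`.
        void update_latest_read_version(int64_t version) {
            int64_t cur = _latest_read_version.load(std::memory_order_relaxed);
            // Only move the value forward: keep the largest version read so far.
            while (version > cur &&
                   !_latest_read_version.compare_exchange_weak(cur, version)) {
                // On failure, `cur` is reloaded with the current value; retry.
            }
        }

        int64_t latest_read_version() const {
            return _latest_read_version.load(std::memory_order_relaxed);
        }

    private:
        std::atomic<int64_t> _latest_read_version{0};
    };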

When compaction selects versions for merging, only rowsets whose version is less than or equal to latest_read_version are selected. This ensures that if a version (or any version larger than it) has not been read yet, that version will not be compacted.

Of course, under this strategy, if there are no read requests, compaction will never be triggered. So we need to set a threshold to prevent the data from never being merged. We can use cumulative_compaction_skip_window_seconds to configure this threshold. That is, when the threshold time is exceeded, latest_read_version will be ignored and the version merge will start. A sketch of the resulting selection rule is given below.
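
This is only an illustration under assumed names for the rowset accessors and config, not the actual BE implementation:

    #include <cstdint>
    #include <ctime>

    // Illustrative pick rule: a rowset can be chosen for cumulative compaction
    // if it has already been read (its end version is covered by
    // latest_read_version), OR if it is older than the configured skip window.
    bool can_pick_for_cumulative_compaction(int64_t rowset_end_version,
                                            time_t rowset_creation_time,
                                            int64_t latest_read_version,
                                            int64_t skip_window_seconds,
                                            time_t now) {
        bool already_read = rowset_end_version <= latest_read_version;
        bool older_than_window = rowset_creation_time + skip_window_seconds < now;
        return already_read || older_than_window;
    }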

This strategy may still have problems in the following scenario: the query frequency is low, e.g., once every 10 minutes, but the load frequency is high, e.g., 10 times per minute. In this case, if cumulative_compaction_skip_window_seconds is set to a large value, a large number of versions may accumulate, e.g., 100 versions; if it is set to a small value, the -230 error may still occur. However, in terms of actual operation and maintenance experience, setting cumulative_compaction_skip_window_seconds to 5 minutes can basically solve most problems.

fanlongmengjd commented 4 years ago

According to the proposal: We can use cumulative_compaction_skip_window_seconds to configure this threshold. That is, when the threshold time is exceeded, latest_read_version will be ignored and the version merge will start.

It sounds like the compaction behavior is the same even with latest_read_version added, e.g.:

  • previously, if cumulative_compaction_skip_window_seconds = 30, then compaction runs every 30s
  • now, with cumulative_compaction_skip_window_seconds = 30 plus latest_read_version, the system will still do compaction every 30s and latest_read_version will be ignored

right?

kangkaisen commented 4 years ago

According to the proposal: We can use cumulative_compaction_skip_window_seconds to configure this threshold. That is, when the threshold time is exceeded, latest_read_version will be ignored and the version merge will start.

It sounds like the compaction behavior is the same even with latest_read_version added, e.g.:

  • previously, if cumulative_compaction_skip_window_seconds = 30, then compaction runs every 30s
  • now, with cumulative_compaction_skip_window_seconds = 30 plus latest_read_version, the system will still do compaction every 30s and latest_read_version will be ignored

right?

+1. I have the same question.

morningman commented 4 years ago
  • now, with cumulative_compaction_skip_window_seconds = 30 plus latest_read_version, the system will still do compaction every 30s and latest_read_version will be ignored

No. With latest_read_version, compaction may actually be more frequent. For example:

  1. The current version is 10, published at 10:00:00.
  2. The following versions 11 and 12 are published at 10:00:02 and 10:00:04.
  3. A query request with read version 12 arrives at 10:00:05.

In the old way, version 12 can only be merged after 10:00:34 (its creation time at 10:00:04 plus the 30-second window). In the new way, version 12 can be merged right after the read at 10:00:05. A toy calculation of this difference is shown below.
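
The following is only an illustration of the arithmetic, using the numbers above and an assumed 30-second window (not BE code):

    #include <algorithm>
    #include <cstdio>

    // Toy model of when a version becomes eligible for cumulative compaction.
    // Old rule: only after creation_time + skip_window has passed.
    // New rule: as soon as it has been read, or after the same window as a fallback.
    int old_rule_eligible_at(int creation_second, int skip_window) {
        return creation_second + skip_window;
    }

    int new_rule_eligible_at(int creation_second, int skip_window, int first_read_second) {
        return std::min(creation_second + skip_window, first_read_second);
    }

    int main() {
        // Version 12: created at second 4 (10:00:04), window 30s, first read at second 5 (10:00:05).
        std::printf("old rule: eligible at second %d\n", old_rule_eligible_at(4, 30));    // 34 -> 10:00:34
        std::printf("new rule: eligible at second %d\n", new_rule_eligible_at(4, 30, 5)); // 5  -> 10:00:05
        return 0;
    }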

kangkaisen commented 4 years ago

@morningman I see.

    for (auto rowset : _input_rowsets) {
        if (rowset->end_version() <= _tablet->latest_read_version()
            || rowset->creation_time() + config::cumulative_compaction_skip_window_seconds < now) {
            transient_rowsets.push_back(rowset);
        } else {
            break;
        }
    }

So, by your design, does config::cumulative_compaction_skip_window_seconds need to be very large? If config::cumulative_compaction_skip_window_seconds is 30s and the query takes 60s, will we still get the -230 error?

morningman commented 4 years ago

@morningman I see.

    for (auto rowset : _input_rowsets) {
        if (rowset->end_version() <= _tablet->latest_read_version()
            || rowset->creation_time() + config::cumulative_compaction_skip_window_seconds < now) {
            transient_rowsets.push_back(rowset);
        } else {
            break;
        }
    }

So, by your design, does config::cumulative_compaction_skip_window_seconds need to be very large? If config::cumulative_compaction_skip_window_seconds is 30s and the query takes 60s, will we still get the -230 error?

No need. After the upgrade, if you leave cumulative_compaction_skip_window_seconds unchanged (default 30 sec), the behavior of cumulative compaction is the same as before (and even better, because it tries to avoid merging versions that have not been read yet).

And if you set cumulative_compaction_skip_window_seconds to a large value (e.g., 5 minutes), the behavior depends on the query frequency:

  1. If the query frequency is reasonable (e.g., at least once per minute), then this config has no effect in most cases.
  2. If the query frequency is low (e.g., once every 10 minutes), then compaction will run every 5 minutes, which behaves the same as before.
morningman commented 4 years ago

In fact, this parameter is mainly to solve the "publish timeout" problem.

If the publish time is consistent across all replicas, then this config is actually meaningless.

If publish timeouts occur frequently, for example with delays of up to 5 minutes, then it is recommended to set this value to more than 5 minutes to solve the problem.

kangkaisen commented 4 years ago
  1. For queries with Join, I think we could capture all tablets when olap_scan_node prepares.

  2. For publish timeout, I think we should improve publish version performance.

  3. I think we could add an MVCC GC mechanism like other database systems. Then we would need neither latest_read_version nor cumulative_compaction_skip_window_seconds, the logic for picking compaction rowsets would be simple, and we would only need one config, unused_rowset_gc_window_seconds. A rough sketch of the idea follows.
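
What follows is only an idea sketch under a hypothetical unused_rowset_gc_window_seconds config and illustrative rowset bookkeeping, not existing Doris code:

    #include <cstdint>
    #include <ctime>

    // Hypothetical MVCC-style rowset GC: after compaction, the old ("stale")
    // rowsets are kept instead of being deleted immediately, so in-flight readers
    // holding an older version can still find them. A stale rowset is only
    // garbage-collected once it is unreferenced and has been unused for longer
    // than unused_rowset_gc_window_seconds. All names here are illustrative.
    struct StaleRowset {
        int64_t end_version;
        time_t last_used_time;   // last time any reader referenced this rowset
        int32_t reader_refs;     // number of readers currently holding it
    };

    bool can_gc_stale_rowset(const StaleRowset& rs,
                             int64_t unused_rowset_gc_window_seconds,
                             time_t now) {
        // Never GC a rowset that is still referenced by a reader.
        if (rs.reader_refs > 0) {
            return false;
        }
        // Otherwise GC it only after it has been unused for the whole window.
        return rs.last_used_time + unused_rowset_gc_window_seconds < now;
    }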

morningman commented 4 years ago

2. For publish timeout, I think we should improve publish version performance.

Yes, publish timeout is the root cause of this problem. This part needs continuous optimization.

3. I think we could add an MVCC GC mechanism like other database systems.

If there is still a serious timeout problem, the MVCC mechanism cannot fundamentally solve it, because we still face the problem of how to set an appropriate unused_rowset_gc_window_seconds value: if publish is delayed by 10 minutes, then unused_rowset_gc_window_seconds must be set to more than 10 minutes. However, MVCC can solve the problem of poor read performance caused by the accumulation of data versions during the delay.

In summary, latest_read_version is just a temporary solution that can alleviate the problem, and in most cases there should be no issues. In fact, @chaoyli has written a function that supports multi-path lookup of the version graph; maybe we can implement complete MVCC logic on this basis.

kangkaisen commented 4 years ago

In summary, latest_read_version is just a temporary solution that can alleviate the problem, and in most cases there should be no issues.

@morningman Yes, I agree with you.

But this problem has existed for a very long time (from 0.9 to 0.12), and we have always used temporary solutions to alleviate it. I think this time we should find a solution that completely solves it.

morningman commented 4 years ago

OK, I can leave this PR as an alternative that users can cherry-pick to solve their problem in the short term.

kangkaisen commented 4 years ago

OK, I can leave this PR as an alternative that users can cherry-pick to solve their problem in the short term.

Merging this PR to master is OK. What I want to emphasize is that we should at least discuss and conclude a complete solution this time.

morningman commented 4 years ago

OK, I can leave this PR as an alternative that users can cherry-pick to solve their problem in the short term.

Merging this PR to master is OK. What I want to emphasize is that we should at least discuss and conclude a complete solution this time.

OK, I will create another issue to track this problem.

gaodayue commented 4 years ago

The problem can also occur when a follower FE's metadata lags behind the master's by a substantial amount of time. We encountered one case recently and I'd like to share the details.

First of all, the master FE finishes the txns quickly and normally, every 5 seconds:

2020-06-17 14:14:08,315 INFO 32 [GlobalTransactionMgr.updateCatalogAfterVisible():1086] set partition 264562638's version to [10210] and version hash to [3776798707510922300]
2020-06-17 14:14:13,256 INFO 32 [GlobalTransactionMgr.updateCatalogAfterVisible():1086] set partition 264562638's version to [10211] and version hash to [6996331758766475117]
2020-06-17 14:14:18,300 INFO 32 [GlobalTransactionMgr.updateCatalogAfterVisible():1086] set partition 264562638's version to [10212] and version hash to [1122327101224378913]

However, one of the followers encountered a transient issue and its metadata lagged behind the master during 14:14~14:16. This caused some queries to pick the obsolete version 10210 before 14:15:47, which then failed with the -230 error code.

2020-06-17 14:14:52,896 INFO 81 [GlobalTransactionMgr.updateCatalogAfterVisible():1086] set partition 264562638's version to [10210] and version hash to [3776798707510922300]
2020-06-17 14:15:47,789 INFO 81 [GlobalTransactionMgr.updateCatalogAfterVisible():1086] set partition 264562638's version to [10211] and version hash to [6996331758766475117]
2020-06-17 14:16:36,344 INFO 81 [GlobalTransactionMgr.updateCatalogAfterVisible():1086] set partition 264562638's version to [10212] and version hash to [1122327101224378913]
2020-06-17 14:16:36,407 INFO 81 [GlobalTransactionMgr.updateCatalogAfterVisible():1086] set partition 264562638's version to [10213] and version hash to [4855950519590045391]

The proposed latest_read_version can't help in this case. I think MVCC is the way to go at the end of the day.

In fact, @chaoyli has written a function that supports multi-path lookup of the version graph; maybe we can implement complete MVCC logic on this basis.

+1. What's the progress of this work? I'd like to help if possible.

morningman commented 4 years ago

+1. What's the progress of this work? I'd like to help if possible.

Just started; one of our team members is working on it. We can have a discussion after the investigation.