apache / doris

Apache Doris is an easy-to-use, high performance and unified analytics database.
https://doris.apache.org
Apache License 2.0

[Feature] Doris FileCache memory info vs disk info consistency check #41280

Open freemandealer opened 1 month ago

freemandealer commented 1 month ago

Description

Occasionally, we have seen disk cache data escape the management of the Doris file cache, causing disk space leaks. To make debugging easier, we need a checking tool that compares the contents of the Doris file cache's in-memory management structures with the current disk contents and identifies the differences between the two (which are potentially problematic data).

To better understand how file cache works, please refer to: https://doris.apache.org/zh-CN/docs/dev/compute-storage-decoupled/file-cache/ and https://www.bilibili.com/video/BV1ath9eGEqL

Basic Ideas

Because the cache changes rapidly, we should freeze it (via a lock) to get a snapshot of its current state.

Then parse the snapshot to determine which data should be cached.

Next, scan the disk (while the cache is still frozen) to see which data actually exists.

Finally, compare the two and print the diff to the logs.
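The four steps above can be sketched roughly as follows. This is only an illustration, not Doris code: ToyCache, Diff, check_consistency and the scan_disk callback are hypothetical names, and the cache state is reduced to a set of path strings.

```cpp
#include <algorithm>
#include <functional>
#include <iterator>
#include <mutex>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the cache manager; not the real BlockFileCache.
struct ToyCache {
    std::mutex mutex;               // guards `tracked`
    std::set<std::string> tracked;  // paths the cache believes it owns
};

struct Diff {
    std::vector<std::string> missing_on_disk;    // tracked in memory, absent on disk
    std::vector<std::string> missing_in_memory;  // on disk but untracked (leak candidates)
};

// `scan_disk` is a caller-supplied function that returns the paths found on disk.
Diff check_consistency(ToyCache& cache,
                       const std::function<std::set<std::string>()>& scan_disk) {
    // Freeze the cache for the whole check so memory and disk are compared
    // against the same point in time.
    std::lock_guard<std::mutex> guard(cache.mutex);
    const std::set<std::string>& in_memory = cache.tracked;
    const std::set<std::string> on_disk = scan_disk();

    Diff diff;
    // std::set iterates in sorted order, which std::set_difference requires.
    std::set_difference(in_memory.begin(), in_memory.end(),
                        on_disk.begin(), on_disk.end(),
                        std::back_inserter(diff.missing_on_disk));
    std::set_difference(on_disk.begin(), on_disk.end(),
                        in_memory.begin(), in_memory.end(),
                        std::back_inserter(diff.missing_in_memory));
    return diff;
}
```

Holding the lock during the disk scan keeps the comparison exact, at the cost of blocking cache operations for the duration of the scan.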

Implementation Tips

We could use a RESTful API to trigger the check. FYI, see be/src/http/action/file_cache_action.cpp for more details on RESTful API support in Doris.

If you run into any trouble ...

Do not hesitate to contact me on WeChat: 15811301868

Lupinus commented 1 week ago

I'd like to work on it, please assign it to me

freemandealer commented 1 week ago

> I'd like to work on it, please assign it to me

Sure, thanks for your participation, and welcome to the Doris community.

Lupinus commented 4 days ago

The main idea is to scan the disk once and then scan _files.

In FileCacheStorage, add a virtual function unordered_set checkConsistency(BlockFileCache* _mgr, lambda handler); where handler is a lambda used to handle inconsistent AccessKeyAndOffset entries, recording any inconsistencies found. An inconsistency means that an AccessKeyAndOffset exists only in either BlockFileCache or FSFileCacheStorage.

In the implementation of checkConsistency in FSFileCacheStorage, the main task is to iterate through the fileBlock directory items under _cache_base_path, checking for their existence in _files of BlockFileCache and whether their sizes are consistent. If an entry does not exist, the handler is called; if it exists, it is recorded in an unordered_set (used for the return value).

In BlockFileCache, add a function checkConsistency, which has two main parts. The first part calls the _storage’s checkConsistency, obtaining its return value (an unordered_set that records which AccessKeyAndOffset entries have already been found during the disk scan). The second part iterates through _files, and if any item is not found in the unordered_set, it calls the handler to record this inconsistency, ultimately returning these inconsistent items.
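A minimal sketch of this two-pass scheme, under simplifying assumptions: each block is reduced to a string key with a size, and both the on-disk entries and _files are modelled as plain maps. BlockKey, BlockSizes, Handler, check_storage and check_cache are hypothetical names used only for illustration and do not reflect the actual Doris signatures.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical simplification: a block is identified by a string key.
using BlockKey = std::string;
using BlockSizes = std::unordered_map<BlockKey, std::size_t>;
// Called once per inconsistency with the key and a short reason.
using Handler = std::function<void(const BlockKey&, const std::string&)>;

// Pass 1 (storage side): walk the on-disk entries, report those unknown to
// memory or with a different size, and return the keys found on both sides.
std::unordered_set<BlockKey> check_storage(const BlockSizes& on_disk,
                                           const BlockSizes& in_memory,
                                           const Handler& handler) {
    std::unordered_set<BlockKey> seen;
    for (const auto& entry : on_disk) {
        auto it = in_memory.find(entry.first);
        if (it == in_memory.end()) {
            handler(entry.first, "missing in memory");
            continue;
        }
        if (it->second != entry.second) {
            handler(entry.first, "size mismatch");
        }
        seen.insert(entry.first);
    }
    return seen;
}

// Pass 2 (cache side): anything tracked in memory that pass 1 did not see
// on disk is reported as missing on disk.
void check_cache(const BlockSizes& on_disk, const BlockSizes& in_memory,
                 const Handler& handler) {
    const std::unordered_set<BlockKey> seen =
        check_storage(on_disk, in_memory, handler);
    for (const auto& entry : in_memory) {
        if (seen.count(entry.first) == 0) {
            handler(entry.first, "missing on disk");
        }
    }
}
```

Returning the set of matched keys from the first pass lets the second pass run as a single lookup per in-memory entry, so each side is traversed only once.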

In terms of the API, in FileCacheAction, add two types of operations. One is to input a path and check the consistency of that path, which essentially calls BlockFileCache's checkConsistency. The second is to obtain all paths (to facilitate the use of the first operation). Is it ok for the API's return to be inconsistent file names and offsets? Any suggestions regarding function naming? It is a frustrating issue.

freemandealer commented 3 days ago

hi @Lupinus

> In FileCacheStorage, add a virtual function unordered_set checkConsistency(BlockFileCache* _mgr, lambda handler)

We don't need _mgr as its parameter since the call chain is as follows: FileCacheAction -> BlockFileCache(holding _mutex) -> specific storage

> checking for their existence in _files of BlockFileCache and whether their sizes are consistent.

Additionally, please consider the consistency of their metadata. Please take a look at FileBlock::cache_type() & FileBlock::expiration_time(). This metadata is encoded into the directory name and file name in the filesystem; refer to FSFileCacheStorage::load_cache_info_into_memory for details.

BTW, when using unordered_map, the map key should include all of the above-mentioned info, because the file path itself could be duplicated (but with different cache types or expiration times).
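One way such a composite key could look is sketched below. BlockIdentity, BlockIdentityHash and their field names and types are hypothetical stand-ins, not Doris's actual types; the point is only that equality and hashing cover the cache type and expiration time as well as the path and offset.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_set>

// Hypothetical key; field names and types are illustrative only.
struct BlockIdentity {
    std::string path;
    std::size_t offset;
    int cache_type;                 // stand-in for FileBlock::cache_type()
    std::uint64_t expiration_time;  // stand-in for FileBlock::expiration_time()

    bool operator==(const BlockIdentity& other) const {
        return path == other.path && offset == other.offset &&
               cache_type == other.cache_type &&
               expiration_time == other.expiration_time;
    }
};

// Mixes one more hash value into a running seed (boost-style hash_combine).
inline std::size_t combine(std::size_t seed, std::size_t value) {
    return seed ^ (value + 0x9e3779b9 + (seed << 6) + (seed >> 2));
}

struct BlockIdentityHash {
    std::size_t operator()(const BlockIdentity& key) const {
        std::size_t h = std::hash<std::string>()(key.path);
        h = combine(h, std::hash<std::size_t>()(key.offset));
        h = combine(h, std::hash<int>()(key.cache_type));
        h = combine(h, std::hash<std::uint64_t>()(key.expiration_time));
        return h;
    }
};

using BlockIdentitySet = std::unordered_set<BlockIdentity, BlockIdentityHash>;
```

With this key, two entries that share a path and offset but differ in cache type or expiration time are stored as distinct elements instead of overwriting each other.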

> Is it ok for the API's return to be inconsistent file names and offsets?

Given that an inconsistency can fall into two categories, i.e. missing in _files vs. missing in the filesystem, we should point that out along with the file path (the full path, not the file name alone) in the HTTP response.
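As a rough illustration of such a response body, each inconsistency could carry its category next to the full file path. The type names, the category labels, and the one-line-per-entry text format below are all assumptions made for this sketch; nothing here is an existing Doris response format.

```cpp
#include <string>
#include <vector>

// Hypothetical categories matching the two cases described above.
enum class InconsistencyKind { MissingInFiles, MissingInFilesystem };

struct Inconsistency {
    InconsistencyKind kind;
    std::string file_path;  // full path, not just the file name
};

// Maps a category to the label printed in the report.
std::string kind_name(InconsistencyKind kind) {
    switch (kind) {
    case InconsistencyKind::MissingInFiles:
        return "MISSING_IN_FILES";
    case InconsistencyKind::MissingInFilesystem:
        return "MISSING_IN_FILESYSTEM";
    }
    return "UNKNOWN";
}

// Renders one "<category>: <path>" line per inconsistency.
std::string render_report(const std::vector<Inconsistency>& items) {
    std::string out;
    for (const auto& item : items) {
        out += kind_name(item.kind);
        out += ": ";
        out += item.file_path;
        out += "\n";
    }
    return out;
}
```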

> Any suggestions regarding function naming? It is a frustrating issue.

No problem with the naming. And what do you mean by 'a frustrating issue'? Is it too easy or hard for you? If there is any problem with the issue itself, please help me improve it. I appreciate your help in advance.

Lupinus commented 3 days ago

Sorry for not expressing myself clearly. By "issue" I meant the problem of naming a function.