dmwm / dbs2go

DBS server written in Go
MIT License

DBS R&D for large tables #84

Open vkuznet opened 2 years ago

vkuznet commented 2 years ago

With the growth of DBS data, we need to perform R&D to address large tables.

vkuznet commented 2 years ago

Here is a brief plan of R&D activities we need to perform:

d-ylee commented 2 years ago

As per our discussion, the reason for doing this is that the HTTP front-end API has a 5-minute timeout. Injection of FileLumis is limited to 2-3M records per block before the timeout is hit. Fetching also takes longer as the amount of data grows.

We first need to evaluate both MongoDB and ElasticSearch. This requires fetching FileLumis from the current deployments and injecting them into each candidate.

SQL For reference:

amaltaro commented 2 years ago

@d-ylee @vkuznet based on the information above, should we try to limit block sizes, in terms of number of lumis, to at most 1M lumis? Maybe we even cap it at 0.5M lumis per block? Once we decide on the threshold, we should feed it back to this WMCore GH issue: https://github.com/dmwm/WMCore/issues/10264

vkuznet commented 2 years ago

It will certainly be helpful to put a limit on the number of lumis: so far there is none, so a block can potentially go above the limit on the FEs. Based on an initial benchmark of the time taken by the bulkblocks insertion API, injection stays within 5 minutes if the number of lumis does not exceed a few million, e.g. 2-3M. Therefore, a limit of 1M is good to have in place. To improve performance it would be better to limit it further to 0.5M, but I do not know whether that would have any side effects on the DM side.

vkuznet commented 2 years ago

In addition to the reasoning @d-ylee mentioned, this R&D will explore the possibility of adding more unstructured meta-data to DBS information. Recently, we listened to I. Mandrichenko's talk "MetaCat - meta-data catalog for Rucio-based data management system", where he argued that run conditions and file provenance meta-data can be stored as unstructured data in a NoSQL DB, which can provide better query performance than structured DBS information.