dmwm / DBS

CMS Dataset Bookkeeping Service
Apache License 2.0

DBS API BulkBlock input size control #599

Open yuyiguo opened 5 years ago

yuyiguo commented 5 years ago

@bbockelm @amaltaro @belforte @vkuznet and @all The DBS database has become bigger with time and the DBS servers are getting more load. We no longer have the luxury of loading huge files. The most recent issue was a block with 500 files, sized about 200 MB, with 1,643,229 lumi sections. This block could not even be loaded through the front end in one request.

Now it is time to start looking into what limits DBS should impose. What are reasonable limits? Limits on block size, number of files, and number of lumi sections in a block?

Currently, WMAgent has a limit of 500 files per block in total, but file sizes vary a lot. I am not sure what limit CRAB puts in.

vkuznet commented 5 years ago

Yuyi, I think you answered your own question: it is not normal that the frontend gets stuck processing a 200 MB request, and it is not normal that DBS struggles either. My feeling is that we should introduce throttling at both levels.

The frontend throttling will slow down overly frequent clients, while DBS-level throttling can weigh individual clients' usage patterns.

For Apache we can use mod_evasive or mod_throttle, while for the DBS backend I didn't find an off-the-shelf CherryPy solution, so we should probably write our own. For Flask there is http://flask.pocoo.org/snippets/70/

In the past I already set up everything for mod_evasive, which is part of the cmsdist repository now; we just need to revisit its specs and configuration.

Valentin.
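A backend-level throttle of the kind proposed here could be as simple as a per-client token bucket. A minimal sketch, with all names hypothetical and not part of any DBS code:

```python
import time

class TokenBucket:
    """Per-client token bucket: `rate` tokens/second refill, burst up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self, cost=1.0):
        """Return True if the request may proceed, consuming `cost` tokens."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate=0, capacity=2)  # rate=0: no refill, so only 2 calls pass
print([bucket.allow() for _ in range(3)])  # → [True, True, False]
```

In a real deployment one bucket would be kept per client identity (e.g. DN) and the `cost` could be weighted by request size, which matches the idea of weighing clients' patterns.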

belforte commented 5 years ago

About the last point: CRAB Publisher is currently configured for 100 files/block. There is also a limit of 100k on how many lumis can be in an input block. Since one job cannot cross block boundaries, this gives a maximum of 100k lumis in one output file if someone processes data which is "at the edge" and wants the output in DBS.
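Enforced client-side, those two thresholds amount to a simple pre-publication check. A sketch, where the helper name and the data layout are assumptions rather than actual CRAB code:

```python
def check_block_limits(files, max_files=100, max_lumis=100000):
    """Pre-publication sanity check. `files` maps LFN -> list of (run, lumi)
    pairs; the default limits mirror the CRAB Publisher values quoted above."""
    if len(files) > max_files:
        return False, "too many files: %d > %d" % (len(files), max_files)
    nlumis = sum(len(lumis) for lumis in files.values())
    if nlumis > max_lumis:
        return False, "too many lumis: %d > %d" % (nlumis, max_lumis)
    return True, "ok"

ok, reason = check_block_limits({"lfn1": [(1, 1), (1, 2)]})
print(ok, reason)  # → True ok
```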

Is this the time to question whether and how we want to store the lumi list for NanoAOD? So far the above limitation results in "can read NanoAOD with CRAB, but only with splitByFile". But I suspect nobody has tried to store in DBS the output of "NanoAOD skimming", which would result in even more lumis/file. In the end someone could have a search analysis producing one file with maybe 100 events, but all lumis in all CMS runs!

belforte commented 5 years ago

Valentin, protecting DBS from code which runs astray is good, but here we also need to define how we should use DBS so that it keeps working smoothly for us. Breaking large inputs into pieces may avoid FE timeouts, but do we really need to push those enormous JSON lists into Oracle?

amaltaro commented 5 years ago

Thanks for starting this discussion, Yuyi. Another possibility would be to break this bulkBlock API into multiple different APIs, such that we send less data in each call (of course, at the cost of a higher number of HTTP requests and more micromanagement on the client side). It looks like we could have a separate insert API for file_conf_list, file_parent_list and files (which contains the lumis).
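On the client side, such a split might look like the sketch below. This is a hypothetical scheme (DBS3 today has only the single insertBulkBlock call); the section names follow the bulkBlock payload layout mentioned in the comment:

```python
def split_bulk_payload(block_dump):
    """Split a bulkBlock-style dict into three smaller payloads, one per heavy
    section, each carrying the common header (block, dataset info, etc.)."""
    sections = ("file_conf_list", "file_parent_list", "files")
    header = {k: v for k, v in block_dump.items() if k not in sections}
    return [dict(header, **{s: block_dump.get(s, [])}) for s in sections]

payloads = split_bulk_payload({
    "block": {"block_name": "/a/b/c#1"},
    "file_conf_list": [],
    "file_parent_list": [],
    "files": [{"lfn": "f1"}, {"lfn": "f2"}],
})
print(len(payloads))  # → 3
```

Each payload would then go to its own (hypothetical) insert endpoint, trading body size for request count exactly as described above.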

We also have to come up with better thresholds for the clients (aka CRAB and WMAgent). Imposing these limitations will always be sub-optimal, though, and people need to be aware that it won't come for free (e.g. small blocks here and there).

BTW, that JSON is very large because we post the info to DBS with the keys/binds already formatted for the DAO; that's why the volume is large (it saves quite some CPU cycles on the DBS server side).

belforte commented 5 years ago

Alan, which kind of dataset was that huge block for? I do not see how 500 files could be a problem, but 1.6M lumis in ASCII-formatted JSON really sounds like a lot to digest. Why do we store the lumi list in a relational DB? Is it only for answering the question "give me the file(s) in this dataset which contain lumi X from run R"? I do not see that question as useful for highly compacted data tiers.

vkuznet commented 5 years ago

On this subject, how are lumis provided, as an array of ints? Can we use ranges of lumis, which may significantly reduce the size of the uploaded document?
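Range encoding of the lumi list is straightforward to sketch (the helper name is hypothetical, not a DBS function):

```python
def lumis_to_ranges(lumis):
    """Compress a list of lumi numbers into sorted inclusive [first, last] ranges."""
    ranges = []
    for lumi in sorted(set(lumis)):
        if ranges and lumi == ranges[-1][1] + 1:
            ranges[-1][1] = lumi          # extend the current range
        else:
            ranges.append([lumi, lumi])   # start a new range
    return ranges

print(lumis_to_ranges([1, 2, 3, 7, 8, 10]))  # → [[1, 3], [7, 8], [10, 10]]
```

As noted further down in the thread, the actual gain depends on how contiguous the lumis are; for heavily gapped lists the ranges may not save much.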

belforte commented 5 years ago

CRAB uses the format from this example to fill the structure to be passed to insertBulkBlock: https://github.com/dmwm/DBS/blob/master/Client/tests/dbsclient_t/unittests/blockdump.dict since we could not find any other documentation. In that format every file is a list of dictionaries, one of which is a list of {'run': int, 'lumi': int}. We have not touched that code since it was written early in DBS3 history.

I would distinguish three things here:

  1. how to pass that list efficiently (ranges may not gain more than a few O(1) factors since there are many gaps; lumis are scattered almost at random in the initial RAW files)
  2. how to store that information (i.e. which kind of query and/or retrieval do we want)
  3. when to store it (i.e. for which files/datasets)
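On point 1, even before any range encoding, the verbose per-lumi dicts can be folded into a per-run map. A sketch, where the key names mirror the {'run': int, 'lumi': int} form quoted above and the helper itself is an assumption:

```python
def group_by_run(run_lumis):
    """Fold verbose [{'run': r, 'lumi': l}, ...] entries into a compact
    {run: sorted list of lumis} map, removing the repeated dict keys."""
    grouped = {}
    for entry in run_lumis:
        grouped.setdefault(entry["run"], []).append(entry["lumi"])
    for lumis in grouped.values():
        lumis.sort()
    return grouped

print(group_by_run([{"run": 1, "lumi": 2}, {"run": 1, "lumi": 1}, {"run": 2, "lumi": 5}]))
# → {1: [1, 2], 2: [5]}
```
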
yuyiguo commented 5 years ago

Thanks all for the discussion here. For the huge block of 500 files, you may find some details at https://its.cern.ch/jira/browse/CMSCOMPPR-5196.

Regarding the input data format, the link Stefano pointed out is the current input requirement. We designed this format because we want the data to be inserted without reformatting inside DBS. If this format is the problem, we can definitely redesign it to reduce the input volume. However, we have the 300-second limit imposed at the server; reformatting the data will increase the time and memory used in DBS. What do we want to trade here? Even if we reduce the data volume passed into DBS, inserting 1.6 million lumis into DBS is still a big challenge. So my idea is to find a balance.

yuyiguo commented 5 years ago

Breaking the bulkBlock insertion into multiple APIs was what DBS2 did. All the people who experienced DBS2 know what the problems were; I will not rehash them here. I do not think we want to go back down that route unless we really want to redesign DBS for DBS4.

belforte commented 5 years ago

Proposal: WMA and CRAB should set a limit of 10k lumis per block. Period. Let's see what breaks, if anything, and if it really needs to be fixed, let's fix it in a different way than redesigning DBS.

I do not know about WMA, but CRAB will refuse to read the lumi list for a block which has more than 100k lumis; I really do not see the point in creating that monster.

Then looking at this long and confused thread (thanks Yuyi) https://github.com/dmwm/DBS/issues/599:

IIUC this crap originates from some GS (GenSim?) things where the #lumis/job must clearly have been set completely wrong. If that's the case, the only cure is to detect it as early as possible and send the bad input back upstream as quickly as possible. Trying to accommodate any silly request that comes our way is not good.

In https://its.cern.ch/jira/browse/CMSCOMPPR-5196 at some point JR correctly points out that 3 ev/lumi is silly. Lumis in Gen are ONLY there to allow processing the output at a sub-file level using split-by-lumi; 3 ev/lumi makes no sense and must be rejected upfront. Why did we process this anyhow? Why do people do ACDC for GenSim?

We would be better off revisiting the lumi-in-Gen thing and no longer putting lumis there, since unlike data those lumis do not come from real life (# of seconds), and there is no limit to the problems we can get from wrong configurations. Why do we have to spend time debugging how to insert such a block into DBS?

We should not just push things around blindly until they somehow "go", but find the core issue at the root and solve it there. Where's the DESIGN part here?

P.S. Yet I am glad this came about, because I asked long ago for a DBS-side defined limit on what it could handle, so that I could enforce it CRAB-side, but could not get an answer. Although I understand that the first limit comes from the 5-minute CMSWEB FE timeout, so large requests may never even reach the DBS BE. In the new CMSWEB architecture this may change, but I would not like clients to keep connections open for minutes anyhow.

belforte commented 5 years ago

And clearly for a GenSim dataset there is absolutely no reason to be prepared to answer "give me the file which contains lumi number X". Why do we push that list into an Oracle table? Masochism?

vkuznet commented 5 years ago

I had a quick look at the data format: it is WAY TOO LOOSE, and the current representation can easily be cut in half. Here are a few suggestions:

I bet that this optimization alone would reduce 200 MB to O(10) MB.

I understand that it will require changes to both the DBS server and the clients, but using JSON without thinking about the consequences is not optimal. We no longer have the luxury of wasting resources, and proper optimization should be in place.

If you agree, we can outline a proper format for (at least this) DBS API and start a campaign of enforcing the new data format.

belforte commented 5 years ago

@vkuznet clearly a leaner protocol will help. Maybe such a change can be kept inside the current DBS client API to avoid changes to WMA/CRAB? E.g. flat lists are surely efficient but error-prone when handed to naive code writers (like me), while a well-coded and validated method can take the verbose structure and compress it as best it can. Why not start with insertBulkBlock returning an error when it thinks the input is too large? It can then relax the limits once it is able to reduce the input to a more compact structure and evaluate that.
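A server-side size guard of the kind suggested here could be very small. A sketch, where the function, the default cap, and the error text are all hypothetical rather than actual DBS code:

```python
import json

def guarded_insert(body_bytes, max_bytes=64 * 1024 * 1024):
    """Reject oversized payloads before spending any time on JSON decoding.
    The 64 MB default is an illustrative cap, not a real DBS setting."""
    if len(body_bytes) > max_bytes:
        raise ValueError("payload too large: %d bytes" % len(body_bytes))
    return json.loads(body_bytes)

print(guarded_insert(b'{"a": 1}'))  # → {'a': 1}
```

The cheap length check happens before parsing, so an over-limit client gets an immediate, explicit error instead of a frontend timeout.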

OTOH I hope we can also make some progress on what exactly we need from DBS. Even if we could store 10M lumis for one block, do we really want to?

vkuznet commented 5 years ago

Stefano, a flat budget, reduced manpower and an increasing load due to more data give us no choice but to be efficient. We should not think about the convenience of "reading" our data in a human-friendly format, but rather concentrate on the efficiency of our system. I don't mind keeping the format changes isolated to the DBS server and client, but I think it is Yuyi's call.

And the decision on what should be stored in DBS is a parallel and independent issue from the data-flow optimization.

yuyiguo commented 5 years ago

I have a release on Monday. I will go over the discussion later.

bbockelm commented 5 years ago

Is it possible the issue is not the size of the lumi information but rather how we are loading it?

That is, any web server worth its salt should be able to easily handle a 200MB POST -- however, it's going to be extremely difficult to manage such a thing if all 200MB have to be buffered to memory at once! @vkuznet - does the frontend need to load the full POST before it can start proxying the request to the remote side?

A few thoughts:

  1. How many APIs (or API implementations) need "fixed"?
  2. Do we need to switch to a streaming JSON decoder/encoder? Is there a reason to render the whole structure in memory inside DBS?
  3. If we are treating the lumi information as opaque blobs, why not compress them and never fully decompress on the server side?

Looks like a little medicine in the implementation might be able to go a long way, especially with respect to a streaming JSON decoder.
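On thought 3, treating the lumi list as an opaque compressed blob is easy to prototype with the standard library. A sketch of the round trip (the data shape and the sizes are indicative only):

```python
import json
import zlib

# Treat the lumi list as an opaque blob: compress once on the client, ship and
# store the blob, decompress only where it is actually looked up (e.g. DAS).
lumis = [{"run": 1, "lumi": l} for l in range(100000)]
raw = json.dumps(lumis).encode()
blob = zlib.compress(raw, 9)
assert json.loads(zlib.decompress(blob)) == lumis  # lossless round trip
print(len(raw), len(blob))  # the blob is typically an order of magnitude smaller
```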

vkuznet commented 5 years ago

Brian, see my comments inline

Is it possible the issue is not the size of the lumi information but rather how we are loading it?

I didn't look explicitly into the DBS/CRAB APIs, but it seems to me the answer is yes, based on the current document structure: we send one whole JSON containing nested data structures, which causes the DBS memory blow-up.

That is, any web server worth its salt should be able to easily handle a 200MB POST -- however, it's going to be extremely difficult to manage such a thing if all 200MB have to be buffered to memory at once! @vkuznet - does the frontend need to load the full POST before it can start proxying the request to the remote side?

I doubt the frontend needs to load the full POST request; we can use chunked POST requests if necessary.
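A chunked POST is easy to produce client-side: passing a generator as the request body to a library such as requests sends it with Transfer-Encoding: chunked, so the client never materializes the full body. A sketch (the URL is illustrative only):

```python
import json

def iter_chunks(records):
    """Yield a JSON array piecewise; each record becomes its own chunk."""
    yield b"["
    for i, rec in enumerate(records):
        yield (b"," if i else b"") + json.dumps(rec).encode()
    yield b"]"

# requests.post("https://cmsweb.example/dbs/bulkblocks", data=iter_chunks(records))
body = b"".join(iter_chunks([{"run": 1, "lumi": 2}, {"run": 1, "lumi": 3}]))
print(json.loads(body))  # → [{'run': 1, 'lumi': 2}, {'run': 1, 'lumi': 3}]
```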

A few thoughts:

  1. How many APIs (or API implementations) need "fixed"?
  2. Do we need to switch to a streaming JSON decoder/encoder? Is there a reason to render the whole structure in memory inside DBS?

Yes, we need JSON streaming, and I think we have an implementation for that in WMCore, but to implement it on the DBS side we need to change the data format(s).

  3. If we are treating the lumi information as opaque blobs, why not compress them and never fully decompress on the server side?

It is a good suggestion, but we need to be careful here since users want to look up this information later, so decompression will happen in a different part of the system (e.g. DAS).

Looks like a little medicine in the implementation might be able to go a long way, especially with respect to a streaming JSON decoder.

yes

belforte commented 5 years ago

It already creates annoying memory problems in CRAB code when we build this JSON, which is why we still have to "publish in DBS" on an ad-hoc machine rather than as part of job post-processing on the schedd (and it is one of the reasons why CRAB does not try to put more than 100 files in one block).

belforte commented 5 years ago

See my previous question: why do we store this? To serve the list to users upon request, or to allow Oracle to find "all files for lumi X, run Y in the whole CMS sample"? We do not need to accommodate every silly user request, but we must be sure physics can still be done. It is not only the sending of this; I worry (maybe without reason) about the biggest table in DBS getting bigger and bigger.

vkuznet commented 5 years ago

Let me point you to a historical discussion I had with Lassi 8 years ago: https://hypernews.cern.ch/HyperNews/CMS/get/webInterfaces/753.html https://hypernews.cern.ch/HyperNews/CMS/get/webInterfaces/752.html

There he proposed, and we discussed, a compact JSON format (at the time it was relevant for PhEDEx and DAS). In particular, I looked up my emails and found the following (quoting):

The tests I've done in the past showed that 200 MB of PhEDEx data (the current JSON data structure of all blocks) requires > 1 GB of RAM for JSON parsing, while parsing the same data in XML can be done at a cost of 20 MB of RAM. The PhEDEx JSON representation is basically a list holding dicts (RAM grows due to the allocation of dicts in an open list). ...

At that time the PhEDEx JSON format was similar to what DBS uses now, i.e. JSON holding nested data structures. My measurements showed that the identical XML representation had 10 times less memory consumption; that's why Lassi proposed a "flat" JSON format suitable for streaming with a low memory footprint.

The jsonstreamer decorator I used in DAS: https://github.com/dmwm/DAS/blob/master/src/python/DAS/web/das_web_srv.py#L714 https://github.com/dmwm/DAS/blob/master/src/python/DAS/web/tools.py#L156 and it is available in WMCore: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/ReqMgr/Web/tools.py#L160

This code is based on the studies I did with Lassi, and it can be adapted to the DBS APIs.

vkuznet commented 5 years ago

Here is fully working example of jsonstreamer (save it as jsonstreamer.py):

#!/usr/bin/env python
import types  # needed for the GeneratorType check below
import json
from json import JSONEncoder

import cherrypy

def jsonstreamer(func):
    """JSON streamer decorator"""
    def wrapper (self, *args, **kwds):
        """Decorator wrapper"""
        cherrypy.response.headers['Content-Type'] = "application/json"
        func._cp_config = {'response.stream': True}
        data = func (self, *args, **kwds)
        yield '{"data": ['
        if  isinstance(data, dict):
            for chunk in JSONEncoder().iterencode(data):
                yield chunk
        elif  isinstance(data, list) or isinstance(data, types.GeneratorType):
            sep = ''
            for rec in data:
                if  sep:
                    yield sep
                for chunk in JSONEncoder().iterencode(rec):
                    yield chunk
                if  not sep:
                    sep = ', '
        else:
            msg = 'jsonstreamer, improper data type %s' % type(data)
            raise Exception(msg)
        yield ']}'
    return wrapper

@jsonstreamer
def test(data):
    return data

data = {"foo":1, "bla":[1,2,3,4,5]}
print('JSON dumps')
print(json.dumps(data))
print('JSON stream')
for chunk in test(data):
    print(chunk)

Now if you run it with python ./jsonstreamer.py you'll get the following output:

JSON dumps
{"foo": 1, "bla": [1, 2, 3, 4, 5]}
JSON stream
{"data": [
{
"foo"
:
1
,
"bla"
:
[1
, 2
, 3
, 4
, 5
]
}
]}

Now we only need to write the server side, which will read the chunks and then compose the JSON object.
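The receiving side can be sketched as follows. This buffering version holds the whole body before parsing; a true incremental parser (e.g. the ijson library) would avoid that, but the shape of the reassembly is the same. The function name is hypothetical:

```python
import json

def read_streamed_json(chunks):
    """Reassemble the jsonstreamer output ({"data": [...]}) from its chunks
    and return the decoded record list."""
    return json.loads("".join(chunks))["data"]

chunks = ['{"data": [', '{"foo": 1}', ']}']
print(read_streamed_json(chunks))  # → [{'foo': 1}]
```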

vkuznet commented 5 years ago

You can try it with a more sophisticated nested Python structure, e.g.

rdict = {"fl": [1, 2, 3], "name": "bla"}
data = {"foo": 1, "nested": [rdict for _ in range(10)]}
print('JSON dumps')
print(json.dumps(data))
print('JSON stream')
for chunk in test(data):
    print(chunk)

but I will not paste the output of this since it is rather big.

vkuznet commented 5 years ago

And now I have completed a full example; you can see it here: https://gist.github.com/vkuznet/e90b5a7cc92005df7d33877abde3206f

It provides the following:

If you run the code you'll get the following output:

JSON dumps
{"foo": 1, "bla": [1, 2, 3, 4, 5]}
size: 592
JSON stream
{"foo": 1, "bla": [1, 2, 3, 4, 5]}
size: 71
decoded output
{"foo": 1, "bla": [1, 2, 3, 4, 5]}
size: 908

So even in this basic example the original dict {"foo": 1, "bla": [1, 2, 3, 4, 5]} consumes 592 bytes, its JSON-stream representation consumes only 71 bytes, while the decoded object consumes 908 bytes. As you can see, the JSON stream is 8x smaller than the original object and 12x smaller than the decoded one. You may ask the relevant question of why the decoded object is larger than the original one; the answer is related to the way Python allocates memory (in short, it allocates more than necessary). Feel free to use more sophisticated/realistic DBS dicts to see the numbers.
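A simplified way to reproduce this kind of comparison (the gist linked above does it more thoroughly; exact byte counts depend on the Python build):

```python
import json
import sys

# sys.getsizeof measures shallow in-memory sizes, while the streamed form
# costs only the length of its serialized text.
data = {"foo": 1, "bla": [1, 2, 3, 4, 5]}
in_memory = sys.getsizeof(data) + sum(sys.getsizeof(v) for v in data.values())
streamed = len(json.dumps(data))
print(in_memory, streamed)  # the streamed size is far below the in-memory one
```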

vkuznet commented 4 years ago

The corresponding PR which provides support for different input formats can be found here: https://github.com/dmwm/DBS/pull/618