dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

MSUnmerged: Implement Zipped I/O streams #11036

Closed todor-ivanov closed 2 years ago

todor-ivanov commented 2 years ago

Impact of the new feature MSUnmerged

Is your feature request related to a problem? Please describe. In the current service construction we already use several APIs from external services. One on which we strongly depend is the Rucio Consistency Monitor (RucioConMon), for fetching unmerged file lists and RSE statistics. The lists provided by RucioConMon can grow significantly; a typical example is the case discussed in the comments of this issue. RucioConMon provides two APIs for downloading the list contents:

So far we only use the .json interface, but working with single-line JSON files of sizes on the order of gigabytes is quite difficult, resource consuming (memory and network), error prone, etc. Currently we do not benefit from compression during download, and we have no zipped I/O streams implemented that would let us properly load the contents into memory.

The feature request in this issue is to implement zip I/O streams in order to alleviate the impact of the huge list sizes, and to avoid hitting Kubernetes pods' resource limitations in the future.

NOTE: Since these are text archives, the compression ratio is high. We should expect a significant difference between the download size and the in-memory size if we load the whole contents into memory, even though we should plan to work in chunks of data all the time. Python provides good library support for this, such as shutil.copyfileobj(). As a note, here is a non-exhaustive list of libraries to consider during this development: io, zipfile, shutil, etc.
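As a sketch of the chunked approach mentioned above: shutil.copyfileobj() can stream-decompress a gzipped dump to disk in fixed-size chunks, so the full contents never need to fit in memory at once. The function name and chunk size here are illustrative, not part of any existing WMCore API.

```python
import gzip
import shutil


def decompress_to_disk(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a gzipped file to disk in fixed-size chunks.

    Only `chunk_size` bytes of decompressed data are held in memory
    at any time, regardless of the size of the dump.
    """
    with gzip.open(src_path, "rb") as fin, open(dst_path, "wb") as fout:
        shutil.copyfileobj(fin, fout, length=chunk_size)
```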

Describe the solution you'd like Implement the methods needed for downloading the compressed file to disk, as provided by the RucioConMon API with &format=raw . Then use the proper libraries to implement a zipped I/O stream for loading the contents into memory in chunks of configurable data size.
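One possible shape for the "chunks of configurable data size" part is a generator that reads the on-disk compressed dump in raw chunks and yields one LFN at a time. This is a hedged sketch: the function name is made up, and the one-path-per-line layout is an assumption based on the raw format described above.

```python
import gzip


def iter_lfns(path, chunk_size=64 * 1024):
    """Yield one LFN per line from a gzipped line-by-line dump.

    The compressed file is read in chunks of a configurable size, so
    memory usage stays bounded no matter how large the dump is.
    """
    buf = b""
    with gzip.open(path, "rb") as fin:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            lines = buf.split(b"\n")
            buf = lines.pop()  # keep any partial trailing line for next round
            for line in lines:
                if line:
                    yield line.decode("utf-8")
    if buf:  # last line may lack a trailing newline
        yield buf.decode("utf-8")
```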

Two placeholders for these methods already exist:

Describe alternatives you've considered Continue using the current json format, and hit some resource limitations from time to time.

Additional context None

todor-ivanov commented 2 years ago

FYI @vkuznet @amaltaro @klannon @ivmfnal

I was quite reluctant to create this issue, because we are struggling to complete this service development, but I cannot stop the tide. If we continue seeing errors related to the huge JSON files we are working with, this will become an important feature to implement. I am setting the Enhancement label on this and putting it in the Q2 developments.

amaltaro commented 2 years ago

Thanks for creating this issue, Todor. However, I see this more as a redesign of the service than an enhancement, since it's not going to affect only how we load data into memory, but also the whole internal logic of dealing with LFNs, mapping them to the base directory, checking what is locked or not, and so on.

Perhaps the best option to avoid this major effort right now would be to work on an isolated instance of the service (I guess you have it fully operational in your virtual environment, right?) and try to delete as many no-longer-needed files as we can, such that the next Rucio ConMon unmerged storage dump comes with a much smaller number of files. For instance, we could split that 800MB JSON dump into 4 or 5 pieces and load it from disk instead of making the Rucio ConMon GET call, where each cycle of the service would fetch a different piece. Or, if you write the script that you were considering, then we could use that instead of MSUnmerged itself. I would vote for whatever takes less effort here, of course, provided the risks of making a mistake are close to zero.

ivmfnal commented 2 years ago

To clarify, format=raw is in fact gzipped, not zipped. The difference is that gzip works in terms of files, while zip is good for stream compression. I think we should leave the "raw" format as is (a gzipped line-by-line list of paths) and add a third option, format=zip, which would be a zipped stream (see https://docs.python.org/3/library/zlib.html#module-zlib)
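Following up on the gzip-vs-zip distinction: zlib can also decompress a gzip payload incrementally, chunk by chunk, which suits a streamed download. A minimal sketch; the function name and chunking are illustrative, and `wbits=zlib.MAX_WBITS | 16` is the zlib flag that selects gzip-header framing.

```python
import zlib


def gunzip_stream(chunks):
    """Decompress an iterable of gzip-compressed byte chunks incrementally.

    wbits=zlib.MAX_WBITS | 16 tells zlib to expect a gzip header, so this
    works on the same kind of payload that format=raw serves, without ever
    holding the whole compressed or decompressed body in memory.
    """
    decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    for chunk in chunks:
        data = decomp.decompress(chunk)
        if data:
            yield data
    tail = decomp.flush()
    if tail:
        yield tail
```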

todor-ivanov commented 2 years ago

Thanks for taking a look @amaltaro !

However, I see this more like a redesign of the service than enhancement, since it's not going to affect only how we load data in memory, but also the whole internal logic of dealing with lfns, mapping them to the base directory, checking what is locked or not, and so on.

Actually, I was not seeing it as a major change. Absolutely nothing should be changed in the MSUnmerged.py file itself at this stage. All the changes need to be enclosed in the Service/RucioConMon.py file. We should just implement the zipped I/O stream so that we can download the zipped version of the file list instead of the .json one, and then feed it back to MSUnmerged the normal way. There is only one entry point at the micro service itself, which is here.
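A hedged sketch of how such a method enclosed in Service/RucioConMon.py might look. The name `download_raw_dump` and its `fetch` argument are hypothetical; the sketch assumes the format=raw response is a gzipped one-path-per-line dump exposed as a file-like object:

```python
import gzip
import shutil


def download_raw_dump(fetch, dump_path, chunk_size=1024 * 1024):
    """Spool a gzipped format=raw response to disk, then yield LFNs lazily.

    `fetch` is assumed to be any binary file-like object streaming the
    compressed response body. The compressed payload is written to disk
    unchanged, then read back one line at a time, so MSUnmerged can be
    fed the normal way without loading the whole list into memory.
    """
    with open(dump_path, "wb") as fout:
        shutil.copyfileobj(fetch, fout, length=chunk_size)
    with gzip.open(dump_path, "rt", encoding="utf-8") as fin:
        for line in fin:
            line = line.strip()
            if line:
                yield line
```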

todor-ivanov commented 2 years ago

Thanks @ivmfnal

To clarify, format=raw is in fact gzipped, not zipped.

That is an important clarification indeed. I missed that slight detail, but it must be taken into account when we dive deeper into the coding here.

todor-ivanov commented 2 years ago

And just to be even more precise about what I mean when I say "The whole change needs to be enclosed in the Service/RucioConMon.py":

For this error, the disaster begins not when we try to load the file into memory for the MSUnmerged object to deal with the data, but rather when the Service/RucioConMon.py module tries to decode the JSON file:

 File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/cms/reqmgr2ms/1.0.1.pre5/lib/python3.8/site-packages/WMCore/Services/RucioConMon/RucioConMon.py", line 66, in _getResult
    results = json.loads(results)
...
  File "/data/srv/HG2203c/sw/slc7_amd64_gcc630/external/python3/3.8.2-comp/lib/python3.8/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Unterminated string starting at: line 1 column 168804222 (char 168804221)

Which is basically a sign of a different issue. To me it looks more like a broken session and an incompletely downloaded JSON file, because both @ivmfnal and I did manage to download and decode the JSON file properly on private machines. The cache size for the Service module could also play a role here... and many other things.
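For illustration, a single-line JSON payload cut off mid-transfer reproduces exactly this class of decode error (the RSE name and paths below are made up):

```python
import json

# Simulate a download that drops the connection partway through a
# single-line JSON body, as the traceback above suggests happened.
payload = json.dumps({"rse": "T1_UK_RAL_Example",
                      "files": ["/store/unmerged/a", "/store/unmerged/b"]})
truncated = payload[:-5]  # cut off inside the last string literal

try:
    json.loads(truncated)
    decode_error = None
except json.JSONDecodeError as exc:
    decode_error = exc.msg

# An incomplete download of a single-line dump surfaces as a JSON decode
# error ("Unterminated string starting at: ..."), not as a transfer failure.
```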

ivmfnal commented 2 years ago

I wonder if there is some time-out on the client side, which causes premature download termination

todor-ivanov commented 2 years ago

yet another possibility indeed

vkuznet commented 2 years ago

Somehow I missed the announcement of this issue, but it is not too late for me to make a few remarks:

klannon commented 2 years ago

Quick comments:

vkuznet commented 2 years ago

The reason why using an optional argument is the wrong approach is the following:

ivmfnal commented 2 years ago

To clarify: currently the following 4 options are implemented:



vkuznet commented 2 years ago

@ivmfnal, I'm not sure I understand the implemented choices. My questions are:

Especially, it is unclear what format=raw means: from your description it can be a gzipped file (of which format?) or an uncompressed text stream. How is it possible that a single option provides different data formats?

What should be done is the following:

todor-ivanov commented 2 years ago

Thank you all @klannon @vkuznet @ivmfnal for the really good points you make here.

I am all for following the HTTP standard, so let's please do that if possible.

I believe we are all on the same page here. The only thing that is needed is to agree on the set of combinations between HTTP header types. I need to mention one thing, though; @vkuznet, what you suggest here:

That's it, the client asks for data in a specific format and encoding. Then, your server either acknowledges it or not. If it acknowledges it, it will set Content-Type: application/json (or other data format) and Content-Encoding: gzip (or other encoding), such that the client will know how to deal with the provided data. If your server can't accept the requested headers, it will provide data in whatever default format it uses and provide the proper HTTP headers.

is really a shortcut to a broader area, which is the content negotiation mechanism. I do agree with your suggestion. But just to stay on the safe side and not deviate from best practices, let's note two things:

P.S. - One more thing worth adding here. The WMCore caches actually identify the cache entry to be used based mostly on the URL, not including any header information; this comes from the way we build the name for the cache. This means that if we always use the same URL for all returned encodings, we may get unexpected behavior when a previous call to the server returned contents for a call using a different encoding type, if we do not have a way to set a preferred encoding among the multiple encodings supported in the client's request. Because the cache is just static content with an associated lifetime and nothing more, it will not follow the negotiation mechanism that it would with the server. Of course we can use the HTTP header already set in the cached reply, but we still need a mechanism to signal the top-level method (in this case that would be _getResultsZipped) whether the contents of this cache are what it expects. This is yet another point in favor of preserving different URIs for the different encoding types, which would strongly facilitate all of this. But I'd let @amaltaro correct me in case I am wrong here.
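To illustrate the cache-keying concern: including the requested encoding in the key keeps responses with different encodings from colliding in the cache. This is a hypothetical scheme for illustration only; WMCore's real cache naming differs, and the function name is made up.

```python
import hashlib


def cache_key(url, accept_encoding=""):
    """Derive a cache file name from the request URL plus the requested
    encoding (illustrative only, not WMCore's actual naming scheme).

    A key built from the URL alone would make a gzipped response and a
    plain one share the same cache entry; folding the encoding into the
    key gives each encoding its own entry.
    """
    raw = "{}|{}".format(url, accept_encoding).encode("utf-8")
    return hashlib.sha256(raw).hexdigest()
```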

ivmfnal commented 2 years ago

I do not see why we need to be too pedantic about following HTTP standards here. Proper HTTP headers and representation negotiation would be needed if we were developing a general-purpose server working with multiple uncontrolled client implementations. Clearly this is not the case here, and I do not see why we need to complicate the implementation on both sides.

Igor



vkuznet commented 2 years ago

@ivmfnal, how do you define a general-purpose server? In other words, any server which relies on the HTTP protocol is a general-purpose server, regardless of what it is doing.

ivmfnal commented 2 years ago

But not every server has to implement representation negotiation.



vkuznet commented 2 years ago

@todor-ivanov, your first bullet item does not require clients to stick to a single encoding. A client can choose one or another; therefore all of them have the freedom to request a specific encoding. The limitation comes from the server supporting them. Now, your pointer to Requests does not mean that the HTTP server will only be accessed through this class. We may change clients, or even stop using Python to access the HTTP server. As such, the server implementation should be independent of the choice of client and its settings.

For your second bullet item, I don't know the use case here. Why do you need multiple encodings and multiple data formats to be supported in clients? I don't really understand their purpose.

Please outline use-cases here. For example:

To me, this discussion is useless without concrete use cases. In DMWM land we use JSON everywhere as the data format, which can be either gzipped or zipped. So we need one data format and one encoding to compress the data. Unless I am missing something, I don't understand the point of this discussion and of multiple implementations for different formats/encodings.
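As a minimal illustration of that "one data format, one encoding" point, here is JSON gzipped and restored in memory (the RSE name and paths are made up):

```python
import gzip
import json

# One data format (JSON) and one encoding (gzip), round-tripped in memory.
records = {"rse": "T2_XX_Example",
           "files": ["/store/unmerged/f1", "/store/unmerged/f2"]}
compressed = gzip.compress(json.dumps(records).encode("utf-8"))
restored = json.loads(gzip.decompress(compressed).decode("utf-8"))
```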

todor-ivanov commented 2 years ago

Thanks @vkuznet

I myself do not want to go deep into general discussions here. As I already mentioned:

I do agree on your suggestion.

I was just making a note about a few cases we will not be able to cover, where we will not be following all best practices by the book, but that's fine here. I am not advocating that we should spend time implementing all of that right now. I still stand behind this:

The only thing that is needed is to agree on the set of combinations between HTTP header types.

Once we merge your PR for the zipped I/O in pycurl_manager I will immediately rebase on top of it and use it here.

todor-ivanov commented 2 years ago

Writing this here as a reminder: there is a change in RucioConMon that needs to be reverted once we are done with the current issue.

Last time, when we were having trouble with the large JSON file for RAL, we requested a change in RucioConMon: to filter out the /store/unmerged/logs/ branch. The change was done at their frontend, and it also provides the ability to filter some parts of the tree through the API calls [1], which is indeed good, but the filter we requested above is hardwired until we fix the issue of implementing the zipped I/O streams. Once we are done, we must request that this filter be dropped from RucioConMon and rely on our own mechanism to filter the input list. This way we will continue cleaning the /store/unmerged/logs branch as before. If at some point we decide to take advantage of the API extensions Igor has provided, that would be another story, to be tracked in a separate issue.

[1] https://github.com/dmwm/WMCore/issues/10982#issuecomment-1065340861