dmwm / DBS

CMS Dataset Bookkeeping Service

DBS throttling #607

Open yuyiguo opened 5 years ago

yuyiguo commented 5 years ago

@vkuznet @bbockelm @amaltaro @belforte @h4d4 Hi everyone, I tested two versions of throttling while working with Valentin. V1 is ported from CRAB; V2 is a time-based approach. Valentin made a PR for both versions in [2]. I used the hey tool written in Go that Valentin provided to create concurrent access to the DBS server on my VM. The DBS server had 30 threads. I used the DBS datatiers API with a sleep of 0 to 3 seconds inside the API to simulate calls of varying cost. Here is an example:

/afs/cern.ch/user/v/valya/public/hey_linux -n 1000 -c 200 -U ./url.txt

url.txt is a file that lists the URLs to call, in this case https://dbs3-test2.cern.ch/dbs/dev/global/DBSReader/datatiers?data_tier_name=LHE. This command sends 200 concurrent datatiers API calls to DBS, for a total of 1000 calls. After the run the hey tool reports statistics; please see v2-t1-l100-n1000-c200.txt for the output.
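
For reference, the artificial delay inside the test API was injected roughly like this (a sketch of the test-only change; the function name and payload are made up, not the actual DBS code):

```python
import random
import time

def datatiers_with_delay(data_tier_name=""):
    """Test-only stand-in for the datatiers API handler."""
    # sleep between 0 and 3 seconds to mimic calls of varying cost
    time.sleep(random.uniform(0.0, 3.0))
    # a real handler would query the database here; return a dummy payload
    return [{"data_tier_name": data_tier_name or "LHE"}]
```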

I tested different combinations; please see [1] for the results. "limit" is the number of concurrent calls and "trange" is the time range in V2.

[1] https://docs.google.com/spreadsheets/d/1pS0EHMglx67a-ktowDBiLBm1YbFZbSG8130iNbLHkrc/edit?usp=sharing [2] https://github.com/dmwm/WMCore/pull/9211#issuecomment-495738944

yuyiguo commented 5 years ago

Please feel free to ask questions or make comments about the tests. Any suggestion on what limit we should place on DBS is welcome! The tests were run on my VM, so the numbers will change when we move to int or prod. I will run more tests on int next week, but we already have an idea of how the limits/throttling affect the output. BTW, this [PR]( dmwm/WMCore#9211 (comment)) changed the 500 error code to 429, which is what it should be. I did not update my local copy while doing the tests, so the 500 in the table is a placeholder for 429, i.e. "too many requests".

amaltaro commented 5 years ago

Hi Yuyi,

I have a few observations/questions to make, based on the gdoc results:

  • Where does the client error distribution come from? E.g. the one at line 10, is that 135 + 188 client errors out of the 568 500 responses?

  • What's the unit of trange? secs or min?

  • What happens when the server crashes? Can you still ping or ssh to it, or do you need to reboot the node via the openstack interface? I'm curious to know whether it's a limitation of a very small private VM or something else nuking and breaking the machine.

I'm not sure what to conclude out of those results, besides saying that V2 denies many more requests.

I see you requested a new DBS deployment to testbed. Which throttling version is it using?

I'm very concerned about limiting DBS request concurrency to only 3 (limit=3), the reason being that we share the same service certificate among many WMAgent nodes (~10 nodes). Another case like that is the CMSWEB services calling DBS: those calls will all be made with the DMWM service certificate, even though they might originate from different services.

Ideally speaking, I think we should have a short list of services that deserve a higher limit (like any CMS service used in production), though of course we still want to protect DBS from those; then a very small limit for anything else, like grid jobs or anyone creating a script to monitor/analyze DBS data somehow. I know this means more code maintenance and is not the best for performance, but it might be considered in the near future in order to keep production activity smooth.

h4d4 commented 5 years ago

Hello Yuyi,

I have a comment regarding the environment in which the tests were executed. If those requests against DBS follow the frontend-backend workflow, then metrics such as 'Requests/sec', 'Response time', 'Latency', etc. have a high impact on the analysis of this test.

Please keep in mind that a VM has the frontend and the backend on the same node, while, for example, in cmsweb-testbed the frontends and backends are on different machines. Therefore, if the same test were run in the testbed, the values for those metrics would probably increase.

Best Regards, Lina.

vkuznet commented 5 years ago

Alan, let me answer a few questions inline:

On 0, Alan Malta Rodrigues notifications@github.com wrote:

Hi Yuyi,

I have a few observations/questions to make, based on the gdoc results:

  • Where does the client error distribution come from? E.g. the one at line 10, is that 135 + 188 client errors out of the 568 500 responses?

The stats come from the hey tool itself. It works by generating concurrent requests and collecting the results. If a request returns 200 OK it is used for the breakdown metrics, e.g. req/sec, while if the response fails to come back or there is an error it is counted. The counters are per error type. Therefore, in the table, 135 requests timed out, while 188 never happened because the connection was not established (again because the underlying server was not responding).

  • what's the unit of trange? secs or min?

it is in seconds

  • What happens when the server crashes? Can you still ping or ssh to it, or do you need to reboot the node via the openstack interface? I'm curious to know whether it's a limitation of a very small private VM or something else nuking and breaking the machine.

Yuyi should tell us exactly what happened.

I'm not sure what to conclude out of those results, besides saying that V2 denies much more requests.

I see you requested a new DBS deployment to testbed. Which throttling version is it using?

I'm very concerned about limiting DBS requests concurrency to only 3 (limit=3), reason being that we share the same service certificate among many WMAgent nodes (~10 nodes). Another case like that are the CMSWEB services calling DBS, those will all be made with the DMWM service certificate, even though their source might be from different services.

I agree with that, and for that we can introduce a whitelist of users/DNs. If everybody agrees I can add additional code for that (it should be trivial).
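
Something like this is what I have in mind (a rough sketch with a made-up DN and limits, not actual WMCore code):

```python
# DNs of trusted production services get a higher concurrency limit;
# anything else falls back to a small default. A blacklist (limit 0)
# could be handled the same way.
TRUSTED_DNS = {
    "/DC=ch/DC=cern/OU=computers/CN=some-production-service": 50,
}
DEFAULT_LIMIT = 3

def limit_for(dn):
    """Return the throttling limit to apply for a given client DN."""
    return TRUSTED_DNS.get(dn, DEFAULT_LIMIT)
```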

Ideally speaking, I think we should have a short list of services that deserve a higher limit (like any CMS service used in production), of course we still want to protect DBS from those; then a very small limit for anything else, like grid jobs or anyone else creating a script to monitor/analyze DBS data somehow. I know this gives more code maintenance and not the best performance, but it might be considered in the near future in order to provide a smooth production activity.


yuyiguo commented 5 years ago

@amaltaro @h4d4 @vkuznet Let me answer the questions in the order they were asked. Regarding the table at line 10: this is for the test "hey -n 1000 -c 200" with limit=3, which ran 200 concurrent requests with a total of 1000 requests, using version 1 of the throttling code. So we expect hey to give us the status of all 1000 calls made. This test is listed in line 8, column B:

[200]        30 responses
[500]        568 responses
[502]        79 responses
server dead, machines crashed

30 calls ended with code 200, 568 calls ended with 500, and so on. So the total number of calls that received a response was 30+568+79=677; line 10 then shows that 135+188=323 calls timed out. In the end all 323+677=1000 calls were accounted for.

When the server and node crashed, my ssh session to the node was logged out automatically and I had to wait a long time for the node to recover itself before I could ssh back in. Sometimes I just went to OpenStack to hard-restart the node; a soft restart would not work. A hard restart was a bit quicker than waiting for self-recovery. That's why these tests took a long time.

The current deployment on cmsweb-testbed was v1.

I agree with the whitelist. I also propose that we have a blacklist, so we will be in better control of the system. Valentin, please add the additional code.

I also agree with Lina on the performance issue. I will test on cmsweb-testbed today. Yuyi

yuyiguo commented 5 years ago

@amaltaro @h4d4 @vkuznet @belforte @bbockelm

Please see the test results on cmsweb-testbed.

The tests were for V1. Columns B and C used identical test conditions and exercised the datasets and datatiers APIs. Columns D and E used identical test conditions and exercised the datasets, datatiers, filelumis (with lfn and dataset) and filesummaries APIs.

As you can see, the performance on cmsweb-testbed is much better than on my VM: I no longer got any dead servers or crashed nodes, and everything up to row 9 executed 100%. One thing I did not quite understand is that the mixed APIs performed better than the simple datasets and datatiers calls, which are supposed to run faster.

yuyiguo commented 5 years ago

I added more tests, shown in 9-FG, 10-FG and 10-HIJ.
9-FG and 10-FG each ran two "hey" commands at the same time; 10-HIJ ran three.

bbockelm commented 5 years ago

@yuyiguo - looking over the Google Docs, I see:

  1. When the concurrency in the client (hey) is set to N and the concurrency limit in the server is set to M, for throttling V1, if N>M, errors start to appear. This is desired and the point of the throttle, correct?

    • The exceptions for throttling V1 are when N=20 and M=20 (where 8% of queries got an error) as well as when M=3 and N=5 or 10.
  2. For throttling V2, I'm having a harder time unraveling the results. The performance on the low end (M=3) looks better but, for example, when the limit was set to 20 (M=20), there were no errors when the client used 100 parallel threads (N=200).

Is this the correct reading of the spreadsheet?

vkuznet commented 5 years ago

Brian,

On 0, Brian P Bockelman notifications@github.com wrote:

@yuyiguo - looking over the Google Docs, I see:

  1. When the concurrency in the client (hey) is set to N and the concurrency limit in the server is set to M, for throttling V1, if N>M, errors start to appear. This is desired and the point of the throttle, correct?

    • The exceptions are when N=20 and M=20 (where 8% of queries got an error) as well as when M=3 and N=5 or 10.

This seems to me to be a correct observation.

  1. For throttling V2, I'm having a harder time unraveling the results. The performance on the low end (M=3) looks better but, for example, when the limit was set to 20 (M=20), there were no errors when the client used 100 parallel threads (N=200).

Here, and more generally, the reading is more complex. We should not reason only in terms of N and M, but also take into account the number of threads in the DBS server and how it affects the throttling in both implementations.

So, when a client makes N concurrent calls, the throttling throughput should be defined with respect to M times the total number of DBS server threads, since the N concurrent calls will hit the entire DBS server. If this argument makes sense, a higher limit M times the number of DBS server threads does not trigger the throttling threshold in V2. This is because V2 counts the number of calls within a time range, and the counter is not blocked by API execution. Therefore, when we increase the time interval while keeping the same limit, we see an increase in total rejections.

In the V1 case the counter is constrained by API execution time; therefore a higher number of concurrent client calls leads to a higher number of total rejections.

Is this the correct reading of the spreadsheet?


bbockelm commented 5 years ago

Hi Valentin,

I am still having trouble understanding the V2 approach. As I mentioned in PR #9211, with the current logic and trange>1, it appears perfectly acceptable for a single client to block all threads for the entire server in perpetuity.

Why do we care about limiting the number of queries in a certain time range? If the user figures out a way to express their complex logic with a number of short, lightweight DBS queries -- great! The current logic somewhat encourages the user to have a small number of heavy queries that block the server.

Another way to look at it -- if a single query lasts for more than 1 second (if trange=1), then it is "free" in the V2 model (the counter is reset once a second) but very costly to the user in the V1 model. If trange=1, then 15 queries that each last 5 minutes are acceptable in V2, assuming they are done more than 1 second apart. In the V1 model, if limit = 5, the 5 minute queries should be halted after 5.

The operating theory behind V1 is that the most precious resource is a server thread, so it limits the number of server threads a single identity (DN) can occupy at once. This reduces the ability of a single actor to consume all the resources.

Brian

belforte commented 5 years ago

IMHO this is an operational decision, not a code one. Valentin found it easy/quick enough to offer the possibility of choosing between two different policies. While the use case for V1 is clear for things like protecting against too many threads in Unified, it is not immediately obvious that we want or need V2; the target use case there is "people accessing a central server from grid jobs", which we may want to discourage even if, at this very moment, the system could stand them.

yuyiguo commented 5 years ago

@bbockelm Brian, your observation regarding the client (hey) concurrency N and the server concurrency limit M was correct for the initial tests on my VM in sheet1. However, if we look at sheet2, which was done on cmsweb-testbed with two nodes, all calls succeeded when M=3 and N=5/10/20/50 with a total of 100 calls. Then, when the client concurrency (N=5/20/50) and the concurrency limit (M=3) were kept the same but the total number of calls increased, 1% of the calls were blocked. Here I saw the time factor start to play a role.

Lina just deployed DBS with M=10, I am testing with it now.

vkuznet commented 5 years ago

Brian, the V1 approach limits users on heavy APIs: as you said, if user A sends 5 queries that take 5 minutes each, they will be blocked, but if user B sends 10k short-lived queries they will be fine. But what if user B sends 10k-100k queries, occupies the whole DBS queue, and blocks other users during that time? The V1 approach does not care about that: the server queue will be filled by a single user, and until their queries are cleared up others will have no access to the server.
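
Schematically, the V1 (concurrency-based) idea is something like this (a simplified sketch, not the actual code ported from CRAB):

```python
import threading

class ConcurrencyThrottle:
    """V1-style: cap the number of requests a single DN has in flight."""

    def __init__(self, limit=3):
        self.limit = limit
        self.active = {}              # DN -> requests currently executing
        self.lock = threading.Lock()

    def acquire(self, dn):
        """Call at request entry; False means answer 429 Too Many Requests."""
        with self.lock:
            if self.active.get(dn, 0) >= self.limit:
                return False
            self.active[dn] = self.active.get(dn, 0) + 1
            return True

    def release(self, dn):
        """Call when the API call finishes (only after a successful acquire)."""
        with self.lock:
            self.active[dn] -= 1
```

The counter only goes down when the API call finishes, so long-running queries keep counting against the limit.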

The V2 approach limits users by the number of calls they make within a time range of N seconds (I don't think we should stick with 1 sec per se, though). It allows us to spot that user B sent too many queries in a given time interval, and we then want to reduce his/her capacity in order to give other users access to DBS. Please note that I may be wrong in the implementation, but that was the idea.
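
And the V2 (time-based) idea, again only as a sketch:

```python
import time

class TimeWindowThrottle:
    """V2-style: cap the number of calls a DN may start within trange seconds."""

    def __init__(self, limit=100, trange=1):
        self.limit = limit
        self.trange = trange
        self.counters = {}    # DN -> (window start time, calls seen in window)

    def allow(self, dn):
        """Call at request entry; False means answer 429 Too Many Requests."""
        now = time.time()
        start, count = self.counters.get(dn, (now, 0))
        if now - start >= self.trange:
            start, count = now, 0     # window expired, reset the counter
        if count >= self.limit:
            return False
        self.counters[dn] = (start, count + 1)
        return True
```

Here the counter is incremented when the call arrives and is never tied to how long the call runs, which is the difference between the two schemes.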

I think this outlines two different approaches to user access patterns and to the sustainability of the server under those conditions. Which one we use is our choice.

But what we need is to find, through benchmark tests, which parameters to use to address (probably) both situations, since we see both in DBS: we have users whose queries consume a lot of time in DBS, and we have users who spread their queries across many clients, hit DBS all at once, and occupy all server capacity.

Valentin


yuyiguo commented 5 years ago

@vkuznet Hi Valentin,

There are two kinds of timeout in the hey tool. One is when hey times out waiting for the server to send back the result after submission, and the other is when hey times out because it cannot submit the request. Do I understand that correctly? If so, could you let me know how long each timeout is? Are the values adjustable on the command line? Yuyi

yuyiguo commented 5 years ago

I tested with the throttling limit=10 on cmsweb-testbed. Please see sheet3.

The results show that the throttling was not triggered with 250 concurrent clients and a total of 4000 API calls; 11 of these 4000 calls timed out. I am thinking of setting the concurrency limit=20 for next Tuesday's production release. We have 5 production nodes and each has 30 threads: 30 threads x 20/thread = 600 concurrent calls/user/node. So 20 should be big enough not to hit the limit?

I checked that the top caller was Dynamo, which made 111.1K calls/hour at 1:00 on May 14. I am not sure how the 111k calls were distributed within the hour. If it hit the throttling limit, it would crash DBS anyway.

bbockelm commented 5 years ago

But what if user B sends 10k-100k queries, occupies the whole DBS queue, and blocks other users during that time?

What queue are they blocking? If you're worried about the accept queue -- that's at the TCP level and before any of the throttling is applied, hence the users can block the queue with or without the throttling (especially if it's from grid jobs and they aren't noticing the failures!).

I am not worried at all about 100k queries if they are doing them from a single thread. I'm very, very worried if they are doing 10k queries from 1,000 worker nodes!

The results show that the throttling was not triggered with 250 concurrent clients and a total of 4000 API calls.

This is a strange result. Potentially the frontend cannot process quickly enough to load up the backend?

We have 5 production nodes and each has 30 threads: 30 threads x 20/thread = 600 concurrent calls/user/node.

No. 5 production nodes x 30 threads is 150 threads total. That means up to 150 concurrent calls total. Given each user can have at most 20 active calls per backend, that means one user can have up to 100 concurrent calls. So, 2 very badly behaved users can block the cluster.

Anyhow, I think 20 is a fine setting, assuming it is a "V1"-style throttle. There is some room for us to decrease it as needed, but it's probably best to be a bit more forgiving and only decrease it if experience shows it is necessary. This would have protected the cluster from the issue with the migration agent, for example.

vkuznet commented 5 years ago

Yuyi, you're correct. The first one is the timeout hey waits for the server to send back the results; the second is when hey can't establish a connection because the server is busy.

According to the hey code it is controlled by

  -t  Timeout for each request in seconds. Default is 20, use 0 for infinite.

I don't see any other timeouts in the code.

Best, Valentin.


yuyiguo commented 5 years ago

Hi Valentin,

  -t  Timeout for each request in seconds. Default is 20, use 0 for infinite.

Could you update the code to control the two timeouts separately? I'd like hey to wait 300 seconds for the results to come back from the server, but waiting 300 seconds to send a request does not seem like a realistic test. Thanks, Yuyi

yuyiguo commented 5 years ago

Currently we are using the V1-style throttling. Thanks, Brian, for explaining the thread counting. Lina, I will update the deployment to set the throttling limit to 20.

We have a powerful 5th node in DBS, but it is used the same way as the others. Is it possible to increase its threads from 30 to 40 and then add it twice in the FE round-robin list of nodes? I am not sure exactly how the FE assigns requests to the BEs; this is just an idea to get our resources fully used.

yuyiguo commented 5 years ago

I retested "250 concurrent clients and total 4000 API calls" against cmsweb-testbed that had a throttling limit=10. The result confirmed the previous test with even better result as in the line 16 of sheet3 . Attached is the output of the hey and it shows that every call got result back except one was timeout. [Uploading v1-n4000-c250.txt…]( hey results from -n4000 -c 250)

vkuznet commented 5 years ago

We have a powerful 5th node in DBS, but it is used the same way as the others. Is it possible to increase its threads from 30 to 40 and then add it twice in the FE round-robin list of nodes? I am not sure exactly how the FE assigns requests to the BEs; this is just an idea to get our resources fully used.

Yuyi, changing the round-robin in the FE is not that easy. It requires changing the FE cmshosts.pm module, see https://github.com/dmwm/deployment/blob/master/frontend/cmshosts.pm#L242-L249, and assigning more weight to a particular backend would require significant code changes, e.g. we would need to introduce weights and then use them in the code. Changes would then be required to the entire deployment procedure, which depends on the prod/pre-prod/dev instances, etc.

I don't think we need it desperately now; I would rather wait until we migrate to k8s, where dynamic pod provisioning will be possible.


vkuznet commented 5 years ago

Yuyi, since I didn't write the code in the first place (I only adapted it to use X509 proxies), it would require carefully studying the entire structure. I can do it eventually, but I can't guarantee that it will be soon, since I have higher-priority tasks through the summer. Valentin.


yuyiguo commented 5 years ago

Thanks Valentin for explaining the FE code. Yes, we can wait for k8s. Yuyi


amaltaro commented 5 years ago

Keep in mind that Lina is still commissioning 3 other general-purpose backends, which will bring us to 4 GP backends with roughly twice the resources of the previous backends (which will be retired once the transition is over).

yuyiguo commented 5 years ago

Great news that Lina is already preparing 3 more powerful general-purpose backends. We had not planned on retiring the current 4 general-purpose backends used by DBS and the other services, so are those machines still within their lifespan and warranty? Lina has already successfully put the first new general-purpose machine into the backends. My question is: do we have to retire the four old BE machines? If not, could we keep the other services on those BEs and just move DBS onto the 3 new BEs, so the others are not exposed to DBS storms? Would moving only DBS cause any technical difficulty?

vkuznet commented 5 years ago

Yuyi, my understanding is that the new general-purpose backends will be provided/used for all services, and once the new BEs are deployed we will give the old ones back. Please understand that all new machines come from the CMS quota; we do not get them for free.

Therefore, I doubt your questions apply. The new general-purpose backends will be used for all services, but yes, they will be more powerful nodes.

Valentin.


belforte commented 5 years ago

Those are not physical machines, but allocations from a quota of virtual machines that "stay" while the underlying hardware comes and goes as it ages. Whether DBS and the other backends can live on the same VMs or really need to be moved to separate VM instances is a question that should be answered without a specific hardware configuration in mind. Then Lina and Caio can provision the resources. Yet it seems that after Unified was tamed and the SQLAlchemy version problem understood, we are in the same good shape as in the last N years. So this may be an item for the k8s setup; I do not know if the "VM" concept is still useful there.