DIRACGrid / diracx

The neXt DIRAC incarnation
GNU General Public License v3.0
8 stars 20 forks source link

Diracx demo crashes #274

Open martynia opened 3 months ago

martynia commented 3 months ago

Hello, The demo crashes w/o any interactions from my client. A freshly cloned demo:

2024-07-09 08:09:37,364 - ERROR: Traceback (most recent call last): File "/opt/conda/lib/python3.11/site-packages/starlette/routing.py", line 732, in lifespan async with self.lifespan_context(app) as maybe_state: File "/opt/conda/lib/python3.11/contextlib.py", line 210, in aenter return await anext(self.gen) ^^^^^^^^^^^^^^^^^^^^^ File "/diracx_source/diracx/diracx-routers/src/diracx/routers/fastapi_classes.py", line 35, in lifespan await asyncio.gather( File "/opt/conda/lib/python3.11/contextlib.py", line 650, in enter_async_context result = await _enter(cm) ^^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.11/contextlib.py", line 210, in aenter return await anext(self.gen) ^^^^^^^^^^^^^^^^^^^^^ File "/diracx_source/diracx/diracx-routers/src/diracx/routers/job_manager/sandboxes.py", line 62, in lifetime_function if not await s3_bucket_exists(self._client, self.bucket_name): ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/diracx_source/diracx/diracx-core/src/diracx/core/s3.py", line 29, in s3_bucket_exists return await _s3_exists(s3_client.head_bucket, Bucket=bucket_name) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/diracx_source/diracx/diracx-core/src/diracx/core/s3.py", line 39, in _s3_exists await method(**kwargs) File "/opt/conda/lib/python3.11/site-packages/aiobotocore/client.py", line 411, in _make_api_call raise error_class(parsed_response, operation_name) botocore.exceptions.ClientError: An error occurred (403) when calling the HeadBucket operation: Forbidden

martynia commented 2 months ago

@chrisburr @chaen.

chaen commented 2 months ago

It's expected upon creation, the pod should redeploy itself and everything should be fine then

martynia commented 2 months ago

OK, but shouldn't the success be visible in the logs ? kubectl get pods shows the demo pod ready 1/1 But the logs: INFO: Stopping reloader process [1]

How long do I have to wait for redeployments ? 10 minutes have passed now, and the logs don't show any sign of life. I'm pretty sure I waited longer that 10 minutes with no success.

martynia commented 2 months ago

I think it is properly dead. I has never restarted after 15:07 GMT

chaen commented 2 months ago

what does kubectl get pods show ?

martynia commented 2 months ago
martynia@Januszs-Laptop Dirac % kubectl get pods                               
NAME                                                   READY   STATUS    RESTARTS      AGE
diracx-demo-76dbcdd78f-hc8fj                           1/1     Running   1 (20h ago)   20h
diracx-demo-cert-manager-7469966f7c-2qrd5              1/1     Running   0             20h
diracx-demo-cert-manager-cainjector-6db6d7d979-87hw6   1/1     Running   0             20h
diracx-demo-cert-manager-webhook-c57d6d6b7-x6spc       1/1     Running   0             20h
diracx-demo-cli-5ff5cfd9bd-dwdmq                       1/1     Running   0             20h
diracx-demo-dex-8ccfd8564-xcqm7                        1/1     Running   0             20h
diracx-demo-minio-5674bcbd44-z4s6g                     1/1     Running   0             20h
diracx-demo-mysql-0                                    1/1     Running   0             20h
diracx-demo-rabbitmq-0                                 1/1     Running   0             20h
diracx-demo-web-6bfdcb8578-fq74k                       1/1     Running   0             20h
elasticsearch-master-0                                 1/1     Running   1 (18h ago)   20h
opensearch-cluster-master-0                            1/1     Running   0             20h
chaen commented 2 months ago

(don't forget triple backtick :-)

To me it looks fine. What is the problem ? Everything is running

martynia commented 2 months ago

OK, I added an access policy to my router. As I understand from the docs the policy is activated by adding a line: PilotLogsAccessPolicy = "diracx.routers.pilot_logging.access_policies:PilotLogsAccessPolicy" into pyproject.toml, this is where I spotted a sandbox policy, so I followed. I', assuming that if I comment that line in the toml file above I deactivate the policy.

Now I'm getting 500 on the client side, w/o any json doc I could decode. The server does not print anything. So it is a blind walk. I managed to have 1 successfull response in the past, followed by 500 at the second attempt. That's why I was concerned about the error message from the server, which looked like dead. Any hints how I could trace that 500 on the server side ?

chaen commented 2 months ago

any change in pyproject.toml requires a reinstall of the package, so it is not really suited for run_demo.sh (you could rerun the deployment, but if you are unsure about what you are doing with kubernetes, better just restart the demo, as changing pyproject.toml while coding is relatively rare.

That being said, if you see error 500, there should be something in the logs.

kubectl logs diracx-demo-76dbcdd78f-hc8fj does not give anything ?

martynia commented 2 months ago

It's a different pod now. It just gives the last message when it died (this is a tail of it, mind the time):

2024-07-10 12:08:27,113 - ERROR:    Application startup failed. Exiting.
INFO:     Stopping reloader process [1]
chaen commented 2 months ago

the --previous should give you logs of the previous pod.

martynia commented 2 months ago

Confusions here. The current pod is with my pilot logger service, which used to work in Lyon. Now I'm not sure if my policy is active or not, is the change in the tool file taken into account ?

My client is not suitable for a fresh demo.

martynia commented 2 months ago

Could we have a chat on zoom, it will be quicker.

chaen commented 2 months ago

Confusions here. The current pod is with my pilot logger service, which used to work in Lyon. Now I'm not sure if my policy is active or not, is the change in the tool file taken into account ? My client is not suitable for a fresh demo.

I am awfully confused by this statement

Could we have a chat on zoom, it will be quicker.

Connect on mattermost

martynia commented 2 months ago

Can't do lattermost, my CERN account has been revoked.

martynia commented 2 months ago

you say: but if you are unsure about what you are doing with kubernetes, better just restart the demo, as changing pyproject.toml while coding is relatively rare.

I suspect that adding a policy could be a problem. If I restart the demo with a different .toml, will it be taken into account ?

chaen commented 2 months ago

Can't do lattermost, my CERN account has been revoked. You'll need that eventually ..

If I restart the demo with a different .toml, will it be taken into account ?

Restarting the demo will indeed reinstall diracx, so take the new pyproject.toml into account.

Alternatively, I think that kubectl rollout restart deployment diracx should also do and will be faster

martynia commented 2 months ago

Success!, but only once. The second run returns 500 and no logs at the server side.

/Users/martynia/mambaforge/envs/diracx-dev/bin/python /Users/martynia/Dirac/FastApiTutorial/logger_client.py

Returned message: b'{"pilot_stamp":"78f39a90-2073-11ec-98d7-b496913c0cf4","lines":[{"line_no":0,"line":"2022-02-23 13:48:35.123456 UTC DEBUG [PilotParams] JSON file loaded: pilot.json"},{"line_no":1,"line":"2022-02-23 13:48:36.123456 UTC DEBUG [PilotParams] JSON file analysed: pilot.json"}],"vo":"diracPilot"}' {'pilot_stamp': '78f39a90-2073-11ec-98d7-b496913c0cf4', 'lines': [{'line_no': 0, 'line': '2022-02-23 13:48:35.123456 UTC DEBUG [PilotParams] JSON file loaded: pilot.json'}, {'line_no': 1, 'line': '2022-02-23 13:48:36.123456 UTC DEBUG [PilotParams] JSON file analysed: pilot.json'}], 'vo': 'diracPilot'}

Process finished with exit code 0

martynia commented 2 months ago

But as I said before, sometimes 500 comes at the first attempt.

martynia commented 2 months ago

The router has some logs at the warning level, but nothing came out even when there was a successful connection .

chaen commented 2 months ago

I am afraid it is going to be very difficult to help you debug like that. You'll need to add a lot of printout to see where things crash.

martynia commented 2 months ago

The problem is that there is no logs at the server side. My token expired, I have 401 now (expected), rather than 500, but the server logs show: 2024-07-10 14:50:01,147 - ERROR: Application startup failed. Exiting. INFO: Stopping reloader process [1]

And this is a message from yesterday. Are you not surprised that there are no newer messages at all ?

chaen commented 2 months ago

very much yes. Please get back your CERN account so we can do a bit of interactive debugging on mattermost and/or zoom

martynia commented 2 months ago

I sent you an email, hopefully to the correct account.

martynia commented 2 months ago

Getting the account back could take a week. Details in the email.

chaen commented 2 months ago

So, my findings:

With that, you should be good to go

martynia commented 2 months ago

Hi,

As I said in my mail a few minutes ago, I “fixed” the os bug, and everything is fine.

According to the docs, if data is not None, the method is POST. It worked fine, I definitely have a few records in the os DB.

My policy only allows pilots and operator to post. How do I add an Operator property to DiracAdmin ?

Thanks a lot for clarifying the situation.

Cheers, JM

On 12 Jul 2024, at 17:54, chaen @.***> wrote:

So, my findings:

the client code you sent me was doing a GET, while your router expects a POST, obviously that can't work. You reported that while doing other atemps, it worked once and then failed. You're right, that's a bug (fixed in #276 https://github.com/DIRACGrid/diracx/pull/276) The fact that the application crashes is due to that #275 https://github.com/DIRACGrid/diracx/issues/275 You are sending invalid parameters, so you should get a proper error telling you that and it should be nicely logged, but the check_permissions wrapper interrupts that and commits a suicide (it's the whole idea of it). Except that in this case, it dies too early, before having time to tell you what's wrong. What you can do for now is to comment the following line: https://github.com/DIRACGrid/diracx/blob/2e180849611f4628d4fdaf2a663e1c7b5be94b2d/diracx-routers/src/diracx/routers/access_policies.py#L133 With that, you should be good to go

— Reply to this email directly, view it on GitHub https://github.com/DIRACGrid/diracx/issues/274#issuecomment-2225873888, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB2KE7JEVWJI5GPZU3JENITZL734BAVCNFSM6AAAAABKSKY2VCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMRVHA3TGOBYHA. You are receiving this because you authored the thread.