Open martynia opened 3 months ago
@chrisburr @chaen.
It's expected upon creation, the pod should redeploy itself and everything should be fine then
OK, but shouldn't the success be visible in the logs ? kubectl get pods
shows the demo pod ready 1/1
But the logs:
INFO: Stopping reloader process [1]
How long do I have to wait for redeployments ? 10 minutes have passed now, and the logs don't show any sign of life. I'm pretty sure I waited longer that 10 minutes with no success.
I think it is properly dead. I has never restarted after 15:07 GMT
what does kubectl get pods
show ?
martynia@Januszs-Laptop Dirac % kubectl get pods
NAME READY STATUS RESTARTS AGE
diracx-demo-76dbcdd78f-hc8fj 1/1 Running 1 (20h ago) 20h
diracx-demo-cert-manager-7469966f7c-2qrd5 1/1 Running 0 20h
diracx-demo-cert-manager-cainjector-6db6d7d979-87hw6 1/1 Running 0 20h
diracx-demo-cert-manager-webhook-c57d6d6b7-x6spc 1/1 Running 0 20h
diracx-demo-cli-5ff5cfd9bd-dwdmq 1/1 Running 0 20h
diracx-demo-dex-8ccfd8564-xcqm7 1/1 Running 0 20h
diracx-demo-minio-5674bcbd44-z4s6g 1/1 Running 0 20h
diracx-demo-mysql-0 1/1 Running 0 20h
diracx-demo-rabbitmq-0 1/1 Running 0 20h
diracx-demo-web-6bfdcb8578-fq74k 1/1 Running 0 20h
elasticsearch-master-0 1/1 Running 1 (18h ago) 20h
opensearch-cluster-master-0 1/1 Running 0 20h
(don't forget triple backtick :-)
To me it looks fine. What is the problem ? Everything is running
OK, I added an access policy to my router. As I understand from the docs the policy is activated by adding a line:
PilotLogsAccessPolicy = "diracx.routers.pilot_logging.access_policies:PilotLogsAccessPolicy"
into pyproject.toml
, this is where I spotted a sandbox policy, so I followed. I', assuming that if I comment that line in the toml file above I deactivate the policy.
Now I'm getting 500 on the client side, w/o any json doc I could decode. The server does not print anything. So it is a blind walk. I managed to have 1 successfull response in the past, followed by 500 at the second attempt. That's why I was concerned about the error message from the server, which looked like dead. Any hints how I could trace that 500 on the server side ?
any change in pyproject.toml
requires a reinstall of the package, so it is not really suited for run_demo.sh
(you could rerun the deployment
, but if you are unsure about what you are doing with kubernetes
, better just restart the demo, as changing pyproject.toml
while coding is relatively rare.
That being said, if you see error 500, there should be something in the logs.
kubectl logs diracx-demo-76dbcdd78f-hc8fj
does not give anything ?
It's a different pod now. It just gives the last message when it died (this is a tail of it, mind the time):
2024-07-10 12:08:27,113 - ERROR: Application startup failed. Exiting.
INFO: Stopping reloader process [1]
the --previous
should give you logs of the previous pod.
Confusions here. The current pod is with my pilot logger service, which used to work in Lyon. Now I'm not sure if my policy is active or not, is the change in the tool file taken into account ?
My client is not suitable for a fresh demo.
Could we have a chat on zoom, it will be quicker.
Confusions here. The current pod is with my pilot logger service, which used to work in Lyon. Now I'm not sure if my policy is active or not, is the change in the tool file taken into account ? My client is not suitable for a fresh demo.
I am awfully confused by this statement
Could we have a chat on zoom, it will be quicker.
Connect on mattermost
Can't do lattermost, my CERN account has been revoked.
you say: but if you are unsure about what you are doing with kubernetes, better just restart the demo, as changing pyproject.toml while coding is relatively rare.
I suspect that adding a policy could be a problem. If I restart the demo with a different .toml, will it be taken into account ?
Can't do lattermost, my CERN account has been revoked. You'll need that eventually ..
If I restart the demo with a different .toml, will it be taken into account ?
Restarting the demo will indeed reinstall diracx, so take the new pyproject.toml
into account.
Alternatively, I think that kubectl rollout restart deployment diracx
should also do and will be faster
Success!, but only once. The second run returns 500 and no logs at the server side.
/Users/martynia/mambaforge/envs/diracx-dev/bin/python /Users/martynia/Dirac/FastApiTutorial/logger_client.py
Returned message: b'{"pilot_stamp":"78f39a90-2073-11ec-98d7-b496913c0cf4","lines":[{"line_no":0,"line":"2022-02-23 13:48:35.123456 UTC DEBUG [PilotParams] JSON file loaded: pilot.json"},{"line_no":1,"line":"2022-02-23 13:48:36.123456 UTC DEBUG [PilotParams] JSON file analysed: pilot.json"}],"vo":"diracPilot"}' {'pilot_stamp': '78f39a90-2073-11ec-98d7-b496913c0cf4', 'lines': [{'line_no': 0, 'line': '2022-02-23 13:48:35.123456 UTC DEBUG [PilotParams] JSON file loaded: pilot.json'}, {'line_no': 1, 'line': '2022-02-23 13:48:36.123456 UTC DEBUG [PilotParams] JSON file analysed: pilot.json'}], 'vo': 'diracPilot'}
Process finished with exit code 0
But as I said before, sometimes 500 comes at the first attempt.
The router has some logs at the warning level, but nothing came out even when there was a successful connection .
I am afraid it is going to be very difficult to help you debug like that. You'll need to add a lot of printout to see where things crash.
The problem is that there is no logs at the server side. My token expired, I have 401 now (expected), rather than 500, but the server logs show:
2024-07-10 14:50:01,147 - ERROR: Application startup failed. Exiting. INFO: Stopping reloader process [1]
And this is a message from yesterday. Are you not surprised that there are no newer messages at all ?
very much yes. Please get back your CERN account so we can do a bit of interactive debugging on mattermost and/or zoom
I sent you an email, hopefully to the correct account.
Getting the account back could take a week. Details in the email.
So, my findings:
GET
, while your router expects a POST
, obviously that can't work.check_permissions
wrapper interrupts that and commits a suicide (it's the whole idea of it). Except that in this case, it dies too early, before having time to tell you what's wrong. What you can do for now is to comment the following line: https://github.com/DIRACGrid/diracx/blob/2e180849611f4628d4fdaf2a663e1c7b5be94b2d/diracx-routers/src/diracx/routers/access_policies.py#L133With that, you should be good to go
Hi,
As I said in my mail a few minutes ago, I “fixed” the os bug, and everything is fine.
According to the docs, if data is not None, the method is POST. It worked fine, I definitely have a few records in the os DB.
My policy only allows pilots and operator to post. How do I add an Operator property to DiracAdmin ?
Thanks a lot for clarifying the situation.
Cheers, JM
On 12 Jul 2024, at 17:54, chaen @.***> wrote:
So, my findings:
the client code you sent me was doing a GET, while your router expects a POST, obviously that can't work. You reported that while doing other atemps, it worked once and then failed. You're right, that's a bug (fixed in #276 https://github.com/DIRACGrid/diracx/pull/276) The fact that the application crashes is due to that #275 https://github.com/DIRACGrid/diracx/issues/275 You are sending invalid parameters, so you should get a proper error telling you that and it should be nicely logged, but the check_permissions wrapper interrupts that and commits a suicide (it's the whole idea of it). Except that in this case, it dies too early, before having time to tell you what's wrong. What you can do for now is to comment the following line: https://github.com/DIRACGrid/diracx/blob/2e180849611f4628d4fdaf2a663e1c7b5be94b2d/diracx-routers/src/diracx/routers/access_policies.py#L133 With that, you should be good to go
— Reply to this email directly, view it on GitHub https://github.com/DIRACGrid/diracx/issues/274#issuecomment-2225873888, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB2KE7JEVWJI5GPZU3JENITZL734BAVCNFSM6AAAAABKSKY2VCVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMRVHA3TGOBYHA. You are receiving this because you authored the thread.
Hello, The demo crashes w/o any interactions from my client. A freshly cloned demo:
2024-07-09 08:09:37,364 - ERROR: Traceback (most recent call last): File "/opt/conda/lib/python3.11/site-packages/starlette/routing.py", line 732, in lifespan async with self.lifespan_context(app) as maybe_state: File "/opt/conda/lib/python3.11/contextlib.py", line 210, in aenter return await anext(self.gen) ^^^^^^^^^^^^^^^^^^^^^ File "/diracx_source/diracx/diracx-routers/src/diracx/routers/fastapi_classes.py", line 35, in lifespan await asyncio.gather( File "/opt/conda/lib/python3.11/contextlib.py", line 650, in enter_async_context result = await _enter(cm) ^^^^^^^^^^^^^^^^ File "/opt/conda/lib/python3.11/contextlib.py", line 210, in aenter return await anext(self.gen) ^^^^^^^^^^^^^^^^^^^^^ File "/diracx_source/diracx/diracx-routers/src/diracx/routers/job_manager/sandboxes.py", line 62, in lifetime_function if not await s3_bucket_exists(self._client, self.bucket_name): ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/diracx_source/diracx/diracx-core/src/diracx/core/s3.py", line 29, in s3_bucket_exists return await _s3_exists(s3_client.head_bucket, Bucket=bucket_name) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/diracx_source/diracx/diracx-core/src/diracx/core/s3.py", line 39, in _s3_exists await method(**kwargs) File "/opt/conda/lib/python3.11/site-packages/aiobotocore/client.py", line 411, in _make_api_call raise error_class(parsed_response, operation_name) botocore.exceptions.ClientError: An error occurred (403) when calling the HeadBucket operation: Forbidden