frappe / press

Full service cloud hosting for the Frappe stack - powers Frappe Cloud
https://frappe.cloud
GNU Affero General Public License v3.0

Blind job retries break sites during updates #1847

Open adityahase opened 3 months ago

adityahase commented 3 months ago

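The same job was tried 3 times (499, 200, 200). HTTP 499 is nginx's non-standard "client closed request" status: the client (here python-requests) closed the connection before a response was sent, i.e. a client-side timeout. The last field of the first log line below (30.106) is presumably the request time in seconds, which fits a 30-second client timeout.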

It is unclear whether Agent accepted the job on the first or the second attempt. Either way, the third attempt should not have been made.
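A minimal sketch of a retry that is not blind, assuming Agent could deduplicate jobs on a client-supplied key and answer "do you already have this job?". Neither capability is confirmed here; `FakeAgent`, `post_job`, `has_job` and `submit_once` are hypothetical names used only for illustration, and the agent is an in-memory stand-in so the example runs without a network.

```python
import uuid
from dataclasses import dataclass, field


class RequestTimeout(Exception):
    """Raised when the client gives up waiting (what nginx logs as 499)."""


@dataclass
class FakeAgent:
    """Hypothetical in-memory stand-in for Agent, not the real API."""
    timeouts_left: int = 1
    jobs: dict = field(default_factory=dict)
    executions: int = 0

    def post_job(self, key, path):
        # Deduplicate on the key: a repeated POST does not run the job again.
        if key not in self.jobs:
            self.jobs[key] = path
            self.executions += 1
        # Simulate the job being accepted while the reply is lost.
        if self.timeouts_left > 0:
            self.timeouts_left -= 1
            raise RequestTimeout(path)
        return {"job": key}

    def has_job(self, key):
        return key in self.jobs


def submit_once(agent, path, max_attempts=3):
    """Submit a job, but check its status before re-sending after a timeout."""
    key = uuid.uuid4().hex  # one key shared by every attempt of this job
    for _ in range(max_attempts):
        try:
            return agent.post_job(key, path)
        except RequestTimeout:
            # Outcome unknown: ask before sending the same job again.
            if agent.has_job(key):
                return {"job": key}
    raise RuntimeError(f"job {key} was not accepted after {max_attempts} attempts")
```

With this shape, a timeout on the first attempt ends in a status check instead of a second and third POST, so the migrate step would run at most once.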

$ less /var/log/nginx/access.log | grep benches/bench-bbbb-000548-fyy-mumbai/sites/site.frappe.cloud/update/migrate
xx.xx.xx.xx - - [06/Jun/2024:06:31:25 +0000] "POST /agent/benches/bench-bbbb-000548-fyy-mumbai/sites/site.frappe.cloud/update/migrate HTTP/1.1" 499 0 "-" "python-requests/2.31.0" "-" "fyy-mumbai.frappe.cloud" 30.106
xx.xx.xx.xx - - [06/Jun/2024:06:32:24 +0000] "POST /agent/benches/bench-bbbb-000548-fyy-mumbai/sites/site.frappe.cloud/update/migrate HTTP/1.1" 200 15 "-" "python-requests/2.31.0" "-" "fyy-mumbai.frappe.cloud" 19.040
xx.xx.xx.xx - - [06/Jun/2024:06:32:43 +0000] "POST /agent/benches/bench-bbbb-000548-fyy-mumbai/sites/site.frappe.cloud/update/migrate HTTP/1.1" 200 15 "-" "python-requests/2.31.0" "-" "fyy-mumbai.frappe.cloud" 0.107
xx.xx.xx.xx - - [06/Jun/2024:06:35:26 +0000] "POST /agent/benches/bench-bbbb-000548-fyy-mumbai/sites/site.frappe.cloud/update/recover HTTP/1.1" 404 460 "-" "python-requests/2.31.0" "-" "fyy-mumbai.frappe.cloud" 0.111
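Duplicates like the ones above could be spotted with a small script instead of grep. This is a rough sketch that assumes the log format shown in the lines above and that the last field is the request time in seconds; `parse` and `repeated_posts` are illustrative names, not part of Press or Agent.

```python
import re
from collections import Counter

# Matches: "METHOD /path HTTP/x.y" STATUS ... REQUEST_TIME (last field)
LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) HTTP/[\d.]+" '
    r'(?P<status>\d{3}) .* (?P<request_time>[\d.]+)$'
)


def parse(line):
    """Return (method, path, status, request_time) or None if no match."""
    m = LINE.search(line.strip())
    if m is None:
        return None
    return m["method"], m["path"], int(m["status"]), float(m["request_time"])


def repeated_posts(lines):
    """Map each path that was POSTed more than once to its request count."""
    counts = Counter()
    for line in lines:
        parsed = parse(line)
        if parsed and parsed[0] == "POST":
            counts[parsed[1]] += 1
    return {path: n for path, n in counts.items() if n > 1}
```

Run against the four lines above, `repeated_posts` would report the `update/migrate` path with a count of 3 and leave out `update/recover`, which was requested once.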

Since all the update steps had failed, recovery picked the same bench as the target (as expected) and failed with 404 (since the site had already been moved).

References: https://frappecloud.com/app/agent-job/6f1745e755 https://frappecloud.com/app/agent-job/b38c4d1821

adityahase commented 3 months ago

HTTP 404 responses to site requests should probably trigger fetch_bench_from_agent, i.e. find wherever the site actually is and use that as the source of truth.
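A rough sketch of that fallback, under stated assumptions: `call_agent` and `locate_bench` are hypothetical callables injected by the caller, with `locate_bench` standing in for whatever fetch_bench_from_agent actually does (its real signature is not shown in this issue). `SiteNotFound` is an illustrative exception for a 404 from Agent.

```python
class SiteNotFound(Exception):
    """Hypothetical error for a 404 returned by Agent for a site request."""


def run_on_site(site, bench, call_agent, locate_bench):
    """Run a site request; on 404, look up where the site really is and retry once.

    call_agent(bench, site) performs the request and raises SiteNotFound on 404.
    locate_bench(site) returns the bench that currently holds the site, or None.
    """
    try:
        return call_agent(bench, site)
    except SiteNotFound:
        actual = locate_bench(site)
        # Nothing new learned: surface the original 404 instead of looping.
        if actual is None or actual == bench:
            raise
        # Treat the discovered location as the source of truth.
        return call_agent(actual, site)
```

The single retry is deliberate: if the site cannot be found anywhere, or the lookup points back at the bench that just returned 404, the error is re-raised rather than retried.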