TraceMachina / nativelink

NativeLink is an open source high-performance build cache and remote execution server, compatible with Bazel, Buck2, Reclient, and other RBE-compatible build systems. It offers drastically faster builds, reduced test flakiness, and specialized hardware.
https://nativelink.com
Apache License 2.0

Build stall when CAS reset #245

Closed: chrisstaite-menlo closed this issue 1 year ago

chrisstaite-menlo commented 1 year ago

Everything stalled during a full Chromium build. It all came back to life after a minute or so, but the cause appears to be this:

[2023-09-07T07:59:41.326Z WARN  h2::proto::streams::recv] recv_reset; remotely-reset pending-accept streams reached limit (20)
[2023-09-07T07:59:41.326Z ERROR cas] Failed running service : hyper::Error(Http2, Error { kind: GoAway(b"", ENHANCE_YOUR_CALM, Library) })

I'm not sure why this happened and it's the first time I've seen it.

chrisstaite-menlo commented 1 year ago

Getting quite a lot of this error from workers:

[2023-09-07T08:33:14.560Z WARN  simple_scheduler] Internal error for worker 46cd9913-c4bc-41f9-b1d3-cb3c3fe52d95: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/3147cfb3677571c42217a593af517bd73cd66990c8fcd08c9453030000000000-37957\" : Could not upload data to store in upload_file_to_store : for \"/root/.cache/turbo-cache/work/621d69738280b8cb78d47e763a09fc5f8166dfd0567b3523ecaa8659147b735c/by/goma/obj/third_party/blink/renderer/bindings/modules/v8/v8/v8_payment_response.o.d\" : Error while uploading results"] }
allada commented 1 year ago

It looks like the CAS was under a lot of load: the scheduler tried to do some lookups, but the CAS was unable to process more streams due to some other bottleneck, so the queue filled up and new connections were dropped.

We can mitigate some of this with: https://doc.servo.org/h2/server/struct.Builder.html#method.max_pending_accept_reset_streams

However, in this case more than one CAS instance might be needed to deal with excessive load.

https://github.com/hyperium/h2/pull/668
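
A minimal sketch of what that knob looks like, assuming the CAS wires its HTTP/2 server through the h2 builder linked above; the limit of 100 is purely illustrative (the default of 20 is what the "reached limit (20)" warning is hitting):

use h2::server;

// Sketch only: raise the number of remotely-reset streams allowed to sit in
// the pending-accept queue before h2 answers with GOAWAY(ENHANCE_YOUR_CALM).
fn configure_h2_server() -> server::Builder {
    let mut builder = server::Builder::new();
    builder.max_pending_accept_reset_streams(100); // illustrative value; default is 20
    builder
}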

chrisstaite-menlo commented 1 year ago

It was running at a load average of 12 with 300% CPU usage. I was trying to avoid having multiple CAS instances because they become a pain to manage. I may have to migrate to one of the S3-style storage solutions and have Turbo-Cache front it. It should all have been serving out of the memory store though, so it's surprising that it couldn't handle it.

allada commented 1 year ago

I wonder if it got caught up in a lock somewhere? If this happens again, it would be great to get a core dump.

The only N^2 problem I'm aware of in this project is in the scheduler matching, and since this happened on the CAS I'm certain it's an externally influenced thing.

Do you happen to know the disk usage? Tokio's disk commands are not actually async; it uses blocking calls on a pool of blocking-safe threads. By default the runtime is limited to a fixed number of these blocking threads, so maybe it ended up with too much backpressure from the disk I/O?
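
For reference, a minimal sketch of how that blocking-thread cap can be adjusted on a Tokio runtime; the value 1024 is illustrative and this is not the project's actual runtime setup:

use tokio::runtime::Builder;

fn main() {
    // Tokio runs std::fs calls on a bounded pool of blocking threads
    // (capped at 512 by default); if they are all tied up on slow disk I/O,
    // further file operations queue behind them.
    let rt = Builder::new_multi_thread()
        .enable_all()
        .max_blocking_threads(1024) // illustrative value
        .build()
        .expect("failed to build runtime");
    rt.block_on(async {
        // async work would go here
    });
}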

chrisstaite-menlo commented 1 year ago

It didn't appear to have a particularly high wa (I/O wait), but that's anecdotal.

chrisstaite-menlo commented 1 year ago

I'm actually getting a lot of this even when it's not heavily loaded:

[2023-09-07T14:06:14.316Z WARN  simple_scheduler] Internal error for worker 46cd9913-c4bc-41f9-b1d3-cb3c3fe52d95: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/06d463ed3e6600423456cdd4faf3c376f042ec06e38b9f002f2f040000000000-10824\" : Could not upload data to store in upload_file_to_store : Error while uploading results"] }
[2023-09-07T14:07:06.437Z WARN  simple_scheduler] Internal error for worker 46cd9913-c4bc-41f9-b1d3-cb3c3fe52d95: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/bd47cdb64c9fcf99f9b92f91ac9e407be88320381cc73a867d60040000000000-9453\" : Could not upload data to store in upload_file_to_store : Error while uploading results"] }
[2023-09-07T14:07:07.557Z WARN  simple_scheduler] Internal error for worker 46cd9913-c4bc-41f9-b1d3-cb3c3fe52d95: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/bdf50ad4b2d7bf49796d02e6f68de97c8c051c134cbb95ab5461040000000000-9838\" : Could not upload data to store in upload_file_to_store : Error while uploading results"] }
[2023-09-07T14:07:09.229Z WARN  simple_scheduler] Internal error for worker 46cd9913-c4bc-41f9-b1d3-cb3c3fe52d95: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/e7856deacd8906918a34ef8af8d16a120ba3d45fc678d8f2b262040000000000-9958\" : Could not upload data to store in upload_file_to_store : Error while uploading results"] }
[2023-09-07T14:07:14.642Z WARN  simple_scheduler] Internal error for worker 46cd9913-c4bc-41f9-b1d3-cb3c3fe52d95: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/0b4fff3fa0cd03d3e65189ba9ee94986feb26539ba1fcc3cfd6a040000000000-9197\" : Could not upload data to store in upload_file_to_store : Error while uploading results"] }
[2023-09-07T14:07:17.001Z WARN  simple_scheduler] Internal error for worker 46cd9913-c4bc-41f9-b1d3-cb3c3fe52d95: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/a350640a5de6855a692c7e81fd9b66dd1e2a27dffcccfd44ba70040000000000-12239\" : Could not upload data to store in upload_file_to_store : Error while uploading results"] }

I think we may have introduced a problem recently. I just updated the cluster to the latest main.

The only things in the CAS logs are:

[2023-09-07T14:02:52.560Z ERROR bytestream_server] Write Resp: 2.7540956 None Status { code: Internal, message: "Stream error at byte 8388608 : In ByteStreamServer::write()", source: None }
[2023-09-07T14:02:52.561Z ERROR bytestream_server] Write Resp: 8.560164 None Status { code: Internal, message: "Stream error at byte 25165824 : In ByteStreamServer::write()", source: None }
[2023-09-07T14:02:52.586Z ERROR bytestream_server] Write Resp: 9.943344 None Status { code: Internal, message: "Stream error at byte 29360128 : In ByteStreamServer::write()", source: None }
[2023-09-07T14:02:52.675Z ERROR bytestream_server] Write Resp: 0.034807287 None Status { code: Internal, message: "Stream error at byte 2097152 : In ByteStreamServer::write()", source: None }
[2023-09-07T14:02:52.692Z ERROR bytestream_server] Write Resp: 3.1737354 None Status { code: Internal, message: "Stream error at byte 12582912 : In ByteStreamServer::write()", source: None }
[2023-09-07T14:02:52.727Z ERROR bytestream_server] Write Resp: 0.07661839 None Status { code: Internal, message: "Stream error at byte 2097152 : In ByteStreamServer::write()", source: None }
[2023-09-07T14:02:52.734Z ERROR bytestream_server] Write Resp: 0.0853766 None Status { code: Internal, message: "Stream error at byte 2097152 : In ByteStreamServer::write()", source: None }
[2023-09-07T14:08:35.537Z ERROR bytestream_server] Write Resp: 0.012445173 None Status { code: Internal, message: "Stream error at byte 20480 : In ByteStreamServer::write()", source: None }
[2023-09-07T14:08:35.559Z ERROR bytestream_server] Write Resp: 0.010165742 None Status { code: Internal, message: "Stream error at byte 24576 : In ByteStreamServer::write()", source: None }
[2023-09-07T14:08:40.401Z ERROR bytestream_server] Write Resp: 0.012167023 None Status { code: Internal, message: "Stream error at byte 65536 : In ByteStreamServer::write()", source: None }
allada commented 1 year ago

It would be very useful to get some metrics for this published under Prometheus.

chrisstaite-menlo commented 1 year ago

I'm getting this on the worker:

[2023-09-07T08:32:24.893Z ERROR filesystem_store] Error { code: NotFound, messages: ["No such file or directory (os error 2)", "Failed to touch file in filesystem store \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934c945f030000000000-0\""] }
[2023-09-07T14:08:35.580Z ERROR filesystem_store] Error { code: NotFound, messages: ["No such file or directory (os error 2)", "Failed to touch file in filesystem store \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934c7ca3040000000000-0\""] }
[2023-09-07T14:08:35.618Z ERROR filesystem_store] Error { code: NotFound, messages: ["No such file or directory (os error 2)", "Failed to touch file in filesystem store \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934c7ca3040000000000-0\""] }
[2023-09-07T14:08:35.645Z ERROR filesystem_store] Error { code: NotFound, messages: ["No such file or directory (os error 2)", "Failed to touch file in filesystem store \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934c7ca3040000000000-0\""] }
[2023-09-07T14:08:35.645Z ERROR filesystem_store] Error { code: NotFound, messages: ["No such file or directory (os error 2)", "Failed to touch file in filesystem store \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934c7ca3040000000000-0\""] }
[2023-09-07T14:08:35.655Z ERROR filesystem_store] Error { code: NotFound, messages: ["No such file or directory (os error 2)", "Failed to touch file in filesystem store \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934c7ca3040000000000-0\""] }
[2023-09-07T14:08:35.655Z ERROR filesystem_store] Error { code: NotFound, messages: ["No such file or directory (os error 2)", "Failed to touch file in filesystem store \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934c7ca3040000000000-0\""] }
[2023-09-07T14:08:35.675Z ERROR filesystem_store] Error { code: NotFound, messages: ["No such file or directory (os error 2)", "Failed to touch file in filesystem store \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934c7ca3040000000000-0\""] }
[2023-09-07T14:08:35.676Z ERROR filesystem_store] Error { code: NotFound, messages: ["No such file or directory (os error 2)", "Failed to touch file in filesystem store \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934c7ca3040000000000-0\""] }
[2023-09-07T14:08:35.676Z ERROR filesystem_store] Error { code: NotFound, messages: ["No such file or directory (os error 2)", "Failed to touch file in filesystem store \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934c7ca3040000000000-0\""] }
[2023-09-07T14:08:35.676Z ERROR filesystem_store] Error { code: NotFound, messages: ["No such file or directory (os error 2)", "Failed to touch file in filesystem store \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934c7ca3040000000000-0\""] }
allada commented 1 year ago

Can you try reverting 67b90e2c9a254687b7525053bb4153f95e216b9d and see if that helps?

That commit was significant and is the only major change I can think of that might cause this.

chrisstaite-menlo commented 1 year ago

Just doing a build with that reverted now.

chrisstaite-menlo commented 1 year ago

I'm now running with that build and it appears a lot happier, although it's only been up for a few seconds, so I'll try to exercise it to check.

chrisstaite-menlo commented 1 year ago

I've just hit it with a totally new branch and it's flying along without any issues. The CAS is running at a load average of 18 and rising, 300% CPU, and a 2.8 wa, but there are no stalls or breakages.

Looks like there might be an issue with 67b90e2c9a254687b7525053bb4153f95e216b9d.

allada commented 1 year ago

Dang, I will revert that commit. I may need to find another way to do what it's trying to do. I tried a few strategies, but this was the one I hated the least.

chrisstaite-menlo commented 1 year ago

I spoke too soon; I'm getting stalls again. When it stalls, the scheduler and remoteexec_proxy CPU usage drops to a tiny amount and I get these:

[2023-09-07T14:54:23.990Z WARN  simple_scheduler] Internal error for worker 46ecd873-ab4f-4683-958f-c9b5b1db72b1: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file /root/.cache/turbo-cache/data-worker-test/tmp_path-cas/fcbb86c63a3e9f318ef3e082d7cb809dcc9df658dd9b22a71477000000000000-10778 : Could not upload data to store in upload_to_store : Error while uploading results"] }
[2023-09-07T14:54:25.290Z WARN  simple_scheduler] Internal error for worker 46ecd873-ab4f-4683-958f-c9b5b1db72b1: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file /root/.cache/turbo-cache/data-worker-test/tmp_path-cas/fe55b3977c64b4d0408e121f3e62406afb2258b73fbe6e192277000000000000-10888 : Could not upload data to store in upload_to_store : Error while uploading results"] }
[2023-09-07T14:54:27.344Z WARN  simple_scheduler] Internal error for worker dbfeef84-d1ad-4f6f-9f1e-9681195f2e35: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file /root/.cache/turbo-cache/data-worker-test/tmp_path-cas/0d1a3dcffaedfe4a96e4ab48a6b97b627e407045c8393d63a378000000000000-8344 : Could not upload data to store in upload_to_store : for \"/root/.cache/turbo-cache/work/7403455e0ecaa6f782afa837ca22e09c11802d6424b144758d4dbec5de1d36f5/by/goma/obj/components/safety_check/safety_check/update_check_helper.o\" : Error while uploading results"] }
[2023-09-07T14:54:36.713Z WARN  simple_scheduler] Internal error for worker 46ecd873-ab4f-4683-958f-c9b5b1db72b1: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file /root/.cache/turbo-cache/data-worker-test/tmp_path-cas/faef7fb3995348a490c099684ef8ff48c5a4d21a44cbaaf55079000000000000-8988 : Could not upload data to store in upload_to_store : for \"/root/.cache/turbo-cache/work/a9ef3a5608d364d407ed72a4542393a99aba5bdf327764eca13b7ba4a7f861ad/by/goma/obj/components/security_interstitials/core/core/bad_clock_ui.o.d\" : Error while uploading results"] }
[2023-09-07T14:54:37.671Z WARN  simple_scheduler] Internal error for worker 02ca6e82-c420-4286-b294-cb61181d3b9b: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file /root/.cache/turbo-cache/data-worker-test/tmp_path-cas/6e6561700bbfbd48a75972966e5bf527a0f7cea14f3ff50cdf77000000000000-9735 : Could not upload data to store in upload_to_store : for \"/root/.cache/turbo-cache/work/ac584fe072761ea0a2a91e8f5a0b3a203e9db4880927581c3ce4cf405ef0d7ab/by/goma/obj/components/security_interstitials/core/core/safe_browsing_quiet_error_ui.o.d\" : Error while uploading results"] }
[2023-09-07T14:54:37.976Z WARN  simple_scheduler] Internal error for worker 02ca6e82-c420-4286-b294-cb61181d3b9b: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file /root/.cache/turbo-cache/data-worker-test/tmp_path-cas/64f350e0acf2449906c1e7dea239a9fb5600e73b8a88f2e2fa77000000000000-8368 : Could not upload data to store in upload_to_store : for \"/root/.cache/turbo-cache/work/428835674a916f53f6e8b1fdde29879a719d8a6bb3216f5c954e9a57145861b6/by/goma/obj/components/security_interstitials/core/core/controller_client.o\" : Error while uploading results"] }
[2023-09-07T14:54:38.450Z WARN  simple_scheduler] Internal error for worker 02ca6e82-c420-4286-b294-cb61181d3b9b: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file /root/.cache/turbo-cache/data-worker-test/tmp_path-cas/148722dcd9a71441c5783525482540187c087ae9d17246f21278000000000000-10085 : Could not upload data to store in upload_to_store : for \"/root/.cache/turbo-cache/work/6f35c0b9223d00b71422bec7c5a8a1a585f9bcd44a840064b25871c762071605/by/goma/obj/components/security_interstitials/core/core/https_only_mode_enforcelist.o.d\" : Error while uploading results"] }
[2023-09-07T14:54:38.509Z WARN  simple_scheduler] Internal error for worker dbfeef84-d1ad-4f6f-9f1e-9681195f2e35: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file /root/.cache/turbo-cache/data-worker-test/tmp_path-cas/346bbdc66eeff2390a88ac5a1707ef7195645ef122d9c366bc7a000000000000-11609 : Could not upload data to store in upload_to_store : for \"/root/.cache/turbo-cache/work/a03735e7d1e744370ec03f9c39ad99fa6390ce6d607efca2d68a7fa31682cee7/by/goma/obj/components/segmentation_platform/internal/internal/field_trial_recorder.o.d\" : Error while uploading results"] }
[2023-09-07T14:54:39.107Z WARN  simple_scheduler] Internal error for worker 02ca6e82-c420-4286-b294-cb61181d3b9b: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file /root/.cache/turbo-cache/data-worker-test/tmp_path-cas/9c060331fbf2cf65e497a425ff6f33bde624f64b8fbe41d84178000000000000-11684 : Could not upload data to store in upload_to_store : for \"/root/.cache/turbo-cache/work/5cff60d775f277c693c2ad6a7b4e1f39207dcd97ed67beb3a2adb273315382d9/by/goma/obj/components/segmentation_platform/internal/internal/metadata_utils.o.d\" : Error while uploading results"] }
chrisstaite-menlo commented 1 year ago

I think that before I got these issues I was running a really old version, back at around 259224b28ec8b2f9d878bf079ddaea679baf082a.

chrisstaite-menlo commented 1 year ago

I don't suppose it could be 8ac4824d1d58379348b50a52cad331e417d1accf? The only other option is b0b66a126b6fb85668b098fc3d26678504b1a330 I think.

allada commented 1 year ago

Hmmm, would it be too much work to bisect it?

What hash were you running before attempting to upgrade?

allada commented 1 year ago

We might be conflating two issues.

The failed-to-upload stuff looks like it's failing a soft assert:

[WARN  simple_scheduler] Internal error for worker 46cd9913-c4bc-41f9-b1d3-cb3c3fe52d95: Error { code: Internal, messages: ["Receiver went away before receiving EOF : Failed to write eof to fast store in fast_slow store update : --- : Writer was dropped before EOF was sent : Failed to receive data in filesystem store : While processing with temp file \"/root/.cache/turbo-cache/data-worker-test/tmp_path-cas/06d463ed3e6600423456cdd4faf3c376f042ec06e38b9f002f2f040000000000-10824\" : Could not upload data to store in upload_file_to_store : Error while uploading results"] }

Triggered in: https://github.com/allada/turbo-cache/blob/main/cas/scheduler/simple_scheduler.rs#L468

I can see a couple of ways this might happen, but none of them should be critical.

Does this actually fail the build or just put noise in the logs?

chrisstaite-menlo commented 1 year ago

The build doesn't fail; it gets a RETRY entry in the Goma logs and everything stalls, as if it locks up. I'm currently trying with just 8ac4824d1d58379348b50a52cad331e417d1accf reverted to see what I get.

chrisstaite-menlo commented 1 year ago

Reverting that change does appear to have fixed it, and I think I have a reason why: previously, when 128 simultaneous builds produced their outputs, they only generated 128 concurrent uploads of results to the CAS. Now they upload the stdout and stderr too, so that's 384 simultaneous uploads. Combine that with the downloads and you can see how the limit might be reached on the CAS. I think this is simply a scale thing.

chrisstaite-menlo commented 1 year ago

Having said that, there are still four RETRY entries in Goma for this build, so perhaps it's still happening sometimes?

chrisstaite-menlo commented 1 year ago

I think that these logs in the CAS indicate that an issue occurred:

[2023-09-07T15:57:36.444Z ERROR bytestream_server] Write Resp: 0.16644613 None Status { code: Internal, message: "Stream error at byte 24576 : In ByteStreamServer::write()", source: None }
[2023-09-07T15:58:09.957Z ERROR bytestream_server] Write Resp: 0.15555136 None Status { code: Internal, message: "Stream error at byte 36864 : In ByteStreamServer::write()", source: None }
[2023-09-07T15:58:26.965Z ERROR bytestream_server] Write Resp: 0.004299194 None Status { code: Internal, message: "Stream error at byte 32768 : In ByteStreamServer::write()", source: None }
[2023-09-07T16:02:00.397Z ERROR bytestream_server] Write Resp: 0.015137762 None Status { code: Internal, message: "Stream error at byte 20480 : In ByteStreamServer::write()", source: None }
[2023-09-07T16:02:07.979Z ERROR bytestream_server] Write Resp: 0.024254864 None Status { code: Internal, message: "Stream error at byte 45056 : In ByteStreamServer::write()", source: None }

Although they don't really explain what that issue was...

allada commented 1 year ago

Looking into this, it's super strange: the reader and writer are both complaining about each other:

--> Receiver went away before receiving EOF ...
--> Writer was dropped before EOF was sent ...

This is super confusing.

chrisstaite-menlo commented 1 year ago

Should there just be a retry in there perhaps?

allada commented 1 year ago

I guess we could. I'd rather figure out the root cause though, because it might point to a more fundamental issue.

I'm still investigating what might be going on... The entire reason I created .err_tip() was because Buildbarn had the same issue of not giving many hints about where the problems were for this stuff, and .err_tip() was supposed to help... but I guess it's not enough :-(

allada commented 1 year ago

Can you share your worker config file?

chrisstaite-menlo commented 1 year ago
{
  "stores": {
    "GRPC_LOCAL_STORE": {
      // Note: This file is used to test GRPC store.
      "grpc": {
        "instance_name": "main",
        "endpoints": ["grpc://${CAS_ENDPOINT:-127.0.0.1:50051}"],
        "store_type": "CAS"
      }
    },
    "GRPC_LOCAL_AC_STORE": {
      // Note: This file is used to test GRPC store.
      "grpc": {
        "instance_name": "main",
        "endpoints": ["grpc://${CAS_ENDPOINT:-127.0.0.1:50051}"],
        "store_type": "AC"
      }
    },
    "WORKER_FAST_SLOW_STORE": {
      "fast_slow": {
        "fast": {
          "filesystem": {
            "content_path": "/root/.cache/turbo-cache/data-worker-test/content_path-cas",
            "temp_path": "/root/.cache/turbo-cache/data-worker-test/tmp_path-cas",
            "eviction_policy": {
              // 1gb.
              "max_bytes": 1000000000,
            }
          }
        },
        "slow": {
          "ref_store": {
            "name": "GRPC_LOCAL_STORE",
          }
        }
      }
    }
  },
  "workers": [{
    "local": {
      "worker_api_endpoint": {
        "uri": "grpc://${SCHEDULER_ENDPOINT:-127.0.0.1:50061}",
      },
      "cas_fast_slow_store": "WORKER_FAST_SLOW_STORE",
      "ac_store": "GRPC_LOCAL_AC_STORE",
      "work_directory": "/root/.cache/turbo-cache/work",
      "platform_properties": {
        "OSFamily": {
          "values": ["Linux"],
        },
        "cpu_count": {
          "query_cmd": "nproc"
        },
      },
      "precondition_script": "/root/precondition_script.sh"
    }
  }],
  "servers": []
}
allada commented 1 year ago

Well, I found one bug. I'm pretty sure it is not the root cause of anything, but it's still wrong (fast_rx is going to the slow store and vice versa): https://github.com/allada/turbo-cache/blob/main/cas/store/fast_slow_store.rs#L161

chrisstaite-menlo commented 1 year ago

That would explain why the log says it's populating the fast store but it's remote...

allada commented 1 year ago

I'm pretty sure what is happening is that the GRPC store (or somewhere downstream) is getting the exact number of bytes needed, then closing the stream before the sender is able to send an EOF.

In other words, somewhere downstream says: "oh, I got all the data you are going to send, I might as well close the connection", even though there's a zero-byte message that is still required to be sent (the EOF message).

If this is the case, I'd bet money that it's on a fence-post (like a 4096-byte boundary), and somewhere there's a short circuit on "exact bytes received".
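
To make the suspicion concrete, here is an illustrative sketch (not the project's actual code) of the kind of receive loop that would produce exactly these symptoms:

use bytes::Bytes;
use tokio::sync::mpsc::Receiver;

// Sketch of the suspected fence-post: the loop stops as soon as the byte
// count matches the expected size, so the sender's trailing zero-byte EOF
// message is never read and the sender reports "Receiver went away before
// receiving EOF".
async fn drain_exact(mut rx: Receiver<Bytes>, expected: usize) {
    let mut received = 0;
    while let Some(chunk) = rx.recv().await {
        received += chunk.len();
        if received == expected {
            break; // short circuit: the zero-byte EOF chunk is left unread
        }
    }
    // Reading one more message here (and accepting an empty chunk or a closed
    // channel) would let the sender finish its EOF handshake cleanly.
}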

chrisstaite-menlo commented 1 year ago

Another fence post?

[edit] I should finish reading your message before commenting :)

The stream errors all show the current byte when they fail:

24576 = 6 * 4096
36864 = 9 * 4096
32768 = 8 * 4096
20480 = 5 * 4096
45056 = 11 * 4096
allada commented 1 year ago

To test the theory, can you change DEFAULT_READ_BUFF_SIZE to some weird number like 1001 and retry? Let's see if the theory holds (lowering the number should, in theory, make it happen more frequently).

chrisstaite-menlo commented 1 year ago

For the theory to work, the file I'm downloading or uploading needs to be an exact multiple of DEFAULT_READ_BUFF_SIZE; it's not enough for the buffer just to be a smaller number, right?

allada commented 1 year ago

Yes, but my thinking is that there are so many files being uploaded that one of them is bound to hit it.

chrisstaite-menlo commented 1 year ago

Unfortunately, I'll have to look at it tomorrow now. I think we're certainly on to something though. It might also explain why the integration tests are sometimes flaky.

allada commented 1 year ago

Can you also share your CAS config?

allada commented 1 year ago

Oh, the red herring... This error message is super misleading...

All the error messages after "Writer was dropped before EOF was sent" are caused by slow_tx being dropped, which happens because fast_tx.send_eof() returned an error.

https://github.com/allada/turbo-cache/blob/f42f150926c23faba7aa63ba62a40eabb1ce8b20/cas/store/fast_slow_store.rs#L131

I'm quite certain GrpcStore::update() is returning Ok(()), but it's also the one not reading the EOF, so this is looking promising.

chrisstaite commented 1 year ago

Do you still need the CAS config?

allada commented 1 year ago

Let me see how far I can go, but it'd be nice to have.

chrisstaite-menlo commented 1 year ago
{
  "stores": {
    "CAS_MAIN_STORE": {
      "fast_slow": {
        "fast": {
          "memory": {
            "eviction_policy": {
              // 4gb
              "max_bytes": 4000000000
            }
          }
        },
        "slow": {
          "filesystem": {
            "content_path": "/root/.cache/turbo-cache/content_path-cas",
            "temp_path": "/root/.cache/turbo-cache/tmp_path-cas",
            "eviction_policy": {
              // 150gb.
              "max_bytes": 150000000000,
              // 2gb
              "evict_bytes": 2000000000
            }
          }
        }
      }
    },
    "AC_MAIN_STORE": {
      "fast_slow": {
        "fast": {
          "memory": {
            "eviction_policy": {
              // 500mb
              "max_bytes": 500000000
            }
          }
        },
        "slow": {
           "filesystem": {
             "content_path": "/root/.cache/turbo-cache/content_path-cas_ac",
             "temp_path": "/root/.cache/turbo-cache/tmp_path-cas_ac",
             "eviction_policy": {
                // 10gb.
                "max_bytes": 10000000000,
             }
          }
        }
      }
    }
  },
  "servers": [{
    "listen_address": "0.0.0.0:50052",
    "services": {
      "cas": {
        "main": {
          "cas_store": "CAS_MAIN_STORE"
        },
      },
      "ac": {
        "main": {
          "ac_store": "AC_MAIN_STORE"
        }
      },
      "capabilities": {},
      "bytestream": {
        "cas_stores": {
          "main": "CAS_MAIN_STORE",
        },
        // According to https://github.com/grpc/grpc.github.io/issues/371 16KiB - 64KiB is optimal.
        "max_bytes_per_stream": 64000, // 64kb.
      }
    }
  }]
}
allada commented 1 year ago

I think I found the problem: https://github.com/allada/turbo-cache/blob/f42f150926c23faba7aa63ba62a40eabb1ce8b20/cas/grpc_service/bytestream_server.rs#L398

It sees that the exact number of bytes has been received, publishes an EOF to the downstream store, then closes the channel without reading the possible EOF from the GRPC client's store.

I think to fix this we just need to try to read one last message here: https://github.com/allada/turbo-cache/blob/f42f150926c23faba7aa63ba62a40eabb1ce8b20/cas/store/grpc_store.rs#L306

and if it returns an EOF or an error, ignore it. If it returns more data, that's an error.
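
A rough sketch of that idea, assuming the incoming stream yields Result-wrapped byte chunks (the names here are illustrative, not the project's API):

use bytes::Bytes;
use futures::{Stream, StreamExt};

// After the expected number of bytes has been forwarded, poll the client
// stream once more so its zero-byte EOF message is consumed instead of the
// channel being closed with the EOF still pending.
async fn consume_trailing_eof<S, E>(stream: &mut S) -> Result<(), String>
where
    S: Stream<Item = Result<Bytes, E>> + Unpin,
{
    match stream.next().await {
        None => Ok(()),                                // stream already ended
        Some(Ok(chunk)) if chunk.is_empty() => Ok(()), // explicit zero-byte EOF
        Some(Err(_)) => Ok(()),                        // per the suggestion, ignore errors here
        Some(Ok(_)) => Err("unexpected data after final chunk".to_string()),
    }
}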

chrisstaite commented 1 year ago

DropCloserWriteHalf.send_eof should check close_after_size before producing an error, right?

Or the drop implementation for DropCloserWriteHalf should check before sending the error.

chrisstaite-menlo commented 1 year ago

The only way that error can be set is if stream.next().await is Err, and calling it again is unlikely to yield anything other than Err, so your proposal essentially removes the error from GrpcStore.write.

chrisstaite-menlo commented 1 year ago

The big problem is that inner_upload_file_to_store only returns the Err from the first future to produce one, so we don't actually get the error from the GrpcStore, only from the reader, which has had its other half dropped and therefore always errors.

chrisstaite-menlo commented 1 year ago

I tried adding:

            // If the current chunk is empty, attempt one more recv() from the
            // reader and use that instead, falling back to the empty chunk if
            // nothing arrives.
            let data = if data.is_empty() {
                local_state.reader.recv().await.unwrap_or(data)
            } else {
                data
            };

to the unfold in GrpcStore.update. It didn't help; I'm still getting loads of:

"Receiver went away before receiving EOF : Failed to write eof to slow store in fast_slow store update : Could not upload data to store in upload_file_to_store : for \"/root/.cache/turbo-cache/work/7cadf5b8cacd40c29a4cf5f691fad8d0d3c7f7ffe324aae81ed0cd4aaaa7769e/by/goma/obj/gpu/command_buffer/service/gles2_sources/shared_image_backing_factory.o\" : Error while uploading results"] }

It's super weird. I've modified all of the paths to merge errors now, and I don't have any errors from any other paths, so I don't understand how the Receiver is being dropped.

chrisstaite-menlo commented 1 year ago

Can I suggest we also add this, to avoid closing both ends of the pipe in error cases:

diff --git a/cas/store/ac_utils.rs b/cas/store/ac_utils.rs
index cb42067..bf0fb44 100644
--- a/cas/store/ac_utils.rs
+++ b/cas/store/ac_utils.rs
@@ -17,7 +17,7 @@ use std::io::Cursor;
 use std::pin::Pin;

 use bytes::BytesMut;
-use futures::{future::try_join, Future, FutureExt, TryFutureExt};
+use futures::{join, Future, FutureExt};
 use prost::Message;
 use sha2::{Digest, Sha256};
 use tokio::io::{AsyncRead, AsyncReadExt};
@@ -111,7 +111,13 @@ fn inner_upload_file_to_store<'a, Fut: Future<Output = Result<(), Error>> + 'a>(
     let upload_file_to_store_fut = cas_store
         .update(digest, rx, UploadSizeInfo::ExactSize(digest.size_bytes as usize))
         .map(|r| r.err_tip(|| "Could not upload data to store in upload_file_to_store"));
-    try_join(read_data_fn(tx), upload_file_to_store_fut).map_ok(|(_, _)| ())
+    let read_data_fut = read_data_fn(tx);
+    async move {
+        // Ensure we get errors reported from both sides
+        let (upload_result, read_result) = join!(upload_file_to_store_fut, read_data_fut);
+        upload_result.merge(read_result)?;
+        Ok(())
+    }
 }

 /// Uploads data to our store for given digest.
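
As a side note on the design choice, here is a tiny self-contained illustration of why the patch swaps try_join for join (the function names and error strings are made up): try_join resolves with the first error and drops the other future, whereas join waits for both results so their errors can be merged, which is what the .merge(...) call above does.

use futures::{executor::block_on, join};

async fn upload_side() -> Result<(), String> {
    Err("receiver went away before receiving EOF".to_string())
}

async fn read_side() -> Result<(), String> {
    Err("failed to read local file".to_string())
}

fn main() {
    // join! drives both futures to completion, so both error messages are
    // available; try_join would have surfaced only whichever failed first.
    let (upload_result, read_result) = block_on(async { join!(upload_side(), read_side()) });
    println!("upload: {:?}", upload_result);
    println!("read:   {:?}", read_result);
}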
chrisstaite-menlo commented 1 year ago

Confirming that I am not seeing this issue with PR #250 applied.

allada commented 1 year ago

Added your suggestion (but modified).