dadav / gorge

Puppet-Forge implementation in golang.
Apache License 2.0

caching errors on clean hosts (reprise?) #20

Closed: johnwarburton closed this issue 2 months ago

johnwarburton commented 3 months ago

Hi @dadav

I am migrating our CI pipeline to start using gorge, and it looks like I am yet again seeing "token" errors similar to https://github.com/dadav/gorge/issues/4. Our CI system spins up a fresh container with an empty Puppet modules directory, requiring a full r10k install.

I can reproduce it with my test script, starting with empty modules and r10k cache directories:

[root@sypls062 puppet-forge]# ./run-r10k-parallel.sh 20 http://puppet-forge-proxy 10
Mon Aug  5 12:28:19 PM AEST 2024
Mon Aug  5 12:28:23 PM AEST 2024
12 errors in errors/errors/202408050949-parallel-20-concurrency-10

[root@puppet-forge]# cat errors/202408050949-parallel-20-concurrency-10 | grep ERROR | cut -c 1-200
ERROR    -> Module grubby failed to synchronize due to 434: unexpected token at '{"uri":"/v3/releases/jcpunk-efi-0.1.1","slug":"jcpunk-efi-0.1.1","version":"0.1.1"_strings.html) and [style guide](https:
ERROR    -> Error during concurrent deploy of a module: 434: unexpected token at '{"uri":"/v3/releases/jcpunk-efi-0.1.1","slug":"jcpunk-efi-0.1.1","version":"0.1.1"_strings.html) and [style guide](https
ERROR    -> Module nfs failed to synchronize due to 783: unexpected token at '{"uri":"/v3/releases/puppetlabs-inifile-6.1.0","slug":"puppetlabs-inifile-6.1.0","module":{"uri":"/v3/modules/puppetlabs-ini
ERROR    -> Error during concurrent deploy of a module: 783: unexpected token at '{"uri":"/v3/releases/puppetlabs-inifile-6.1.0","slug":"puppetlabs-inifile-6.1.0","module":{"uri":"/v3/modules/puppetlabs
ERROR    -> 434: unexpected token at '{"uri":"/v3/releases/jcpunk-efi-0.1.1","slug":"jcpunk-efi-0.1.1","version":"0.1.1"_strings.html) and [style guide](https://puppet.com/docs/puppet/latest/puppet_stri
ERROR    -> 783: unexpected token at '{"uri":"/v3/releases/puppetlabs-inifile-6.1.0","slug":"puppetlabs-inifile-6.1.0","module":{"uri":"/v3/modules/puppetlabs-inifile","slug":"puppetlabs-inifile","name"

I am running the tests against a fully primed cache on a gorge server behind a k8s nginx ingress running a single replica. The client is running the latest r10k.

I have also run these tests on bare-metal hosts with plenty of network, CPU and memory capacity. I still get similar errors, so I do not think the infrastructure is a factor in this bug.

/ $ gorge version
0.6.1-alpha

[root@errors]# /opt/puppetlabs/puppet/bin/ruby --version
ruby 2.7.8p225 (2023-03-30 revision 1f4d455848) [x86_64-linux]
[root@errors]# /opt/puppetlabs/puppet/bin/r10k version
r10k 4.1.0

There are no errors in either the gorge server logs or the nginx ingress logs.

I can reproduce the issue at will (script attached). The higher the parallelism and the higher the r10k concurrency, the sooner you see errors. However, I can still see issues with parallelism set to 1 and r10k concurrency at 10. I am the only client hitting this gorge server, so it does not look load related.
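For reference, here is a rough standalone sketch of the kind of load involved (hypothetical, not the attached script, which drives r10k): fetch forge metadata URLs concurrently and check that each body parses as a single JSON document. The proxy URL and paths are just examples taken from the errors above.

package main

import (
    "encoding/json"
    "fmt"
    "io"
    "net/http"
    "sync"
)

func main() {
    urls := []string{
        "http://puppet-forge-proxy/v3/modules/puppet-rsyslog",
        "http://puppet-forge-proxy/v3/releases/puppetlabs-inifile-6.1.0",
    }
    var wg sync.WaitGroup
    for round := 0; round < 10; round++ { // several rounds of concurrent requests
        for _, u := range urls {
            wg.Add(1)
            go func(u string) {
                defer wg.Done()
                resp, err := http.Get(u)
                if err != nil {
                    fmt.Println("request error:", err)
                    return
                }
                defer resp.Body.Close()
                body, _ := io.ReadAll(resp.Body)
                // a merged or truncated response fails to parse as one document
                var v interface{}
                if err := json.Unmarshal(body, &v); err != nil {
                    fmt.Printf("bad JSON from %s: %v\n", u, err)
                }
            }(u)
        }
    }
    wg.Wait()
}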

If there is anything I can do to help pinpoint the issue, please let me know

Regards

John

gorge-issue.tgz (attached)

johnwarburton commented 3 months ago

Digging further today, I can see that some r10k error messages are complaining about JSON that is actually two responses from different modules merged together, while others are just truncated lines.

Here's a case where the client requested the nfs module but appears to have received data from prometheus_reporter:

ERROR    -> Module nfs failed to synchronize due to 783: unexpected token at '=\"node.example.com\"} 0.012010700000000003\npuppet_report_time{name=\"Config retrieval\",environment=\"production\",host=\"node.example.com\"} 20.471957786\npuppet_report_time{name=\"Cron\",environment=\"production\",host=\"node.example.com\"} 0.000874118\npuppet_report_time{name=\"Exec\",environment=\"production\",host=\"node.example.com\"} 0.4114313850000001\npuppet_report_time etc etc

Truncations aplenty

r10k: Runtime error: #<Faraday::ParsingError wrapped=#<JSON::ParserError: 783: unexpected token at '{"uri":"/v3/releases/puppet-extlib-7.0.0","slug":"puppet-extlib-7.0.0","module":{"uri":"/v3/modules/puppet-extlib","slug":"puppet-extlib","name":"puppet-extlib","owner":{"uri":"/v3/users/Vox

ERROR    -> Module transition failed to synchronize due to 434: unexpected token at '{"uri":"/v3/'

ERROR    -> 783: unexpected token at '-selinux_core-1.0.2","slug":"puppetlabs-selinux_core-1.0.2","version":"1.0.2","file_uri":"/v3/files/puppetlabs-selinux_core-1.0.2.tar.gz","file_size":17916},{"uri":"/v3/releases/puppetlabs-selinux_core-1.0.2","slug":"puppetlabs-selinux_core-1.0.2","version":"1.0.2","file_uri":"/v3/files/puppetlabs-selinux_core-1.0.2.tar.gz","file_size":17916},{"uri":"/v3/releases/puppetlabs-selinux_core-1.0.2","slug":"puppetlabs-selinux_core-1.0.2","version":"1.0.2","file_uri":"/v3/files/puppetlabs-selinux_core-1.0.2.tar.gz","file_size":17916}]}

Thinking this might be a concurrency problem, I turned off module scanning (--modules-scan-sec 0), but that did not help.

I also wondered if the chi router's behaviour under high load with https://github.com/go-chi/stampede needed tweaking (stampede coalesces concurrent identical requests and serves them all one cached response), so I set the cache in cmd/serve.go to a ridiculously large size, to no effect:

cachedMiddleware := stampede.HandlerWithKey(102400, ....

But running tcpdump between client and server, so far I cannot see this truncation happening on the wire - I will keep looking.

johnwarburton commented 3 months ago

As soon as I sent this comment off, I did see a "merged" response from gorge:

$ tshark -r /var/tmp/forge.tcpdump -VVV -e frame.time -e ip.src -e ip.dst -e http.response.code -e http.response_for.uri -e http.file_data -e http.request.method -e http.request.uri -Tfields| less

Aug  9, 2024 12:17:13.141624000 AEST    10.70.73.83     10.70.71.28     200     http://sypls059:7777/v3/modules/puppet-rsyslog  {"uri":"/v3/releases/puppetlabs-sshkeys_core-2.3.0","slug":"puppetlabs-sshkeys_core-2.3.0","module":{"uri":"/v3/modules/puppetlabs-sshkeys_core","slug":"puppetlabs-sshkeys_core","name":"puppetlabs-sshkeys_core",

... (response body continues for several more lines) ...

puppet_report_time{name=\"File\",environment=\"production\",host=\"node.example.com\"} 0.32955574000000015\npuppet_report_time

So that's a request for puppet-rsyslog with two different modules' data merged into the same response.

dadav commented 3 months ago

Hi @johnwarburton,

thanks for your report, I'm gonna dig into this :)

johnwarburton commented 2 months ago

Hi @dadav

I am stumbling around trying to understand the code, as a beginner to Go and to concurrency handling with the chi router.

I found an article describing how to use the Go race detector, so I compiled gorge with:

$ go build -race
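(In case it helps anyone reproducing: the race-enabled binary can also write its reports to files via the standard GORACE environment variable; the serve invocation here is just illustrative.)

$ GORACE="log_path=/var/tmp/gorge-race" ./gorge serve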

A lot of data races were found in middleware/proxy, like:

WARNING: DATA RACE
Write at 0x00c0008e85b8 by goroutine 3355:
  encoding/json.appendString[go.shape.string]()
.
.
.
  github.com/dadav/gorge/internal/middleware.(*capturedResponseWriter).sendCapturedResponse()
      /export/home/jwarburt/puppet-forge/gorge/internal/middleware/proxy.go:37 +0x139
  github.com/dadav/gorge/cmd.init.func2.3.ProxyFallback.3.1()
      /export/home/jwarburt/puppet-forge/gorge/internal/middleware/proxy.go:87 +0x324

I naively tried adding some mutex handling to the capturedResponseWriter struct, but I am still seeing similar race warnings in the same piece of code:

type capturedResponseWriter struct {
    http.ResponseWriter
    body   []byte
    status int
    mu     sync.Mutex // Add a mutex to protect shared resources
}

func (w *capturedResponseWriter) WriteHeader(code int) {
    w.mu.Lock()
    defer w.mu.Unlock()
    w.status = code
}

func (w *capturedResponseWriter) Write(body []byte) (int, error) {
    w.mu.Lock()
    defer w.mu.Unlock()
    w.body = body
    return len(body), nil
}

func (w *capturedResponseWriter) sendCapturedResponse() {
    w.mu.Lock()
    defer w.mu.Unlock()
    w.ResponseWriter.WriteHeader(w.status)
    w.ResponseWriter.Write(w.body)
}
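One detail worth flagging in my own attempt above: Write keeps only the slice from the most recent call, so a response written in several chunks would lose everything but the last one. A variant that accumulates chunks, copying the bytes since callers may reuse the slice after Write returns, would look something like this (purely my own sketch, not tested against the race; needs "bytes" in the imports):

type capturedResponseWriter struct {
    http.ResponseWriter
    body   bytes.Buffer
    status int
}

func (w *capturedResponseWriter) WriteHeader(code int) {
    w.status = code
}

func (w *capturedResponseWriter) Write(body []byte) (int, error) {
    // bytes.Buffer.Write copies its input, so the caller may reuse the slice
    return w.body.Write(body)
}

func (w *capturedResponseWriter) sendCapturedResponse() {
    w.ResponseWriter.WriteHeader(w.status)
    w.ResponseWriter.Write(w.body.Bytes())
}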

I think this all started when the http server was put into its own goroutine at https://github.com/dadav/gorge/blob/main/cmd/serve.go#L281

I will keep digging tomorrow

johnwarburton commented 2 months ago

Hi @dadav

The race conditions reported (attached in gorge-race.txt) look to come from multiple goroutines writing to the same http.ResponseWriter variable w.

I have tried many ways to isolate the write. It works in proxy.go, as there is one clear write happening at w.ResponseWriter.Write(w.body), which I wrapped in sendCapturedResponse in ProxyFallback:

        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // Wrap the response writer with the synchronized response writer
            syncedWriter, ok := w.(*SyncedResponseWriter)
            if !ok {
                syncedWriter = &SyncedResponseWriter{ResponseWriter: w}
            }

            capturedResponseWriter := &capturedResponseWriter{ResponseWriter: syncedWriter}
            next.ServeHTTP(capturedResponseWriter, r)

where a SyncedResponseWriter middleware type (truncated here for readability) did the locking:

type SyncedResponseWriter struct {
    http.ResponseWriter
    mu sync.Mutex
}

func (w *SyncedResponseWriter) Write(b []byte) (int, error) {
    w.mu.Lock()
    defer w.mu.Unlock()
    return w.ResponseWriter.Write(b)
}

But the EncodeJSONResponse function in routers.go does its write at return json.NewEncoder(w).Encode(i), where there is no single explicit write: in the race file we can see the json encoding internals doing the actual writes. So it is unclear how to lock the write here, and this did not work:

    syncedWriter, ok := w.(*customMiddleware.SyncedResponseWriter)
    if !ok {
        syncedWriter = &customMiddleware.SyncedResponseWriter{ResponseWriter: w}
    }
.
.
.
    if i != nil {
        return json.NewEncoder(syncedWriter).Encode(i)
    }
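One idea I have not verified (a sketch only, approximating the generated function's header and status handling): marshal to memory first, so the underlying writer sees one single Write instead of the encoder's many small ones:

func EncodeJSONResponse(i interface{}, status *int, w http.ResponseWriter) error {
    w.Header().Set("Content-Type", "application/json; charset=UTF-8")
    if status != nil {
        w.WriteHeader(*status)
    } else {
        w.WriteHeader(http.StatusOK)
    }
    if i == nil {
        return nil
    }
    buf, err := json.Marshal(i) // encode to memory first
    if err != nil {
        return err
    }
    _, err = w.Write(buf) // one write to the ResponseWriter
    return err
}

Even then, if two goroutines really are sharing the same underlying ResponseWriter, a single write would only narrow the window rather than remove the race.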

Going to extremes and putting mutexes all around this code just deadlocks it.

Earlier in the week I went overboard with the mutexes in proxy.go: the race warnings disappeared, but proxying failed completely.

I am really out of my depth Go-wise here. A lot of the problem analysis came from a Go developer at work, and the mutex code came from AI coding assistants.

Any hints on where to go next would be appreciated

Thanks

John