Closed: johnwarburton closed this issue 2 months ago
Digging further today, I can see that some r10k error messages complain about JSON that is two lines from different modules merged together, while others are just truncated lines.

Here is what the client thought was the `nfs` module, but the data appears to have come from `prometheus_reporter`:

```
ERROR -> Module nfs failed to synchronize due to 783: unexpected token at '=\"node.example.com\"} 0.012010700000000003\npuppet_report_time{name=\"Config retrieval\",environment=\"production\",host=\"node.example.com\"} 20.471957786\npuppet_report_time{name=\"Cron\",environment=\"production\",host=\"node.example.com\"} 0.000874118\npuppet_report_time{name=\"Exec\",environment=\"production\",host=\"node.example.com\"} 0.4114313850000001\npuppet_report_time etc etc
```
Truncations aplenty:

```
10k: Runtime error: #<Faraday::ParsingError wrapped=#<JSON::ParserError: 783: unexpected token at '{"uri":"/v3/releases/puppet-extlib-7.0.0","slug":"puppet-extlib-7.0.0","module":{"uri":"/v3/modules/puppet-extlib","slug":"puppet-extlib","name":"puppet-extlib","owner":{"uri":"/v3/users/Vox
ERROR -> Module transition failed to synchronize due to 434: unexpected token at '{"uri":"/v3/'
ERROR -> 783: unexpected token at '-selinux_core-1.0.2","slug":"puppetlabs-selinux_core-1.0.2","version":"1.0.2","file_uri":"/v3/files/puppetlabs-selinux_core-1.0.2.tar.gz","file_size":17916},{"uri":"/v3/releases/puppetlabs-selinux_core-1.0.2","slug":"puppetlabs-selinux_core-1.0.2","version":"1.0.2","file_uri":"/v3/files/puppetlabs-selinux_core-1.0.2.tar.gz","file_size":17916},{"uri":"/v3/releases/puppetlabs-selinux_core-1.0.2","slug":"puppetlabs-selinux_core-1.0.2","version":"1.0.2","file_uri":"/v3/files/puppetlabs-selinux_core-1.0.2.tar.gz","file_size":17916}]}
```
Thinking this might be concurrency, I turned off module scanning with `--modules-scan-sec 0`, but that did not help.

I also wondered whether the chi router's ability to function under high load with https://github.com/go-chi/stampede needed tweaking, so I set the cache to a ridiculously large size in `cmd/serve.go`, to no effect:

```go
cachedMiddleware := stampede.HandlerWithKey(102400, ....
```
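For anyone reading along, this is roughly how a stampede cache is wired into a chi router. It is a minimal sketch of my own using the README-style `stampede.Handler(cacheSize, ttl)` helper and made-up routes, not gorge's actual `HandlerWithKey` call in `cmd/serve.go`:

```go
package main

import (
	"net/http"
	"time"

	"github.com/go-chi/chi/v5"
	"github.com/go-chi/stampede"
)

func main() {
	r := chi.NewRouter()

	// First argument is the cache size, second is how long a cached
	// response is kept before the next request goes upstream again.
	cached := stampede.Handler(512, 1*time.Second)

	// Only routes wrapped with the middleware get coalesced/cached.
	r.With(cached).Get("/v3/modules/{module}", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte(`{"placeholder": "module metadata"}`))
	})

	http.ListenAndServe(":7777", r)
}
```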
But running `tcpdump` between the client and server, so far I cannot see this truncation happening. I will keep looking.

As soon as I sent this comment off, I did see a "merged" response from gorge:
```
$ tshark -r /var/tmp/forge.tcpdump -VVV -e frame.time -e ip.src -e ip.dst -e http.response.code -e http.response_for.uri -e http.file_data -e http.request.method -e http.request.uri -Tfields| less
Aug 9, 2024 12:17:13.141624000 AEST 10.70.73.83 10.70.71.28 200 http://sypls059:7777/v3/modules/puppet-rsyslog {"uri":"/v3/releases/puppetlabs-sshkeys_core-2.3.0","slug":"puppetlabs-sshkeys_core-2.3.0","module":{"uri":"/v3/modules/puppetlabs-sshkeys_core","slug":"puppetlabs-sshkeys_core","name":"puppetlabs-sshkeys_core",
[line continues]
[line continues]
[line continues]
puppet_report_time{name=\"File\",environment=\"production\",host=\"node.example.com\"} 0.32955574000000015\npuppet_report_time
```
So that's a request for `puppet-rsyslog`, with two modules merged into the same response.
Hi @johnwarburton,
thanks for your report, I'm gonna dig into this :)
Hi @dadav
I am stumbling around trying to understand the code, as a beginner to golang and to concurrency handling with the chi router.

I found an article describing how to use the Go race detector, so I compiled gorge with:

```
$ go build -race
```

A lot of race conditions were found in `middleware/proxy`, like:
```
WARNING: DATA RACE
Write at 0x00c0008e85b8 by goroutine 3355:
  encoding/json.appendString[go.shape.string]()
  ...
  github.com/dadav/gorge/internal/middleware.(*capturedResponseWriter).sendCapturedResponse()
      /export/home/jwarburt/puppet-forge/gorge/internal/middleware/proxy.go:37 +0x139
  github.com/dadav/gorge/cmd.init.func2.3.ProxyFallback.3.1()
      /export/home/jwarburt/puppet-forge/gorge/internal/middleware/proxy.go:87 +0x324
```
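For completeness: the race detector only reports races that actually happen at runtime, so the `-race` build needs concurrent traffic. Something as small as this hypothetical driver (not my real test script, which uses r10k) is enough to surface them:

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

func main() {
	// Hypothetical local gorge instance; adjust host/port and module to taste.
	const url = "http://localhost:7777/v3/modules/puppetlabs-stdlib"

	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := http.Get(url)
			if err != nil {
				fmt.Println("request error:", err)
				return
			}
			resp.Body.Close()
		}()
	}
	wg.Wait()
}
```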
I did naively try to add some mutex handling to the `capturedResponseWriter` struct, but I am still seeing similar race warnings in the same piece of code:
```go
type capturedResponseWriter struct {
	http.ResponseWriter
	body   []byte
	status int
	mu     sync.Mutex // Add a mutex to protect shared resources
}

func (w *capturedResponseWriter) WriteHeader(code int) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.status = code
}

func (w *capturedResponseWriter) Write(body []byte) (int, error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.body = body
	return len(body), nil
}

func (w *capturedResponseWriter) sendCapturedResponse() {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.ResponseWriter.WriteHeader(w.status)
	w.ResponseWriter.Write(w.body)
}
```
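One thing I noticed while staring at this wrapper: `Write` can be called more than once per response (for example when a large body is copied through in chunks), and keeping only the last chunk would itself truncate the body. A variant of my attempt that accumulates writes would be (sketch only, I am not claiming this fixes the data race):

```go
// Sketch: append each chunk instead of overwriting the previous one, so
// multi-chunk responses are captured whole before being replayed.
func (w *capturedResponseWriter) Write(body []byte) (int, error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.body = append(w.body, body...)
	return len(body), nil
}
```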
I think this all started when the HTTP server was put into its own goroutine at https://github.com/dadav/gorge/blob/main/cmd/serve.go#L281
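For anyone following along, by "its own goroutine" I mean the usual pattern of starting the listener in the background; a generic sketch, not gorge's exact code:

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	srv := &http.Server{Addr: ":7777", Handler: http.DefaultServeMux}

	// Start the listener in a background goroutine so the main goroutine
	// is free to wait on signals, timers, or other services.
	go func() {
		if err := srv.ListenAndServe(); err != nil && err != http.ErrServerClosed {
			log.Fatalf("http server: %v", err)
		}
	}()

	select {} // block forever in this sketch
}
```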
I will keep digging tomorrow
Hi @dadav
The race conditions reported (attached in gorge-race.txt) look to be from writes to the same `http.ResponseWriter` variable `w`.
I have tried many ways to isolate the write. It works fine in `proxy.go`, as there is a clear write happening at `w.ResponseWriter.Write(w.body)`, which I wrapped in `sendCapturedResponse` in `ProxyFallback`:
```go
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	// Wrap the response writer with the synchronized response writer
	syncedWriter, ok := w.(*SyncedResponseWriter)
	if !ok {
		syncedWriter = &SyncedResponseWriter{ResponseWriter: w}
	}

	capturedResponseWriter := &capturedResponseWriter{ResponseWriter: syncedWriter}
	next.ServeHTTP(capturedResponseWriter, r)
	// ...
```
where some middleware type, `SyncedResponseWriter` (truncated for readability), did the locking:
```go
type SyncedResponseWriter struct {
	http.ResponseWriter
	mu sync.Mutex
}

func (w *SyncedResponseWriter) Write(b []byte) (int, error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	return w.ResponseWriter.Write(b)
}
```
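If the truncated part also wraps `WriteHeader` under the same mutex, it would look something like this (my guess at the shape, not the actual code I ran):

```go
func (w *SyncedResponseWriter) WriteHeader(code int) {
	w.mu.Lock()
	defer w.mu.Unlock()
	w.ResponseWriter.WriteHeader(code)
}
```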
But the `EncodeJSONResponse` function in `routers.go` has the write happening at `return json.NewEncoder(w).Encode(i)`, where there is no explicit write call: in the race file we can see the `json` encoding library doing the actual write. So it is unclear how to lock the write here, and this did not work:
```go
syncedWriter, ok := w.(*customMiddleware.SyncedResponseWriter)
if !ok {
	syncedWriter = &customMiddleware.SyncedResponseWriter{ResponseWriter: w}
}

// ...

if i != nil {
	return json.NewEncoder(syncedWriter).Encode(i)
}
```
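One idea I have not fully explored yet (a sketch of my own; `encodeJSONOnce` is not a gorge function): marshal the payload up front so the writer sees exactly one `Write` call, which is then easy to wrap or lock:

```go
import (
	"encoding/json"
	"net/http"
)

// encodeJSONOnce marshals the payload first and hands the ResponseWriter a
// single complete byte slice, so there is only one Write call to reason about.
func encodeJSONOnce(w http.ResponseWriter, status int, i interface{}) error {
	buf, err := json.Marshal(i)
	if err != nil {
		return err
	}
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	_, err = w.Write(buf)
	return err
}
```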
Going to extremes and putting mutexes all around this code just deadlocks it. Earlier in the week I went overboard with mutexes in `proxy.go`; the code ran with no race conditions, but proxying failed completely.

I am really out of my depth Go-wise here. A lot of the working-out of the problem came from a Go developer at work, and the mutex code came from AI coding assistants.

Any hints on where to go next would be appreciated.
Thanks
John
Hi @dadav
- I am migrating our CI pipeline to start using gorge, and it looks like I am yet again seeing "token" errors similar to https://github.com/dadav/gorge/issues/4. Our CI system spins up a fresh container with an empty puppet modules directory, requiring a full `r10k` install
- I can reproduce this with my test script using an empty modules directory and an empty `r10k` cache directory
- I am running the tests against a fully primed cache on a `gorge` server via a k8s `nginx` ingress running a single replica. The client is running the latest `r10k`
- I have also run these tests on bare-metal hosts with plenty of network, CPU and memory capacity. I still get similar errors, so I do not believe the infrastructure influences this bug
- There are no errors in either the `gorge` server logs or the `nginx` ingress logs
- I can reproduce the issue at will (script attached). The higher the parallelism and the higher the `r10k` concurrency, the sooner the errors appear. However, I can still see issues with parallelism set to 1 and `r10k` concurrency at 10. I am the only client hitting this `gorge` server, so it does not look load related
- If there is anything I can do to help pinpoint the issue, please let me know
Regards
John

gorge-issue.tgz