gaia-pipeline / gaia

Build powerful pipelines in any programming language.
Apache License 2.0

Do a trace and pprof run on Gaia #205

Open Skarlso opened 5 years ago

Skarlso commented 5 years ago

Actually, I know there are more immediate issues out there that need attention, but man, I ran into a race condition with the tests again today. So I think we should optimise the goroutines a bit.

michelvocks commented 5 years ago

Yeah, sounds like a good idea. The scheduler especially is critical and could potentially have some leaking goroutines.

Skarlso commented 5 years ago

Aye. I agree. I'll get into it, once the worker rebuild is finished. I'm really eager to trace things. :)
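For reference, a trace and pprof run over the test suite can be done with the standard Go toolchain; these are generic `go test` invocations, not project-specific tooling (the package path is taken from the trace below):

```sh
# Find data races in the test suite
go test -race ./...

# Capture CPU/memory profiles and an execution trace from one package's tests
go test -cpuprofile cpu.out -memprofile mem.out -trace trace.out ./workers/scheduler

# Inspect the results interactively
go tool pprof cpu.out
go tool trace trace.out
```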

Skarlso commented 5 years ago

Alright, found my first data race already. Going to work to fix this one first as it's the easiest to find and reproduce.

==================
WARNING: DATA RACE
Read at 0x00c0003a8768 by goroutine 97:
  bufio.(*Writer).Flush()
      /usr/local/Cellar/go/1.13/libexec/src/bufio/bufio.go:590 +0x79
  github.com/gaia-pipeline/gaia/plugin.(*GoPlugin).FlushLogs()
      /Users/hannibal/goprojects/gaia/gaia/plugin/plugin.go:322 +0x50
  github.com/gaia-pipeline/gaia/workers/scheduler.(*Scheduler).executeScheduler.func1()
      /Users/hannibal/goprojects/gaia/gaia/workers/scheduler/scheduler.go:556 +0x178

Previous write at 0x00c0003a8768 by goroutine 26:
  bufio.(*Writer).Write()
      /usr/local/Cellar/go/1.13/libexec/src/bufio/bufio.go:638 +0x415
  github.com/hashicorp/go-plugin.(*Client).logStderr()
      /Users/hannibal/golang/pkg/mod/github.com/hashicorp/go-plugin@v0.0.0-20180331002553-e8d22c780116/client.go:759 +0x321

Goroutine 97 (running) created at:
  github.com/gaia-pipeline/gaia/workers/scheduler.(*Scheduler).executeScheduler()
      /Users/hannibal/goprojects/gaia/gaia/workers/scheduler/scheduler.go:551 +0x393
  github.com/gaia-pipeline/gaia/workers/scheduler.(*Scheduler).executeScheduledJobs()
      /Users/hannibal/goprojects/gaia/gaia/workers/scheduler/scheduler.go:504 +0xd7
  github.com/gaia-pipeline/gaia/workers/scheduler.(*Scheduler).prepareAndExec()
      /Users/hannibal/goprojects/gaia/gaia/workers/scheduler/scheduler.go:222 +0x11c8
  github.com/gaia-pipeline/gaia/workers/scheduler.(*Scheduler).work()
      /Users/hannibal/goprojects/gaia/gaia/workers/scheduler/scheduler.go:147 +0x1b0

Goroutine 26 (running) created at:
  github.com/hashicorp/go-plugin.(*Client).Start()
      /Users/hannibal/golang/pkg/mod/github.com/hashicorp/go-plugin@v0.0.0-20180331002553-e8d22c780116/client.go:554 +0x1723
  github.com/hashicorp/go-plugin.(*Client).Client()
      /Users/hannibal/golang/pkg/mod/github.com/hashicorp/go-plugin@v0.0.0-20180331002553-e8d22c780116/client.go:296 +0x63
  github.com/gaia-pipeline/gaia/plugin.(*GoPlugin).Init()
      /Users/hannibal/goprojects/gaia/gaia/plugin/plugin.go:169 +0xbac
  github.com/gaia-pipeline/gaia/workers/scheduler.(*Scheduler).prepareAndExec()
      /Users/hannibal/goprojects/gaia/gaia/workers/scheduler/scheduler.go:206 +0xbf2
  github.com/gaia-pipeline/gaia/workers/scheduler.(*Scheduler).work()
      /Users/hannibal/goprojects/gaia/gaia/workers/scheduler/scheduler.go:147 +0x1b0
==================
Skarlso commented 5 years ago

@michelvocks Off the top of your head, do you know the reason for implementing Plugin with pointers? I don't see a particular reason; plugins aren't shared as far as I can tell.

The Flush operation is causing a race condition through the scheduler: apparently more than one goroutine can't flush at the same time. This is the most obvious race in Gaia.

Using values would fix it, since plugins wouldn't be shared throughout Gaia. I don't know yet what that would mean in terms of memory usage, though. I haven't tested it yet, but I plan to. :)

michelvocks commented 5 years ago

@Skarlso Yes. One plugin instance is always one plugin run. We need to have an instance since we need to run multiple jobs synchronously via this instance. If we didn't have a plugin instance, we would have to start the plugin for every pipeline job, which is very inefficient.

Additionally, a user potentially wants to share data/information between jobs within a pipeline run. If we created a new plugin instance for every job, jobs couldn't share in-memory data and you would literally be forced to write information to disk or to a remote cache/database, which is more work.

Skarlso commented 5 years ago

Couldn't that be achieved by handing back a modified version of the struct, instead of passing it around as a pointer?

michelvocks commented 5 years ago

I don't know how but maybe we are talking past each other. Could you elaborate a bit more on the problem/solution?

In my opinion, we could easily add a mutex to the Plugin struct and add a lock to the FlushLogs method?

Skarlso commented 5 years ago

Adding a mutex is a patch. :) That's like using a buffered channel instead of an unbuffered one because it appears to solve the problem.

I'm trying to eliminate the possibility of the race, rather than just putting a mutex on it.

Of course a mutex would be easy. But that's like putting a patch on a leaking boat instead of fixing the boat, I think.


Skarlso commented 5 years ago

That said, it may turn out that a mutex IS the solution. But I would like to make sure I'm not just patching a sinking boat. :)

Skarlso commented 5 years ago

I'm gonna show you in a draft PR what I meant, and I'd like your opinion on it. :)

Skarlso commented 5 years ago

Ah, yes. It's way too much work and ends up being ugly. Mutex it is I guess. :)

Skarlso commented 5 years ago

Hah no, I succeeded, but get this: because of other things, like Job and such, I still got a data race at flush! But this is a good thing, because now I've verified that a mutex is indeed a good approach at this point. :) Awesome.

Skarlso commented 5 years ago

Get this! :) Just to ease my mind, I went further and rewrote a large chunk of the code to pass the plugin around as a value instead of a pointer. Where there is a mutation, I pass the mutated value back instead of using pointer semantics. I got rid of the race. :) However, I don't like the resulting code that much. It's not intuitive enough: sometimes it passes the value back, sometimes it doesn't. So I'm not going to go with that. Not to mention that I don't have data on the memory-allocation increase caused by the copies. Ah well. 🤷‍♂

Skarlso commented 5 years ago

Huh, so guess what... The mutex didn't solve the data race.

Skarlso commented 5 years ago

Oh I know why.

Skarlso commented 5 years ago

Well, this is problematic. Gaia is racing with go-plugin's logStderr (c.config.Stderr.Write([]byte(line))), since we replace stderr with our own buffer. We can't just lock the flush; the flush is racing with go-plugin. Eh.

Skarlso commented 5 years ago

So the flush happens so that the logs are displayed in the log window, if I see this correctly. But the plugin is also writing to the buffer we provide it. This is problematic because even if we lock on our end, the plugin of course won't lock, since it's using a plain Write. Which means a data race on the buffer.

Skarlso commented 5 years ago

Okay, so the solution must be to implement our own writer around an underlying buffer and io.Writer, whose Write is basically a bufio.Write but with a lock in it. We then pass that writer to go-plugin.