I would suspect that this is a bug in our code, not a rate limit. We should get explicit errors when rate limits are hit.
Can you run it again with the TF_LOG=debug env variable and post the output (with anything sensitive redacted)?
Here's the full redacted output of terraform refresh up until the point that it hangs:
https://gist.github.com/pwnage101/d46f002861eefd55a8ec6909636330e3#file-logs-redacted-txt
Edit: moved the output into a gist for better scrolling on this issue.
thanks @pwnage101 – I am digging through the log but nothing obvious yet.
Can you try running terraform refresh again with -parallelism=1? That might help us narrow down the problem to a particular resource or resource type.
Reference: https://www.terraform.io/docs/commands/refresh.html#parallelism-n
I'm not currently at my work laptop anymore, but I will at least say that when running it several times with -parallelism=1, it still apparently halts at a random resource. However, I didn't take note of the resource types, which I will check tomorrow.
I'll also say this did happen in the middle of applying changes, which resulted in an endless loop where 3 resources kept retrying creation.
I ran it three more times and redacted the output:
Careful while scrolling, all three logs are on the same page.
In the case of logs.3.redacted.txt, I ran against the specific resource it appeared to be halting at, and it succeeded. In the other two logs, search for "Interrupt received" to see where I pressed ^C.
@pwnage101 Thanks for the logs, I am digging through and trying to see what I can find.
For the resources that are in a loop, are they in the state file or are these new resources?
Are you referring to when I had 3 resources attempting to be created endlessly? That only happened once, and those console logs are long gone. I think terraform apply was trying to create about 90 new resources on snowflake (the initial refresh succeeded, somehow) and managed to create most of them successfully, but before the end it got stuck on 3 of them and the logs kept printing them over and over again with an incrementing duration. I terminated it at 10 minutes.
I did some iterative debugging by adding a bunch of debug statements, and discovered that the hang always happens on this line (while calling rows.Next()):
It usually gets through some, but not all, of the iterations; then the next call to rows.Next() hangs. I have not yet dug any deeper (i.e. I have not added any debug print statements to sqlx yet).
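To make the failure point concrete, the read loop involved has roughly this shape. This is a sketch against plain database/sql rather than sqlx, and the query is a placeholder, not the provider's actual code:

// Sketch only: plain database/sql instead of sqlx; the query is a placeholder.
// Imports needed: "database/sql", "log"
func readGrants(db *sql.DB) error {
	rows, err := db.Query(`SHOW GRANTS ON SCHEMA "MY_DB"."MY_SCHEMA"`)
	if err != nil {
		return err
	}
	i := 0
	for rows.Next() { // the hang happens inside one of these calls
		log.Printf("[DEBUG] scanning row %d", i)
		// ... rows.Scan(...) into the grant fields ...
		i++
	}
	return rows.Err()
}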
It seems like once this hung state is encountered, anything that uses db in any way hangs. I tried adding a 10 second timeout and forcing a db.Ping(), but even Ping hangs! Only db.Close() succeeds, but then I can't easily re-open the connection in a way that persists to subsequent resources, since all of that logic is tucked away in ConfigureProvider() with no easy way to update the meta interface outside of it.
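Roughly what that timeout attempt looked like, as a sketch (the channel/timer wrapper and log lines are illustrative, not the exact debug code):

// Sketch: give up waiting on db.Ping() after 10 seconds. In the hung state
// the Ping goroutine itself never returns, so only the timeout branch fires.
// Imports needed: "database/sql", "log", "time"
func pingWithTimeout(db *sql.DB) {
	done := make(chan error, 1)
	go func() { done <- db.Ping() }()

	select {
	case err := <-done:
		log.Printf("[DEBUG] ping returned: %v", err)
	case <-time.After(10 * time.Second):
		log.Printf("[DEBUG] ping still blocked after 10s")
	}
}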
In case you're curious, or if it helps, I'm working off of this debug branch: https://github.com/pwnage101/terraform-provider-snowflake/commit/085ec9a563804bfb9c3ff14b00d54f3fffdb4a99
I got more info by invoking db.Stats() during a hang:
db.Stats(): {MaxOpenConnections:0 OpenConnections:18 InUse:18 Idle:0 WaitCount:0 WaitDuration:0s MaxIdleClosed:0 MaxLifetimeClosed:0}
I'm not really familiar with the database/sql module internals, but based on this output it would seem like all 18 out of 18 connections are "InUse", which might explain the hang if that means none are available. Is this some sort of connection leak?
When it isn't hanging, it usually prints 17/18:
db.Stats(): {MaxOpenConnections:0 OpenConnections:18 InUse:17 Idle:1 WaitCount:0 WaitDuration:0s MaxIdleClosed:0 MaxLifetimeClosed:0}
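The stats above come from a small debug helper along these lines (a sketch; the polling interval is arbitrary):

// Sketch: periodically dump pool statistics so InUse/Idle can be compared
// between healthy and hung runs.
// Imports needed: "database/sql", "log", "time"
func dumpStats(db *sql.DB) {
	go func() {
		for range time.Tick(5 * time.Second) {
			log.Printf("[DEBUG] db.Stats(): %+v", db.Stats())
		}
	}()
}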
I was able to fix the leak by adding rows.Close() before this line:
But unfortunately that was a red herring since I'm still getting hangs.
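For context, the leak fix just makes sure the result set is closed once iteration finishes, along these lines (a sketch; the exact placement in the provider differs):

// Sketch: closing rows returns its connection to the pool, which is why
// InUse stopped creeping up, though the rows.Next() hang remained.
// Imports needed: "database/sql"
func readGrantsClosed(db *sql.DB) error {
	rows, err := db.Query(`SHOW GRANTS ON SCHEMA "MY_DB"."MY_SCHEMA"`)
	if err != nil {
		return err
	}
	defer rows.Close() // the missing call

	for rows.Next() {
		// ... scan ...
	}
	return rows.Err()
}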
I also re-tested with a terraform binary that I built from the master branch, and with the latest release of go:
% TF_LOG=DEBUG terraform --version
2019/09/06 16:25:14 [INFO] Terraform version: 0.12.9 dev
2019/09/06 16:25:14 [INFO] Go runtime version: go1.13
...
Still no fix.
@pwnage101 thanks for this! This gives me some clues to work off of.
Another clue: I recompiled the provider with debug symbols, then attached GDB while it was hanging:
#0 runtime.epollwait () at /usr/lib/go-1.13/src/runtime/sys_linux_amd64.s:673
#1 0x000000000042cb70 in runtime.netpoll (block=true, ~r1=...) at /usr/lib/go-1.13/src/runtime/netpoll_epoll.go:71
#2 0x00000000004363e5 in runtime.findrunnable (gp=0xc000048000, inheritTime=false) at /usr/lib/go-1.13/src/runtime/proc.go:2372
#3 0x00000000004370be in runtime.schedule () at /usr/lib/go-1.13/src/runtime/proc.go:2524
#4 0x0000000000437ae6 in runtime.goexit0 (gp=0xc000461c80) at /usr/lib/go-1.13/src/runtime/proc.go:2727
#5 0x000000000045adeb in runtime.mcall () at /usr/lib/go-1.13/src/runtime/asm_amd64.s:318
#6 0x000000000045ad04 in runtime.rt0_go () at /usr/lib/go-1.13/src/runtime/asm_amd64.s:220
#7 0x0000000000000000 in ?? ()
I also created this minimal example, which failed to reproduce the bug. It invokes a "SHOW GRANTS ON SCHEMA ..." statement in a loop and increments a counter idx. The idx counter just counted upwards forever (it made it to 1000+, whereas I expected it to fail before 100).
https://gist.github.com/pwnage101/216ae485db998601679a7d9659fe0ce3#file-minimal_example-go
The goal was to test the theory that there's a query count limit somewhere in the gosnowflake/sqlx/sql/API stack, but it seems there is not.
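For reference, the gist boils down to roughly this shape (condensed here, not verbatim; the DSN and schema name are placeholders):

// Condensed sketch of the standalone test: run the same SHOW GRANTS query in a
// loop and count iterations. In practice idx climbed past 1000 without hanging.
package main

import (
	"database/sql"
	"log"

	_ "github.com/snowflakedb/gosnowflake" // registers the "snowflake" driver
)

func main() {
	db, err := sql.Open("snowflake", "USER:PASSWORD@ACCOUNT/DB") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	for idx := 0; ; idx++ {
		rows, err := db.Query(`SHOW GRANTS ON SCHEMA "MY_DB"."MY_SCHEMA"`)
		if err != nil {
			log.Fatal(err)
		}
		for rows.Next() {
			// discard the rows; only the call pattern matters here
		}
		rows.Close()
		log.Println("idx =", idx)
	}
}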
ALMOST GOT IT!
Of the many random changes I made, this one seems to fix the problem:
diff --git a/pkg/db/db.go b/pkg/db/db.go
index b3feb1c..97098d2 100644
--- a/pkg/db/db.go
+++ b/pkg/db/db.go
@@ -4,6 +4,7 @@ import (
"context"
"database/sql"
"fmt"
+ "log"
"regexp"
"github.com/ExpansiveWorlds/instrumentedsql"
@@ -15,7 +16,7 @@ func init() {
logger := instrumentedsql.LoggerFunc(func(ctx context.Context, msg string, keyvals ...interface{}) {
s := fmt.Sprintf("[DEBUG] %s %v\n", msg, keyvals)
- fmt.Println(re.ReplaceAllString(s, " "))
+ log.Println(re.ReplaceAllString(s, " "))
})
sql.Register("snowflake-instrumented", instrumentedsql.WrapDriver(&gosnowflake.SnowflakeDriver{}, instrumentedsql.WithLogger(logger)))
I still have to downgrade terraform from git master to the latest release and test it again, but this seems really promising.
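One plausible (unconfirmed) explanation for why that change matters: fmt.Println writes to the provider process's stdout, while log.Println writes to stderr, and Terraform's plugin host treats those two streams differently, so a query could conceivably end up blocked on an unread stdout pipe mid-iteration once enough instrumentation output has accumulated.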
We have a terraform project consisting of about 130 resources (including users, roles, databases, schemas, all sorts of grants, etc.) and we're bumping into an issue with what appears to be snowflake rate limiting while running terraform plan. Running terraform plan no longer works: while it is refreshing state, it will suddenly stop at a random resource (nondeterministically) and hang there forever until I am forced to interrupt it. I have seen it hang on a single resource overnight!
However, running a specific target succeeds, presumably because rate limiting was not encountered:
As a workaround, I am able to refresh and apply all 130 resources this way, one at a time. However, this severely limits the usability of terraform, and precludes our ability to use tools such as Atlantis.
Does anybody else run medium size snowflake projects with 100+ resources? Do you run into rate limiting?