Snowflake-Labs / terraform-provider-snowflake

Terraform provider for managing Snowflake accounts
https://registry.terraform.io/providers/Snowflake-Labs/snowflake/latest
MIT License

snowflake provider not scalable (snowflake rate limiting) #69

Closed: pwnage101 closed this issue 5 years ago

pwnage101 commented 5 years ago

We have a terraform project consisting of about 130 resources (including users, roles, databases, schemas, all sorts of grants, etc.) and we're bumping into an issue with what appears to be snowflake rate limiting while running terraform plan.

Running terraform plan no longer works: while refreshing state, it suddenly stops at a random (nondeterministic) resource and hangs there forever until I am forced to interrupt it. I have seen it hang on a single resource overnight!

$ /opt/terraform/terraform-0.12.7 plan 
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.

snowflake_role.tableau_service_role: Refreshing state... [id=TABLEAU_SERVICE_ROLE]
snowflake_warehouse.tableau_service: Refreshing state... [id=TABLEAU_SERVICE]
snowflake_role.all_human_user_roles["<redacted>"]: Refreshing state... [id=<redacted>_ROLE]
snowflake_role.stitch_loader_role: Refreshing state... [id=STITCH_LOADER_ROLE]
snowflake_role.all_human_user_roles["<redacted>"]: Refreshing state... [id=<redacted>_ROLE]
snowflake_role.all_human_user_roles["<redacted>"]: Refreshing state... [id=<redacted>_ROLE]
snowflake_role.all_human_user_roles["<redacted>"]: Refreshing state... [id=<redacted>_ROLE]
[...SKIP 76 LINES...]
snowflake_database_grant.prod_readwrite_usage: Refreshing state... [id=PROD|||USAGE]
snowflake_warehouse_grant.transforming_usage: Refreshing state... [id=TRANSFORMING|||USAGE]
snowflake_user.dbt_trasformer: Refreshing state... [id=DBT_TRASFORMER]
snowflake_user.all_human_users["<redacted>"]: Refreshing state... [id=<redacted>]
snowflake_user.all_human_users["<redacted>"]: Refreshing state... [id=<redacted>]
snowflake_user.all_human_users["<redacted>"]: Refreshing state... [id=<redacted>]
^CInterrupt received.
Please wait for Terraform to exit or data loss may occur.
Gracefully shutting down...
^CTwo interrupts received. Exiting immediately. Note that data
loss may have occurred.

Error: operation canceled

However, running a specific target succeeds, presumably because rate limiting was not encountered:

$ /opt/terraform/terraform-0.12.7 plan  -target=snowflake_role_grants.tableau_service_user_role_grant
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.

snowflake_role.tableau_service_role: Refreshing state... [id=TABLEAU_SERVICE_ROLE]
snowflake_warehouse.tableau_service: Refreshing state... [id=TABLEAU_SERVICE]
snowflake_user.tableau_service: Refreshing state... [id=TABLEAU_SERVICE]
snowflake_role_grants.tableau_service_user_role_grant: Refreshing state... [id=TABLEAU_SERVICE_ROLE]

------------------------------------------------------------------------

No changes. Infrastructure is up-to-date.

This means that Terraform did not detect any differences between your
configuration and real physical resources that exist. As a result, no
actions need to be performed.

As a workaround, I am able to refresh and apply all 130 resources this way, one at a time. However, this severely limits the usability of terraform, and precludes our ability to use tools such as Atlantis.

Does anybody else run medium-sized snowflake projects with 100+ resources? Do you run into rate limiting?

ryanking commented 5 years ago

I would suspect that this is a bug in our code, not a rate limit. We should get explicit errors when rate limits are hit.

Can you run it again with the TF_LOG=debug env variable and post the output (with anything sensitive redacted)?

pwnage101 commented 5 years ago

Here's the full redacted output of terraform refresh up until the point that it hangs:

https://gist.github.com/pwnage101/d46f002861eefd55a8ec6909636330e3#file-logs-redacted-txt

Edit: moved the output into a gist for better scrolling on this issue.

ryanking commented 5 years ago

thanks @pwnage101 – I am digging through the log but nothing obvious yet.

Can you try running terraform refresh again with -parallelism=1? That might help us narrow the problem down to a particular resource or resource type.

Reference: https://www.terraform.io/docs/commands/refresh.html#parallelism-n

pwnage101 commented 5 years ago

I'm not currently at my work laptop anymore, but I will at least say that running it several times with -parallelism=1 still halts at an apparently random resource each time. I didn't take note of the resource types, though, which I will check tomorrow.

I'll also say that this did happen once in the middle of applying changes, which resulted in an endless loop of 3 resources being retried for creation.

pwnage101 commented 5 years ago

I ran it three more times and redacted the output:

Careful while scrolling, all three logs are on the same page.

In the case of logs.3.redacted.txt, I ran against the specific resource which it appears to be halting at, and it succeeded. In the other two logs, search for "Interrupt received" to see where I pressed ^C.

ryanking commented 5 years ago

@pwnage101 Thanks for the logs, I am digging through and trying to see what I can find.

For the resources that are in a loop, are they in the state file or are these new resources?

pwnage101 commented 5 years ago

For the resources that are in a loop, are they in the state file or are these new resources?

Are you referring to when I had 3 resources attempting to be created endlessly? That only happened once, and those console logs are long gone. I think terraform apply was trying to create like 90 new resources on snowflake (the initial refresh succeeded, somehow) and managed to create most of them successfully, but before the end it got stuck on 3 of them and the logs kept printing them over and over again with an incrementing duration. I terminated it at 10 minutes.

pwnage101 commented 5 years ago

I did some iterative debugging by adding a bunch of debug statements, and discovered that the hang always happens on this line (while calling rows.Next()):

https://github.com/chanzuckerberg/terraform-provider-snowflake/blob/439f6f55/pkg/resources/grant_helpers.go#L142

It usually gets through some, but not all, of the iterations; then the next call to rows.Next() hangs. I have not dug any deeper yet (i.e. I have not added any debug print statements to sqlx yet).
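For context, the read path there follows the usual sqlx row-iteration pattern. Roughly (an illustrative sketch using github.com/jmoiron/sqlx; the query and identifiers are not the provider's exact code):

func readGrants(db *sqlx.DB) ([]map[string]interface{}, error) {
    rows, err := db.Queryx(`SHOW GRANTS ON SCHEMA "MYDB"."MYSCHEMA"`)
    if err != nil {
        return nil, err
    }
    // If Close is never called, the underlying connection is never
    // returned to the database/sql pool.
    defer rows.Close()

    var grants []map[string]interface{}
    for rows.Next() { // the hang described above happens inside this call
        row := map[string]interface{}{}
        if err := rows.MapScan(row); err != nil {
            return nil, err
        }
        grants = append(grants, row)
    }
    return grants, rows.Err()
}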

pwnage101 commented 5 years ago

It seems like once this hung state is encountered, everything that uses db in any way hangs. I tried adding a 10 second timeout and forcing db.Ping(), but even Ping hangs!
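For reference, the bounded variant of that check looks roughly like this (sketch only, not my exact debug code):

ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
// PingContext can only give up at points where database/sql or the driver
// actually checks ctx; a hang inside a plain blocking call (e.g. a write)
// is not interrupted and still hangs, just like db.Ping().
if err := db.PingContext(ctx); err != nil {
    log.Printf("[DEBUG] ping did not complete: %v", err)
}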

Only db.Close() succeeds, but then I can't easily re-open it in a way that persists to subsequent resources since all of that logic is tucked away in ConfigureProvider() with no easy way to update the meta interface outside of it.

In case you're curious, or if it helps, I'm working off of this debug branch: https://github.com/pwnage101/terraform-provider-snowflake/commit/085ec9a563804bfb9c3ff14b00d54f3fffdb4a99

pwnage101 commented 5 years ago

I got more info by invoking db.Stats() during a hang:

db.Stats(): {MaxOpenConnections:0 OpenConnections:18 InUse:18 Idle:0 WaitCount:0 WaitDuration:0s MaxIdleClosed:0 MaxLifetimeClosed:0}

I'm not really familiar with the database/sql package internals, but based on this output it seems that 18 out of 18 connections are "InUse", which might explain the hang if that means none are available. Is this some sort of connection leak?

When it isn't hanging, it usually prints 17/18:

db.Stats(): {MaxOpenConnections:0 OpenConnections:18 InUse:17 Idle:1 WaitCount:0 WaitDuration:0s MaxIdleClosed:0 MaxLifetimeClosed:0}
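For anyone following along, those counters come from sql.DBStats; a debug helper along these lines prints them (a sketch, not code from the provider):

// MaxOpenConnections of 0 means the pool size is unlimited. InUse equal to
// OpenConnections with Idle at 0 means every open connection is checked out,
// which is what you would expect if result sets are never being closed.
func logPoolStats(db *sql.DB) {
    s := db.Stats()
    log.Printf("[DEBUG] open=%d inUse=%d idle=%d waitCount=%d waitDuration=%s",
        s.OpenConnections, s.InUse, s.Idle, s.WaitCount, s.WaitDuration)
}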

pwnage101 commented 5 years ago

I was able to fix the leak by adding rows.Close() before this line:

https://github.com/chanzuckerberg/terraform-provider-snowflake/blob/master/pkg/resources/schema.go#L268

But unfortunately that was a red herring since I'm still getting hangs.
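One way to make the remaining leak deterministic while bisecting (just a debugging idea, not something the provider does) is to cap the pool at a single connection, so the very next query after a leaked *sql.Rows blocks immediately rather than only occasionally:

// Debug-only: with a one-connection pool, any row set left open blocks the
// next db.Query call right away, which makes the leaking call site easy to find.
db.SetMaxOpenConns(1)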

pwnage101 commented 5 years ago

I also re-tested with a Terraform binary I just built from the master branch, and with the latest Go release:

 % TF_LOG=DEBUG terraform --version                                                                                                 
2019/09/06 16:25:14 [INFO] Terraform version: 0.12.9 dev 
2019/09/06 16:25:14 [INFO] Go runtime version: go1.13
...

Still no fix.

ryanking commented 5 years ago

@pwnage101 thanks for this! This gives me some clues to work off of.

pwnage101 commented 5 years ago

Another clue: I recompiled the provider with debug symbols, then attached GDB while it was hanging:

#0  runtime.epollwait () at /usr/lib/go-1.13/src/runtime/sys_linux_amd64.s:673
#1  0x000000000042cb70 in runtime.netpoll (block=true, ~r1=...) at /usr/lib/go-1.13/src/runtime/netpoll_epoll.go:71
#2  0x00000000004363e5 in runtime.findrunnable (gp=0xc000048000, inheritTime=false) at /usr/lib/go-1.13/src/runtime/proc.go:2372
#3  0x00000000004370be in runtime.schedule () at /usr/lib/go-1.13/src/runtime/proc.go:2524
#4  0x0000000000437ae6 in runtime.goexit0 (gp=0xc000461c80) at /usr/lib/go-1.13/src/runtime/proc.go:2727
#5  0x000000000045adeb in runtime.mcall () at /usr/lib/go-1.13/src/runtime/asm_amd64.s:318
#6  0x000000000045ad04 in runtime.rt0_go () at /usr/lib/go-1.13/src/runtime/asm_amd64.s:220
#7  0x0000000000000000 in ?? ()
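That backtrace only shows the scheduler thread parked in the network poller, i.e. nothing is currently runnable; a full goroutine dump (for example by sending the process SIGQUIT) would show which goroutine is actually blocked and where.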

pwnage101 commented 5 years ago

I also created this minimal example, which failed to reproduce the bug. It invokes a "SHOW GRANTS ON SCHEMA ..." statement in a loop and increments a counter idx. The idx counter just counted upward forever (it made it past 1000, whereas I expected it to fail before 100).

https://gist.github.com/pwnage101/216ae485db998601679a7d9659fe0ce3#file-minimal_example-go

The goal was to test the theory that there's a query count limit somewhere in the gosnowflake/sqlx/sql/API stack, but it seems there is not.
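For reference, this is roughly the shape of that minimal example, reconstructed from the description above (the DSN and schema name are placeholders; the actual code is in the gist):

package main

import (
    "database/sql"
    "fmt"
    "log"

    _ "github.com/snowflakedb/gosnowflake"
)

func main() {
    // Placeholder DSN; the real example connects with actual credentials.
    db, err := sql.Open("snowflake", "user:password@account/db/schema")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    for idx := 0; ; idx++ {
        rows, err := db.Query(`SHOW GRANTS ON SCHEMA "MYDB"."MYSCHEMA"`)
        if err != nil {
            log.Fatal(err)
        }
        for rows.Next() {
            // drain the result set
        }
        rows.Close()
        fmt.Println("queries completed:", idx) // kept climbing past 1000 without hanging
    }
}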

pwnage101 commented 5 years ago

ALMOST GOT IT!

Of the many random changes I made, this one seems to fix the problem:

diff --git a/pkg/db/db.go b/pkg/db/db.go
index b3feb1c..97098d2 100644
--- a/pkg/db/db.go
+++ b/pkg/db/db.go
@@ -4,6 +4,7 @@ import (
        "context"
        "database/sql"
        "fmt"
+       "log"
        "regexp"

        "github.com/ExpansiveWorlds/instrumentedsql"
@@ -15,7 +16,7 @@ func init() {

        logger := instrumentedsql.LoggerFunc(func(ctx context.Context, msg string, keyvals ...interface{}) {
                s := fmt.Sprintf("[DEBUG] %s %v\n", msg, keyvals)
-               fmt.Println(re.ReplaceAllString(s, " "))
+               log.Println(re.ReplaceAllString(s, " "))
        })

        sql.Register("snowflake-instrumented", instrumentedsql.WrapDriver(&gosnowflake.SnowflakeDriver{}, instrumentedsql.WithLogger(logger)))

I still have to downgrade terraform from git master to the latest release and test it again, but this seems really promising.
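A plausible explanation for why this helps: fmt.Println writes to the provider's stdout, which Terraform reserves for the plugin protocol, while log.Println defaults to stderr, which Terraform just captures into the TF_LOG stream. Routing the instrumentedsql debug output away from stdout would therefore remove the interference.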