crossplane / terrajet

Generate Crossplane Providers from any Terraform Provider
https://crossplane.io
Apache License 2.0
289 stars 38 forks source link

Discuss filesystem locks and the need for repeated calls to get the operation status #36

Closed muvaf closed 2 years ago

muvaf commented 2 years ago

What problem are you facing?

During the initial implementation, in order to move fast and see our limits we deferred some of the discussion about the implementation and how we should do the bookkeeping was one of them. Currently, we manage the state and bookkeeping using filesystem. However, we also interact with Terraform's lock and state files, which makes the code complex to understand for readers and we don't really need to keep that information out of memory since the crash of the process means a restart of the Pod anyway. Additionally, there are some opportunities to merge some of the functionality in the code to get a simpler flow that is easier to read and change.

How could Terrajet help solve your problem?

I suggest that we use a global map that does the bookkeeping of the running CLI executions. Roughly, I think we need the following improvements in the initial implementation. Please feel free to add more things that we had deferred to after PoC phase. cc @ulucinar @turkenh

Note that some of the technical debt is accrued on purpose because we needed to be able to work in parallel during PoC and discover things from scratch. Before declaring terrajet as prod-ready and solidify the implementation with more tests, I think it's time to revisit the parts that we completed during that phase and refactor into a more cohesive structure because thanks to that initial effort we know what works, how Terraform behaves and the rough performance implications of the decisions.

muvaf commented 2 years ago

With https://github.com/crossplane/crossplane-runtime/pull/283 , generic reconciler doesn't call Create even if we say the resource does not exist for given grace period. I've reduced the grace period to 0 but then it re-queues immediately after every Create call until we report that resource exists since it assumes that you'd never call Create again. This causes excessive re-queueing until the resource is ready, which could take more than a minute.

Until we lift the requirement that the controller needs to call Apply again to get the results, we can't update to the new controller-runtime. Observe should be able to say that the resource exists until the creation is completed, similar to RDS DB Instance and other resources whose creation is a long-running operation.

turkenh commented 2 years ago

I agree that we should simplify the flow considering well-known environment we know our controllers running:

So, I don't see much benefit in trying to keep things synced with filesystem locks.

muvaf commented 2 years ago

Since it'll affect our planning for the next sprint, I created https://github.com/crossplane-contrib/terrajet/issues/74 after getting the thumbs-up from @turkenh

muvaf commented 2 years ago

Closing this, let's add comments to https://github.com/crossplane-contrib/terrajet/issues/74 for further discussion.