ManageIQ / manageiq-automation_engine

Automation engine for ManageIQ
Apache License 2.0

[StateVarHash] Prevent purging #478

Closed: NickLaMuro closed this 2 years ago

NickLaMuro commented 3 years ago

The BinaryBlob of the StateVarHash is purged via the following ActiveRecord query:

where(:resource => nil)

Which simply checks whether resource_id is nil. Since the BinaryBlob is destroyed when it is accessed in init_with, this on its own will prevent it from being purged prematurely. However, by also storing a pseudo negative ID based on the current Time, the blob can additionally be purged by age, avoiding a buildup of BinaryBlob records from failed automate jobs without having to add a created_at column.
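The negative-ID trick described above can be sketched in plain Ruby. This is an illustrative helper, not the PR's actual code; names like StateVarBlobStamp are hypothetical, and the 30-day window is an arbitrary choice:

```ruby
# Hypothetical helper illustrating the negative-ID idea: the creation time
# is stored as a negative integer in resource_id, so the record is never
# matched by `where(:resource => nil)` (resource_id is not NULL), yet a
# purger can still recover its age from the value.
module StateVarBlobStamp
  # Encode "now" as a negative epoch integer; negative so it can never
  # collide with a real foreign key value.
  def self.stamp
    -Time.now.to_i
  end

  # Only pseudo (negative) ids are age-purgeable; real foreign keys and
  # NULLs are left alone.
  def self.purgeable?(resource_id, older_than: 30 * 24 * 60 * 60)
    return false unless resource_id.is_a?(Integer) && resource_id.negative?

    Time.now - Time.at(-resource_id) > older_than
  end
end
```

A purge query could then select blobs whose resource_id is negative and decodes to a timestamp older than the cutoff, instead of selecting every blob with a nil resource.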


coveralls commented 3 years ago

Pull Request Test Coverage Report for Build 5881


Totals:
Change from base Build 5849: 0.01%
Covered Lines: 5057
Relevant Lines: 5897

💛 - Coveralls
Fryguy commented 3 years ago

Is there no actual object to tie the resource to? Like the provision request?

Fryguy commented 3 years ago

The reason I'm asking is that while the StateVarHash itself can't know, it's possible the caller could know and could tell the StateVarHash via some accessor.

Fryguy commented 3 years ago

Oh I see - now that I'm digging in, this has nothing to do with provisioning per se, but more to do with how the automate engine works. Perhaps we can tie it to the domain, or perhaps the persisted workspace?

NickLaMuro commented 3 years ago

@Fryguy There are a lot of touchpoints where MiqAeEngine.deliver and friends get called, which eventually call into MiqAeWorkspaceRuntime. The runtime doesn't actually have any persistence available to it, so it would have to be tied to whatever spawned the job, which can be any number of things.

This was a surgical fix; what you are suggesting would probably require ripping open a bunch of classes and updating methods so that we could pass in the ids properly. I did consider it before coming up with this approach, but I think this option is cleaner and simpler, because as I mentioned in the PR, the StateVarHash does eventually delete itself:

https://github.com/ManageIQ/manageiq-automation_engine/blob/631be2c255f7ab6e7b090b37b10529801b7938cf/lib/miq_automation_engine/engine/miq_ae_engine/state_var_hash.rb#L16-L22

So the BinaryBlob::Purging changes are just an extra safeguard.
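The read-once behavior referenced above can be paraphrased with a toy in-memory stand-in for BinaryBlob (the real one is an ActiveRecord model; all names here are hypothetical, not the linked source):

```ruby
# Minimal stand-in for the BinaryBlob model, just enough to show the pattern.
class FakeBinaryBlob
  STORE = {}

  def self.create(data)
    id = STORE.size + 1
    STORE[id] = data
    id
  end

  def self.find(id)
    STORE.fetch(id)
  end

  def self.destroy(id)
    STORE.delete(id)
  end
end

class StateVarHashSketch < Hash
  # Deserializing the state vars also destroys the backing blob, so under
  # normal operation no orphaned BinaryBlob is left behind -- it can only
  # be read once.
  def self.load(blob_id)
    data = FakeBinaryBlob.find(blob_id)
    FakeBinaryBlob.destroy(blob_id)
    new.merge!(Marshal.load(data))
  end

  def self.dump(hash)
    FakeBinaryBlob.create(Marshal.dump(hash.to_h))
  end
end
```

The purge-timing bug arises in the window between dump and load: during that window the real blob is orphaned, so a naive `where(:resource => nil)` purge can delete it before the workspace ever reads it back.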


Of note: StateVarHash is also a relatively new addition to the code base:

https://github.com/ManageIQ/manageiq-automation_engine/pull/405

And it is unclear what it is solving, besides perhaps avoiding a large MiqQueue entry.

chessbyte commented 3 years ago

For the Automate API approach (a spike we did years ago), I recall us saving the MiqAeWorkspaceRuntime object in SQL so that callers can query things via an API.

Fryguy commented 3 years ago

Yes, MiqAeWorkspaceRuntime is serializable to the database, so the mechanism exists, but I'm not sure it includes the StateVarHash. That said, I think it's only serialized right before we call out to a method (so that the method can call back into the workspace if needed to get/set things). However, I think you can only get/set state vars during methods anyway, so when that happens you always have a serialized workspace.

I think the reason the StateVarHash is serialized to BinaryBlob specifically is because users can technically store whatever they want in there, such as random Ruby objects, so it's not necessarily JSON/YAML serializable, whereas BinaryBlob can serialize Marshal'd objects.
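That design choice is easy to demonstrate in isolation: Marshal round-trips arbitrary Ruby objects, while generic JSON serialization does not (illustrative snippet, not from the codebase):

```ruby
require "json"

# Users can store arbitrary Ruby objects in state vars, e.g. a Struct:
Point = Struct.new(:x, :y)
original = Point.new(1, 2)

# Marshal preserves the class and its state, so the object round-trips:
restored = Marshal.load(Marshal.dump(original))
# restored is a Point equal to the original

# The json gem's generic Object#to_json just quotes #to_s, losing both the
# class and the field structure:
degraded = original.to_json
# degraded is a quoted string along the lines of "#<struct Point x=1, y=2>"
```

Since state vars can contain objects like this, a text column holding JSON or YAML would not be a safe home for them, whereas BinaryBlob can store the Marshal'd bytes as-is.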

This was a surgical fix; what you are suggesting would probably require ripping open a bunch of classes and updating methods so that we could pass in the ids properly.

Oh yes, I understand that, and I'll probably be ok with that. When I see a surgical fix I generally like to also understand at that time what the long term fix actually is (if there is one), and the level of effort to getting there.

Fryguy commented 3 years ago

So the BinaryBlob::Purging changes are just an extra safeguard.

Instead of hacking in something specific to one use case, would it perhaps be better to make it more generalized? That is, when purging binary blobs, only purge orphaned blobs that are also over a day old? Then we don't have to put fake classes into resource_type and a negative id into resource_id.
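The generalized rule, purge only blobs that are both orphaned and past a grace period, is simple to express. Here it is as a hypothetical plain-Ruby predicate (in the real code this would become an ActiveRecord purge scope; the names and the one-day grace period are from the suggestion above, not shipped code):

```ruby
# Hypothetical generalized purge predicate: a blob is purge-eligible only
# when it is orphaned (no resource reference) AND older than a grace
# period, so a freshly created StateVarHash blob survives long enough for
# the automate job to read it back.
ONE_DAY = 24 * 60 * 60

def purgeable_blob?(resource_id, created_at, grace: ONE_DAY, now: Time.now)
  resource_id.nil? && (now - created_at) > grace
end
```

The catch, discussed below, is that binary_blobs had no created_at column at the time, which is exactly what the negative-id hack was working around.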

Fryguy commented 3 years ago

Another option could be a time_to_live/ttl column (or maybe purge_after) that holds a value saying how long to keep it around for. While I don't really like the hacking in general, I think the only part that actually bothers me with it is that we are storing a value with meaning into the resource_id column (i.e. a negative time value as an integer into something that is supposed to hold a foreign key reference).
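A purge_after column would move the lifetime decision to whoever writes the blob. A minimal sketch of the purger's side of that contract (purge_after is only a proposal in this thread, not an actual column):

```ruby
# With a purge_after timestamp, the creator of the blob decides its
# lifetime up front, and the purger only has to compare against the clock.
def purgeable_by_ttl?(purge_after, now: Time.now)
  # A nil purge_after means "keep indefinitely" (or "lifetime is managed
  # by the associated resource").
  !purge_after.nil? && purge_after < now
end
```

This keeps resource_id semantically clean: it only ever holds a real foreign key or NULL, and the age information lives in a column that is actually meant to hold a time.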

chessbyte commented 3 years ago

Sorry for my ignorance, but is there a bug we are trying to fix or is this general code improvement?

Fryguy commented 3 years ago

Actual bug. Since the StateVarHash-based BinaryBlob does not have a resource_id it is subject to purging the moment it is created. If timed incorrectly, things like provisioning will fail in the middle when the BinaryBlob is purged out from underneath it.

See also https://github.com/ManageIQ/manageiq-automation_engine/pull/405 and https://github.com/ManageIQ/manageiq-automation_engine/pull/406

NickLaMuro commented 3 years ago

@chessbyte Sorry, I will add this to the OP, but this is fixing this issue:

https://github.com/ManageIQ/manageiq/issues/21351

and related talk.manageiq.org question/issue.

NickLaMuro commented 3 years ago

Instead of hacking in something specific to one use case, would it perhaps be better to make it more generalized?

So the "hack" is something that could safely be backported as a generalized fix, since #405 is now part of jansa and everything after; a migration wouldn't be something we could easily do for that case. I think the same goes for trying to reference MiqAeWorkspaceRuntime in some fashion, because it would involve a lot of touch points, and going farther back it is possible to hit merge conflicts, adding risk to the backports.

While I 100% agree with this statement:

While I don't really like the hacking in general, I think the only part that actually bothers me with it is that we are storing a value with meaning into the resource_id column (i.e. a negative time value as an integer into something that is supposed to hold a foreign key reference).

this "hack" is an isolated change to a class that hasn't really changed much since it was introduced. I also don't know why #405 was introduced in the first place, since there is zero reference to any external resource in that PR, so in my opinion we could also just revert that change and nothing would be different functionally speaking; we would just have one less bug (based on the information I have around it).

Besides a migration, the other solution is to pass a :resource_type/:resource_id into MiqAeWorkspaceRuntime.new so it can assign them to the StateVarHash object's BinaryBlob. However, MiqAeWorkspaceRuntime.new has multiple points of entry in this repo alone:

and I would be playing whack-a-mole trying to assign a proper :resource_id/:resource_type to it, since it doesn't have a single point of entry I can latch on to (the list above is probably not exhaustive, but I gave up after 5).

Also, MiqAeWorkspaceRuntime might be serializable into the DB, but I don't know where that happens, and it is clearly handled by callers, not by the class itself. So from MiqAeWorkspaceRuntime (which is a PORO, a Plain Old Ruby Object), I have no way of accessing the database record type and id to assign to it... again, without updating a bunch of the method signatures above to pass down the record it is associated with.

The only other idea I have would be to use the user object that is passed into both MiqAeWorkspaceRuntime.instantiate and MiqAeWorkspaceRuntime.instantiate_with_user:

https://github.com/ManageIQ/manageiq-automation_engine/blob/5be34d7779dbd7e968326076c39aacca197fc59f/lib/miq_automation_engine/engine/miq_ae_engine/miq_ae_workspace_runtime.rb#L47-L60

Though, that is pretty vague, and I don't know how often that is actually unique; it could just always be the same user every time.


Point being, I already gave this a crap ton of thought, and deemed it overly complex to go the non-hack route. If we want to move towards a ttl, I would be fine with that if we do this PR as the first pass, and then immediately create a followup that does the ttl approach. That way we have something to backport if needed, but can move forward with a more reasonable approach going forward.

Fryguy commented 3 years ago

@NickLaMuro Thanks for the thought process. I had assumed that the reason for this particular direction was that the runtime and state var hash have no way to access any sort of "parent" object, and thus there's nothing to tie them to, so that checks out.

What I think is the best path forward is to figure out how we want this to be properly modeled (e.g. ttl), then figure out what the backport strategy is and the upgrade path from that backported way to the new way. An upgrade from what you have to something like a ttl should be relatively easy, so that's good.

In thinking about this more, I don't really like the ttl specifically, because then it's on the StateVarHash to figure out a length, but in the end that value is rather arbitrary since the StateVarHash doesn't have context. In reading your commit message, you mentioned created_at columns. Perhaps that's just a better approach overall? Change BinaryBlob to purge orphaned blobs, but only those over, say, a month old?

NickLaMuro commented 3 years ago

Change BinaryBlob to purge orphaned blobs, but only those over, say, a month old?

@Fryguy Yeah, 1.month was what I went with over in https://github.com/ManageIQ/manageiq/pull/21390 :

https://github.com/ManageIQ/manageiq/pull/21390/files#diff-e769bbb8c1ba3711c5403b424ed9c218ffafba7f1890ee394717196f28ff4540R23

Adding a created_at column and then switching this and ManageIQ/manageiq to that is a very easy switch from what I have here, so I think that is doable. Should I throw some follow up PRs together to do that if we are good with that approach?

Fryguy commented 3 years ago

Should I throw some follow up PRs together to do that if we are good with that approach?

Yup! Then we can figure out the backport juggling with this version.

miq-bot commented 2 years ago

Checked commit https://github.com/NickLaMuro/manageiq-automation_engine/commit/cb7056d9f79936d8a9bc9c393863887adb3046a8 with ruby 2.6.3, rubocop 1.13.0, haml-lint 0.35.0, and yamllint.

1 file checked, 0 offenses detected. Everything looks fine. :cookie:

NickLaMuro commented 2 years ago

Looking into the test failures right now...

Update: Failing for me on master, and my guess is it is related to another change in a different repo, so I will do some investigating.

Fryguy commented 2 years ago

Backported to morphy in commit 99001199af42580e4fc1b7a985fc42a276efe590.

commit 99001199af42580e4fc1b7a985fc42a276efe590
Author: Jason Frey <fryguy9@gmail.com>
Date:   Wed Oct 6 12:49:21 2021 -0400

    Merge pull request #478 from NickLaMuro/state-var-hash-purging-improvements

    [StateVarHash] Prevent purging

    (cherry picked from commit ba2c6dacc8a5ace8bac409d70e87d5f211c74d47)