Workflow Migrator doesn't handle broken objects well.

flipmcf commented 2 years ago

Biggish sites with some 'bad objects'.

When update_security_for is called during a workflow migration, and it hits a misbehaving object, the entire WorkflowChainUpdater context exits with an exception, doesn't report the exception, and leaves the user without a workflow update.

ERROR:ftw.upgrade:FAILED Change workflow states (GeneratorExit: ) at '/plone/broken_thing.html'

It's also a bit misleading, because in my case, it's a broken image under 'broken_thing.html' that is actually stopping the upgrade. you have to really dig in to find it. This was my issue underneath, but it could be anything:

obj.reindexObjectSecurity()
*** AttributeError: 'NamedBlobImage' object has no attribute 'reindexObject'

It was hard work and time was used tracking it down.

So, may I suggest we catch exceptions and log them inside helpers.update_security_for so upgrades can proceed to the end?

https://github.com/4teamwork/ftw.upgrade/blob/03bf28729c307ea5b6b14b572dabf4186452e4e5/ftw/upgrade/helpers.py#L31

if changed and reindex_security:
        try:
            obj.reindexObjectSecurity()
        except Exception as err:
            logger.warning(f"Can't update security for {obj.getPhysicalPath}: {err=}, {type(err)=}")

I've monkeypatched it like this in my specific updater for my specific product, - but maybe it's a good idea to send this enhancement upstream?

It still doesn't actually pin-down the real object that caused the problem, but at least we roll on to the end of the upgrade.

Bad news is, we leave a few objects without security updates, which may lead to bigger problems if the administrator doesn't address it manually.


WARNING:my_package upgrades:Can't update security for <bound method
 Traversable.getPhysicalPath of <NewsItem at
 /plone/broken_thing.html>>:
 err=AttributeError("'NamedBlobImage' object has no attribute
 'reindexObject'"), type(err)=<class 'AttributeError'>
`

lukasgraf commented 2 years ago

Thank you for the detailed issue!

I agree with your thinking here. ftw.upgrade is obviously opinionated in how it handles certain things, and thats mainly driven by our day-to-day experiences of upgrading large sites.

It seems to me like you're in the same boat: Large sites, which probably are large because they've been around for a while, and therefore are almost guaranteed to have some cruft and misbehaving objects.

We value correctness and atomicity when doing upgrades, but we don't enjoy working over the weekend because a site upgrade failed at 95% and rolled back a transaction that's been running for 24h ;-) So there's tradeoffs that need to be made.

In some cases a more robust solution with few downsides is easy to find. catalog_unrestricted_get_object() just uncatalogs broken brains if it encounters them for example.

In the case of workflow security updates, skipping an object could have bad consequences though, as you point out. I feel a bit uneasy about doing that by default. Because the problem with logging is that there's a significant chance that nobody is every going to see that - especially with big upgrades for large sites. If a warning just appears in the middle of a stream of thousands of lines, no one's gonna notice.

So a way to mitigate that could be to still skip, but collect errors as they happen and then log them at the very end of the upgrade, with a summary maybe. That way, an attentive user should have a reasonable chance to spot them.

Another option could be to introduce a keyword argument to update_security_for() that allows you to control whether you want it brittle or robust. That could even be used on a case-by-case basis, because when writing that upgrade step you probably know best what the implications of a skipped object are, and whether or not it's security critical that you get them all.

I'll bring this up with my team, since I'd like to hear some other people's takes, but I already agree that there's potential here to handle this a lot better.

BTW, quick note on your logging patch: You're not actually invoking obj.getPhysicalPath in your format string (missing parens), hence the <bound method.... But that would just make the message more concise, but still not lead you to the actual offending object (the image) I think. The strack trace might though.

I haven't tested this, but logger.exception("Can't update security ...") might just do the trick. In addition to the message, this will also log the stack trace of the last exception that was raised (formatted pretty much the same way as a plain old unhandled exception would be).

flipmcf commented 2 years ago

This is so very educational and I appreciate the time.

It's my opinion that most software should be strict by default, and then can be instructed to be permissive. So I think that the keyword argument to 'ignore security errors' is a good path, and defaulting to current behavior, which is to stop the upgrade.

collect errors Totally agree. Even better would be to generate some separate log, like an output file (but not really). In my use case, I am triggering an upgrade step from buildout using collective.recipe.plonesite and that logging is very noisy.

What would be perfect is another add-on that keeps a list of misbehaving, broken objects encountered during any kind of processing, then gives nice big, red warnings in the UI that asks a human to review these objects. Bake that feature into Plone core imho (I'll go write the PLIP).

At first glance, catalog_unrestricted_get_object() may not be a good answer. My specific use case encounters the problem during reindexObjectSecurity() which is Plone core. It's not the current obj that update_security_for is passed, but a child of that object. This makes the logging a bit awkward also, because ftw.upgrade will report that broken_thing.html is a problem, when actually it's broken_thing.html/image that is actually what's causing the upgrade to fail.

This package is incredibly useful and I've always felt comfortable using any addon with the 'ftw' prefix. I love what you folks do and how you approach your Pone sites. It's a pleasure, and honor, and a learning experience to work with you guys. Thank you for making this work open source.

4teamwork / ftw.upgrade

Workflow Migrator doesn't handle broken objects well. #223