control-center / serviced-isvcs

Builds and packages isvcs docker image for use with serviced

Serviced Fails to Commit Containers on ZFS Docker Backend #105

Open sempervictus opened 2 years ago

sempervictus commented 2 years ago

We've found that under heavy load (>100 ZenPack installs), the default XFS/thin-pool setup doesn't hold up - the XFS filesystems end up corrupted beyond repair by the time everything is deployed, or shortly after the system starts working, due to background churn in serviced/docker. We're a big ZFS shop, Docker supports ZFS, and when we use the ZFS storage driver for Docker under serviced, the corruption described above goes away. However, we're seeing "failed to commit" messages coming from serviced, even though the changes show up as intended across reboots and such. Not sure what sort of terrifying implications this has for long-term use, so I figured someone here might know whether this is effectively a no-op message, or whether something needs to be done in the sources to accommodate the improved FS underneath it (I'm not big on Go - I can read/write it but try to avoid it when possible; Rust is the future :-p).
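
For reference, the Docker side of that switch is just the storage driver selection; here's a minimal sketch, assuming a dedicated ZFS dataset for /var/lib/docker (the dataset name is illustrative, and serviced's own thin-pool settings are a separate concern):

```sh
# Back /var/lib/docker with a ZFS dataset and select the zfs storage driver.
# Stop docker and relocate any existing /var/lib/docker contents first.
zfs create -o mountpoint=/var/lib/docker tank/docker

cat > /etc/docker/daemon.json <<'EOF'
{
  "storage-driver": "zfs"
}
EOF

systemctl restart docker
docker info | grep -i 'storage driver'   # should now report: zfs
```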

In a potentially related concern to the XFS corruption, I think serviced might be doing something untoward when copying data. I've rebuilt 6.3 roughly 30 times in the last week or so (hours at a time with all those ZenPacks - devops and all), and I'm finding relatively common crashes in shared libraries loaded by py2.7. I'd say "bad images", except it's different libs on each install. Makes me wonder whether sync IO isn't being respected, or something of that nature.

dougsyer commented 2 years ago

I pretty frequently batch-install many ZenPacks and it typically works, but I tend to start with things like multirealm, thresholds, zenpacklib, library packs like WSMAN, Layer2, and custom-roles ZenPacks first, then do the big ones like vSphere, the storage ZenPacks, Windows, Cisco - essentially working from the core out until we're doing the point Zenoss ZenPacks and our own ZenPacks that don't have as many cross-integrations.

There are some packs that don't play nicely with others, like the Lync ZenPack, which we don't install anymore. That's usually due to a mistake in how relations are handled in the ZP.

It's also possible - and I've seen it - for a ZenPack to install when it should have failed, because the installer doesn't check whether Zope and Python can import it after the install but before the commit. It's also hard to know whether a ZenPack failed like that, because you end up screen-scraping for tracebacks on the next install rather than the current one.
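
A quick manual import check between the install and the commit can catch that case; a minimal sketch, with the pack name ZenPacks.example.Foo as a placeholder:

```sh
# Attach to the running zope container (interactive), then check the import
# as the zenoss user before serviced commits the container.
serviced service attach zope

# inside the container:
su - zenoss -c "python -c 'import ZenPacks.example.Foo'" \
  && echo "import OK" \
  || echo "import FAILED - do not commit yet"
```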

I haven't seen corruption during installs, but we have seen issues with MariaDB if the NVMe driver resets due to a timeout. We have some big volumes (several TB).

I do see freeze problems, especially lately when things are really busy. It appears to be some kind of storage-related high-wait-time issue, and it can lead to problems on virtual machines with the clock shooting forward and triggering timeouts on NVMe. I don't see that during ZenPack installs, but it can happen on backups, especially incrementals, lately.

We use the standard filesystem setup, though.

dougsyer commented 2 years ago

And to be fair to Zenoss, we have had some vendor-related storage issues lately, but MariaDB on DFS/NFS isn't super resilient, although the snapshot restores work like a charm.

sempervictus commented 2 years ago

MariaDB (and any transactional DB) was one of the major considerations in the design of ZFS from the get-go - it's a transactional filesystem with atomic semantics, so it's pretty hard to get half-writes from a DB into a snapshot (even an O_DIRECT WAL is written out through the ZFS SLOG, so there's a built-in analog to that). Our Zenoss 4 instances have been up for ages like this, have (logically) massive databases, and still have great performance, since the compression ratio is around 10:1 in some cases, providing a corresponding 10:1 IO reduction for DB scans and the like.
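
For what it's worth, the usual ZFS layout for MariaDB/InnoDB is along these lines; this is generic ZFS tuning rather than anything serviced-specific, and the dataset names are illustrative:

```sh
# Commonly recommended starting points for InnoDB on ZFS (dataset names are
# illustrative; adjust to taste).
zfs create -o recordsize=16k  -o logbias=throughput tank/zenoss/mariadb-data   # match 16k InnoDB pages
zfs create -o recordsize=128k -o logbias=latency    tank/zenoss/mariadb-log    # redo/WAL goes through the SLOG
zfs set compression=lz4 tank/zenoss                                            # cheap compression; source of the IO reduction above
```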

As far as ZenPacks go - we have an in-house Chef cookbook which deploys everything for us (ported from our Zenoss 3-4 days to the new paradigm), and there's a vetted ZenPack set (~100) that we know work together pretty well. Same approach to deployment: dependencies first, then the actual device/platform-specific packs which rely on the first set. The cookbook actually builds any source-based ZenPack into an egg inside the Docker container for that process prior to installation, as we've found that post-v5, Zenoss does not play well with later changes to linked ZenPacks (whereas eggs are easy enough to manage or remove/replace if needed).
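
The egg build itself is just standard setuptools, since ZenPacks are Python eggs; a minimal sketch, with the pack name and source path as placeholders:

```sh
# Build a source ZenPack into an egg inside the zope container before
# installing it (pack name and path are illustrative).
serviced service attach zope

# inside the container, as the zenoss user:
su - zenoss
cd /tmp/ZenPacks.example.Foo          # checked-out ZenPack source
python setup.py bdist_egg             # produces dist/ZenPacks.example.Foo-*.egg
```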

The 100-ZenPack thing becomes a problem, by the way - if you do it with per-pack installation, you hit the Docker limit on snapshots for the container and you can't make any more changes to the container at all (even removing packs), as snapshots won't "stick." So we use the batch installation method now, where Chef generates the relevant autoinstall file and places all the eggs it built into an appropriate directory in the chroot, in order to avoid that whole snapshot-limit fiasco.
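
A quick way to see how close an image is to that layer/snapshot ceiling (historically around 125-127 layers, depending on the storage driver); the image name is illustrative:

```sh
# Count the layers in the serviced-managed image (image name is illustrative).
docker history --no-trunc zenoss/resmgr_6.3:latest | wc -l
docker inspect --format '{{len .RootFS.Layers}}' zenoss/resmgr_6.3:latest
```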

dougsyer commented 2 years ago

Yeah, I have run into that wall before with the snapshots, plus the hotfixes we apply, and we have a procedure from their support to consolidate the layers that we have had to use.
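
For reference, the generic way to flatten a container's accumulated layers is an export/import round trip; this is not necessarily the vendor-supported procedure, and the names are illustrative:

```sh
# Flatten a container's layers into a single-layer image (this drops image
# metadata such as ENV and CMD; container and image names are illustrative).
docker export zope-container | docker import - zenoss/zope-flattened:latest
```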

dougsyer commented 2 years ago

Yeah, as I'm sure you found out, the zenpack command on the zope service doesn't launch Docker with the correct config/command line to mount the linked ZenPacks directory, so any ZenPack action that tries to load those custom classes (like a buildRelations or catalog rescan) will cause a spectacular failure, and sometimes it has painful consequences.