apache / solr-operator

Official Kubernetes operator for Apache Solr
https://solr.apache.org/operator
Apache License 2.0
243 stars 111 forks source link

Slow restart of Solr Pod init_container cp-solr-xml chown #537

Closed smoldenhauer-ish closed 1 year ago

smoldenhauer-ish commented 1 year ago

Solr Operator 0.6.0 In our deployments a restart of Solr Pods will take minutes up to half an hour. The init container cp-solr-xml with the command cp /tmp/solr.xml /tmp-config/solr.xml && chown -R 8983:8983 /var/solr/data/backup-restore/local-collection-backups-1 esp. the recursive chown seems to be the cause. We mitigated it a bit by cleaning up the backup volume regularly. But there are deployments with a larger number of collections (~400) and then it is still slow. ... The backup repository is a azurefile-csi storage.

mailing list discussion: https://lists.apache.org/thread/tlxg38v6y81dhykycrgqw77mm2xdoqbr

smoldenhauer-ish commented 1 year ago

basic idea is to skip the chown if the backup directory is already writable for the solr user.

HoustonPutman commented 1 year ago

Yeah, sounds like a good improvement! Looks like you are testing out some options, so make a PR when you are ready.

smoldenhauer-ish commented 1 year ago

The changed statement is working now. I'm currently waiting for a test with forked custom image (0.6.0 + fix) in production deployment.

HoustonPutman commented 1 year ago

Awesome, make the PR when your tests are done! We will probably start the process for the v0.7.0 release late next week, and it would be great to have this included.

HoustonPutman commented 1 year ago

TODO: We should think about just using the fsGroupChangePolicy available in Kubernetes v1.23

smoldenhauer-ish commented 1 year ago

I saw that fsGroupChangePolicy = onRootMismatch , but wasn't sure how to apply that to the Solr backup PVC It was kind of inspiration for the bugfix. If that would work, I guess the cp-solr-xml will be back to only copy the solr.xml

HoustonPutman commented 1 year ago

Yeah reading more through it, it seems like the default functionality is "Always", so its weird that it wouldn't be working in the first place. I guess we need to keep the logic in regardless.