Assets are already part of the image and symlinked inside the image. They will be empty in the volume; they'll have resources in the containers because they are symlinked from a location inside the container.
Can you try a curl request with the Host header set to a site name?
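For example, something along these lines from a machine that can reach the service (the site name and NodePort here are placeholders, not values from your setup):
curl -sS -H "Host: erp.example.com" http://<node-ip>:<node-port>/api/method/ping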
All my setups are working, and even the tests are working. Can you create a failing test somewhere, or share access to a failing setup?
Sorry for my late answer; I wrote this before going to bed.
All my setups are working, and even the tests are working. Can you create a failing test somewhere, or share access to a failing setup?
I'm pretty sure this is somehow related to my system configuration. If all else fails I think I could hand you a snapshot of the VM the cluster runs on.
On that note, I forgot to mention that I'm using CRI-O instead of containerd for my container runtime. After checking the volume definitions in the backing Dockerfile, that could definitely be an issue (I'm not knowledgeable enough on that matter, though; it's really hard to find specific incompatibilities between Docker and the CRI interface).
Assets are already part of the image and symlinked inside the image. They will be empty in the volume; they'll have resources in the containers because they are symlinked from a location inside the container.
I assumed this is how it's supposed to work. However, as you can see in pt. 4, the assets/ folder is simply not present after setting up.
I've since had a look at the frappe_docker code and the container definition - I assume bench init populates the assets directory, which is held by an internal volume in the container?
I'll try comparing with a local minikube/docker installation and fiddle with cri-o a little bit. Depending on what I find I might throw a PR your direction to make note of any incompatibilities in the installation instructions.
I'm using CRI-O instead of containerd
Recently someone mentioned this offline. They were successful in running this helm chart on a self-hosted, CRI-O-based cluster. Their storage classes were something custom that allowed RWX; I didn't ask what they were.
Can you try an in-cluster NFS server using nfs-ganesha-server-and-external-provisioner, like the tests do?
I managed to get this NFS server configuration running to use with nfs-subdir-external-provisioner, setup details here: https://github.com/frappe/frappe/wiki/Setup-NFS-Server
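Roughly what that looks like (the namespace, release name, and size here are illustrative, not prescriptive):
helm repo add nfs-ganesha-server-and-external-provisioner https://kubernetes-sigs.github.io/nfs-ganesha-server-and-external-provisioner/
helm install -n nfs --create-namespace in-cluster nfs-ganesha-server-and-external-provisioner/nfs-server-provisioner --set persistence.enabled=true --set persistence.size=8Gi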
Another question: did the configure job succeed without any errors in the pod logs? Look for volume permission or any other errors. Is there a .build file under the root of the sites volume? touch sites/.build to create it.
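A quick way to check from any pod that mounts the sites volume (the pod name is just a placeholder):
kubectl -n erpnext exec -it <erpnext-worker-pod> -- ls -la /home/frappe/frappe-bench/sites/.build
kubectl -n erpnext exec -it <erpnext-worker-pod> -- touch /home/frappe/frappe-bench/sites/.build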
4. In fact, there is no assets folder at all!
This is where I feel things go wrong. The assets directory exists in the container as well as in the volume. The only difference is that inside the container it's populated with symlinked assets, while inside the volume it is present but empty.
Can you try an in-cluster NFS server using nfs-ganesha-server-and-external-provisioner, like the tests do?
I can try, but my setup for the external nfs is almost identical and the mounts themselves look proper, so I don't think it's an issue with the storage driver itself.
Did the configure job succeed without any errors in the pod logs?
Yes, the configure job succeeds.
During my own testing I was able to trim it down to a much smaller test-case:
apiVersion: v1
kind: Pod
metadata:
  name: frappe-playground
spec:
  containers:
    - name: frappe
      image: frappe/erpnext
      ports:
        - name: http
          containerPort: 8080
          protocol: TCP
Checking this container, the filesystem structure looks proper:
kubectl apply -n erpnext -f frappe-minimal.yaml
pod/frappe-playground created
kubectl -n erpnext exec -it frappe-playground -- /bin/bash
frappe@frappe-playground:~/frappe-bench$ ls
apps config env logs patches.txt sites
frappe@frappe-playground:~/frappe-bench$ ls sites/
apps.json apps.txt assets common_site_config.json
frappe@frappe-playground:~/frappe-bench$ ls sites/assets/
assets-rtl.json assets.json css erpnext frappe js
frappe@frappe-playground:~/frappe-bench$
However, if I add a simple volume mount, the data is gone, seemingly overridden by the mounted volume.
apiVersion: v1
kind: Pod
metadata:
  name: frappe-playground
spec:
  containers:
    - name: frappe
      image: frappe/erpnext
      ports:
        - name: http
          containerPort: 8080
          protocol: TCP
      volumeMounts:
        - name: sites-dir
          mountPath: /home/frappe/frappe-bench/sites
  volumes:
    - name: sites-dir
      emptyDir: {}
kubectl apply -n erpnext -f frappe-minimal.yaml
pod/frappe-playground created
kubectl -n erpnext exec -it frappe-playground -- /bin/bash
frappe@frappe-playground:~/frappe-bench$ ls sites/
frappe@frappe-playground:~/frappe-bench$
Note that I'm just using emptyDir here as the storage backend, so NFS shouldn't be the issue.
The sites volume becoming empty as soon as you mount something over it is expected.
What's not expected is the assets becoming empty. assets is a different volume, and since no volume driver creates it, the container runtime should create and use an unnamed volume with the assets available in it. (Assumption.)
This nesting of volumes creates problems; that's why there are two mounts. The second mount ensures the assets directory is separate from the parent sites mount.
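To illustrate the nesting, in Dockerfile terms it is roughly this (a sketch of the idea, not a verbatim quote of the frappe_docker Dockerfile):
# sites is one image volume, sites/assets a separate one, so a mount over
# sites/ does not (on Docker/containerd) hide the assets baked into the image
VOLUME [ "/home/frappe/frappe-bench/sites", "/home/frappe/frappe-bench/sites/assets" ]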
I realize that's what is intended, but CRI-O does not handle image volumes the way Docker/containerd do.
It's not all that well documented, but from what I've gathered, CRI-O has an image_volumes configuration option under [crio.image] that controls how image volumes are handled:
- mkdir: A directory is created inside the container root filesystem for the volumes.
- bind: A directory is created inside container state directory and bind mounted into the container for the volumes.
- ignore: All volumes are just ignored and no action is taken. (default: mkdir)
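For anyone following along, this is a host-level CRI-O setting; a minimal sketch (the drop-in file name is an assumption, editing crio.conf directly works too), followed by restarting the crio service:
# /etc/crio/crio.conf.d/01-image-volumes.conf
[crio.image]
image_volumes = "bind"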
I have tried setting image_volumes="bind", which restores the intended volume layout... kind of:
kube@kube-control:~$ kubectl apply -f frappe-minimal.yaml -n erpnext
pod/frappe-playground created
kube@kube-control:~$ kubectl exec -n erpnext -it frappe-playground -- /bin/bash
frappe@frappe-playground:~/frappe-bench$ ls
apps config env logs patches.txt sites
frappe@frappe-playground:~/frappe-bench$ ls sites/
assets
frappe@frappe-playground:~/frappe-bench$ ls -lah sites/
total 8.0K
drwxr-xr-x 3 root root 60 Jul 16 09:01 .
drwxr-xr-x 1 frappe frappe 4.0K Jul 16 00:10 ..
drwxr-xr-x 2 root root 40 Jul 16 09:01 assets
Big caveat here: the volumes are mounted as root, so right now I don't have any write access to the mounted volumes... Sure, I could run the container as root, but I'd like to avoid that.
One common solution to the volume-mounting problem I've found so far is that people just use an init container to copy the existing files into the volume (while mounting the volume on a different folder, just for the init step).
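A rough sketch of that approach, with the volume mounted at an alternate path so the image's own sites/assets stays visible during the copy (names and paths here are illustrative, not taken from the chart):
initContainers:
  - name: populate-assets
    image: frappe/erpnext
    command: ["bash", "-c"]
    args:
      - |
        mkdir -p /mnt/sites/assets
        cp -a /home/frappe/frappe-bench/sites/assets/. /mnt/sites/assets/
    volumeMounts:
      - name: sites-dir
        mountPath: /mnt/sites   # deliberately not mounted over sites/ for this step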
the volumes are mounted as root
It doesn't matter if it's assets; you don't need write access to it.
For sites and logs there is an initContainer to fix the volume; it chowns files to 1000:1000.
Found a related blog article: https://medium.com/cri-o/cri-o-configurable-image-volume-support-dda7b54f4bda
image_volumes="bind"
Can you try "mkdir" to see if it makes any difference? We just need it for the symlink.
Can you try "mkdir" to see if it makes any difference? We just need it for the symlink.
"mkdir" is the default and behaves as described previously. (I've had it explicitly set to mkdir for a while now, since trying the alternative.)
For sites and logs there is an initContainer to fix the volume; it chowns files to 1000:1000.
This failed for me (with "operation not permitted", as far as I remember) when I deployed the chart with image_volumes="bind". I'll run it again in a bit and give you the full logs.
I'll run it again in a bit and give you the full logs.
OK, it didn't fail, it just... had no apparent effect: the owner of the volume is still root, and subsequent calls in the config job failed with "permission denied". There was also no log output from the init container.
I would think that this could be due to the storage driver not permitting it, except that previously the permission was set without an issue and I didn't change anything related to the NFS server or its storage class.
My guess is that the overlayfs in between (due to the bind mount) is getting in the way this time, but I need to hook into the running container to check that.
Will whip something up in the afternoon. Will probably also try with an in-cluster NFS like you suggested previously.
OK, those are... interesting results. I put some sleeps into the config job containers (both the init and the configure container) to hook into them and check the FS ownership during execution. Here are the results:
The init container, after the chown:
❯ kubectl exec -n erpnext --stdin --tty frappe-bench-erpnext-conf-bench-20230716155227-bmzws -c frappe-bench-ownership -- /bin/bash
root@frappe-bench-erpnext-conf-bench-20230716155227-bmzws:/home/frappe/frappe-bench# ls -lah
total 24K
drwxr-xr-x 7 frappe frappe 4.0K Jul 10 14:12 .
drwxr-xr-x 1 frappe frappe 4.0K Jul 10 14:15 ..
drwxr-xr-x 4 frappe frappe 4.0K Jul 10 14:14 apps
drwxr-xr-x 3 frappe frappe 4.0K Jul 10 14:12 config
drwxr-xr-x 6 frappe frappe 4.0K Jul 10 14:13 env
drwxr-xr-x 2 frappe frappe 40 Jul 16 13:52 logs
-rw-r--r-- 1 frappe frappe 346 Jul 10 14:12 patches.txt
drwxr-xr-x 3 frappe frappe 60 Jul 16 13:52 sites
The configure container, before the first actual line of code:
❯ kubectl exec -n erpnext --stdin --tty frappe-bench-erpnext-conf-bench-20230716155227-bmzws -- /bin/bash
Defaulted container "configure" out of: configure, frappe-bench-ownership (init)
frappe@frappe-bench-erpnext-conf-bench-20230716155227-bmzws:~/frappe-bench$ ls -lah
total 24K
drwxr-xr-x 7 frappe frappe 4.0K Jul 10 14:12 .
drwxr-xr-x 1 frappe frappe 4.0K Jul 10 14:15 ..
drwxr-xr-x 4 frappe frappe 4.0K Jul 10 14:14 apps
drwxr-xr-x 3 frappe frappe 4.0K Jul 10 14:12 config
drwxr-xr-x 6 frappe frappe 4.0K Jul 10 14:13 env
drwxr-xr-x 2 root root 40 Jul 16 13:53 logs
-rw-r--r-- 1 frappe frappe 346 Jul 10 14:12 patches.txt
drwxr-xr-x 3 root root 60 Jul 16 13:53 sites
Now to the interesting part.
I also tried to produce a stripped-down version of the problem:
kind: Pod
apiVersion: v1
metadata:
  name: volume-editor
spec:
  volumes:
    - name: sites-dir
      persistentVolumeClaim:
        claimName: frappe-bench-erpnext
  initContainers:
    - name: frappe-bench-ownership
      image: frappe/erpnext
      command: ['sh', '-c']
      args:
        - chown -R 1000:1000 /data
      securityContext:
        runAsUser: 0
      volumeMounts:
        - name: sites-dir
          mountPath: /data
  containers:
    - name: sleeper
      image: frappe/erpnext
      command: ['sleep', 'infinity']
      volumeMounts:
        - name: sites-dir
          mountPath: /data
Which actually sets the permissions correctly!
kube@kube-control:~$ kubectl exec -it -n erpnext volume-editor -- /bin/sh
Defaulted container "sleeper" out of: sleeper, frappe-bench-ownership (init)
$ bash
frappe@volume-editor:~/frappe-bench$ ls /data
frappe@volume-editor:~/frappe-bench$ ls -lah /data/
total 8.0K
drwxrwxrwx 2 frappe frappe 4.0K Jul 16 09:27 .
dr-xr-xr-x 1 root root 4.0K Jul 16 12:26 ..
No idea what's going on here.
I thought about this issue in general, and while I enjoy tinkering with it, I think a solution that depends on a system setting (image_volumes="bind" in CRI-O is a system-wide setting) is less than ideal.
As for myself, I will likely either switch to using containerd or add an additional init container to the config job that will copy over the initial data into the mounted volume.
As for the latter idea, if that works out, would you accept a PR integrating that into the config job template, behind a flag parameter? (I suppose a separate job would be fine too.)
I have a proposal: we have an entrypoint script for nginx anyway. We can check whether assets exists, or create the directory and the symlinks.
Here: https://github.com/frappe/frappe_docker/blob/main/resources/nginx-entrypoint.sh
Can you try it in a custom image? If it works, we'll make the change.
Edit: other containers may also need assets (rendering PDFs with CSS, sending email, rendering Jinja2 templates). We can make it into an entrypoint script that can be optionally overridden for such directory creation and symlinking.
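Something like this is what I mean, as a rough sketch only (the glob assumes the usual apps/<app>/<app>/public layout; the real change would go into nginx-entrypoint.sh):
# create the assets dir if a mounted volume hides the one from the image
[ -d sites/assets ] || mkdir -p sites/assets
# re-create the per-app symlinks
for public in apps/*/*/public; do
  app=$(basename "$(dirname "$public")")
  ln -sfn "$PWD/$public" "sites/assets/$app"
done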
FYI, I've basically run out of time work-wise and simply switched over to using containerd. Given the problems I've had here and the generally poor documentation of CRI-O, this seems like the more reasonable solution to me for the time being.
With all other pieces of my setup staying the same, ERPNext now works without an issue :)
As for your suggestion:
First of all, you're the maintainer here, so you don't have to make any proposals to me :sweat_smile:. That being said, if you want my opinion on the matter, I wouldn't try to solve a Kubernetes issue by including a workaround in the container, especially when a reasonable solution like an init job is perfectly workable.
That being said, if you want to tackle this at the container level I would go all the way and restructure the file system layout, such that the assets folder is no longer a subdirectory of a volume. NGINX specifically shouldn't have a problem with that, since you have a specific routing rule for assets already, but obviously I can't speak to other containers and their dependency on the assets folder.
First of all, you're the maintainer here, so you don't have to make any proposals to me . That being said, if you want my opinion on the matter, I wouldn't try to solve a Kubernetes issue by including a workaround in the container, especially when a reasonable solution like an init job is perfectly workable.
Okay! I'll leave it as it is right now.
That being said, if you want to tackle this at the container level I would go all the way and restructure the file system layout, such that the assets folder is no longer a subdirectory of a volume. NGINX specifically shouldn't have a problem with that, since you have a specific routing rule for assets already, but obviously I can't speak to other containers and their dependency on the assets folder.
Yes! I'd prefer that too.
I think the bench command and the Frappe framework assume the directory structure of sites and sites/assets.
We can manage the directories, routes, and nginx config in containers. The framework will still need the above structure.
All the custom apps also follow this structure, enforced by the framework.
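Roughly, the layout the framework expects (simplified; most files omitted):
frappe-bench/
├── apps/                 # app source code, each app with its public/ assets
└── sites/
    ├── common_site_config.json
    ├── assets/           # built/symlinked assets, served under /assets
    │   ├── frappe/
    │   └── erpnext/
    └── <site-name>/      # one directory per site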
To summarize
image_volumes="bind"
Faced this recently.
Setup consists of a NAS-backed RWX storage class and CRI-O with image_volumes=mkdir (default).
Added ENTRYPOINT ["entrypoint.sh"] in the image. Script as follows:
#!/bin/bash
# Create assets directory if not found
[ -d "${PWD}/sites/assets" ] || mkdir -p "${PWD}/sites/assets"
# Copy assets*.json from image to assets volume if updated
cp -uf /opt/frappe/assets/*.json "${PWD}/sites/assets/" 2>/dev/null
# Symlink public directories of app(s) to assets
find apps -type d -name public | while read -r line; do
app_name=$(echo "${line}" | awk -F / '{print $3}')
assets_source=${PWD}/${line}
assets_dest=${PWD}/sites/assets/${app_name}
ln -sf "${assets_source}" "${assets_dest}";
done
exec "$@"
I was having the same journey on a cluster with cri-o. Things I tried:
- mkdir /home/frappe/frappe-bench/sites/assets in the frappe-bench-ownership init container
- bench update in the gunicorn pod
- entrypoint.sh - no effect
- bench update in the nginx pod
So there still seems to be an issue with the first round of assets generation. And I am wondering what happens to the app assets after the next upgrade.
Changing the setting image_volumes="bind" in cri-o is not an option, as this is a system-wide setting and would probably affect several other deployments on the same cluster.
I solved it by adding bench build to the nginx deployment, so it runs every time the pod starts:
lifecycle:
  postStart:
    exec:
      command: ["/bin/sh", "-c", "bench build"]
(I use plain kubectl manifests, which I derived from the helm chart.)
TL;DR of this thread
- CRI-O handles image volumes (the ones defined in the Dockerfile!) differently from containerd by default, which leads to the app assets not being properly mounted on top of the sites volume.
- You can configure CRI-O to use bind-mounting (almost like containerd does) by setting image_volumes = "bind" in /etc/crio/crio.conf (instead of the default mkdir). You might still have permission problems because the mounted volume cannot properly be reassigned ownership, but I haven't tested that enough to know whether this is the storage engine's or CRI-O's fault. YMMV.
- Some people instead work around this issue by copying the existing files from the image into the volume in an initContainer (by mounting the volume somewhere else just for that execution). I haven't gone down that route though, so again, YMMV.
Description of the issue
I have been deploying this helm chart according to the installation instructions at https://github.com/frappe/helm/blob/main/erpnext/README.md and I've had a pretty rocky experience. I'm not entirely sure that I didn't overlook something very obvious, however after multiple skims through the instructions, the chart and googling for people having similar issues I have no idea what I could have missed. As such, this is part bug report and part request for guidance.
TL;DR: After following the installation instructions for deploying the chart and adding a site, I am greeted with an internal server error when trying to access it through the browser (step-by-step walkthrough of my experience below). I've had to resort to kubectl exec -it ... into the running nginx pod to get the site to a working state, which I am fairly certain is not how it's supposed to work.
Context information (for bug reports)
For what it's worth, I've deployed the helm chart to a fresh, in-house single-node kubernetes cluster with CRI-O as container runtime. It's a fresh installation on top of debian 12, using calico for networking and a way too permissive (think chmod 777) NFS for persistent volumes via https://github.com/kubernetes-sigs/nfs-subdir-external-provisioner.
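For reference, the provisioner was installed roughly like this (exact flags may have differed; the server address and export path are placeholders):
helm repo add nfs-subdir-external-provisioner https://kubernetes-sigs.github.io/nfs-subdir-external-provisioner/
helm install nfs-subdir-external-provisioner nfs-subdir-external-provisioner/nfs-subdir-external-provisioner --set nfs.server=<nfs-server-ip> --set nfs.path=/exported/path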
Steps to reproduce the issue
custom-values.yaml
NOTES: NodePort mapping for simplicity, since we have an existing reverse proxy and I haven't reconciled that with using an ingress. I've checked that the requests are properly forwarded to where they should be, so I doubt this is an issue regarding the troubles I've had.
create-job-custom-values.yaml
erp.example.com:
This is where it gets dicey. Trying to access the new site only results in an internal server error with the following logs:
The error comes from loading the assets.json file somewhere around here: https://github.com/frappe/frappe/blob/fefd9ac2e2190d37d3669390a2d6285506a2646c/frappe/utils/__init__.py#L964C1-L985 However, the file it's trying to load here doesn't exist at this point in time on my volume. In fact, there is no assets folder at all!
If the assets don't exist, might as well try and make them so. So I exec into one of the worker pods to manually run some bench commands.
I realize this is certainly not the right approach, but I'm tinkering here, so bear with me. I just want to get to a workable state so I can track back what I'm missing afterwards.
Doing that creates the assets folder and the missing assets.json inside.
However, while I can now load the page, the included CSS still can't be fetched, resulting in 404s for those resources.
The hash matches. However, when checking the nginx container, the same path actually resolves to different files!
After some flailing I finally find out why that is: The assets from the individual apps are actually symlinked in from outside the volume!
So, after running bench build --force in the nginx pod specifically, I finally have a working page as a result.
Now, reading the bench CLI docs would have certainly saved me some headache there, since it has an option to exactly NOT do that (marked as deprecated, though), but all of that does make me wonder:
Expected result
How is this supposed to work exactly? Not only have I found no other issues or Google results touching on the troubles I've had, but also, looking at the helm chart, I don't really understand how the assets are supposed to be put into their proper place at all.
Since the documentation doesn't mention anything regarding having to take care of app assets, I would expect that after adding the site it just works™, but for now I can't see how it would work.
What exactly have I missed here?