Subset of Containers Fail to Mount When Using "mount all"

michaelschmit commented 7 months ago

Which version of blobfuse was used?

blobfuse2-2.2.1-1.x86_64

Which OS distribution and version are you using?

RHEL 8.9

If relevant, please share your mount command.

su -l [user] -c "blobfuse2 mount all /mount/blobfuse --config-file=/mount/config.yaml"

What was the issue encountered?

A handful of containers of the ~130 containers fail to mount with an empty "ERROR: " message. Also, the permissions on those particular directories are different than the others mounted. It will consistently fail to mount the same number of containers, but won't be the same containers each time.

Have you found a mitigation/solution?

No

Please share logs if available.

Please ask for particular logs if needed. /var/log/blobfuse2.log doesn't seem to contain anything useful and the error message is empty as indicated above.

michaelschmit commented 7 months ago

Looks like in the .blobfuse2 directory for the user, the containers that fail do not get a pid in the mount*.pid files

vibhansa-msft commented 7 months ago

Hi @michaelschmit thanks for reaching out to Blobfuse team.

By default blobfuse uses syslog. Unless you have installed syslog filters for blobfuse logs will be redirected to /var/log/syslog. to use /var/log/blobfuse2.log you will need to install our syslog filters.
~/.blobfuse2 is a directory we use for our housekeeping to lock the pids and keep track on what all directories are already mounted. User mount blobfuse shall have the required permissions to make this work. If no, you can change the working directory as well through a cli parameter.
For any container that fails to mount, blobfuse instance will not run and hence you will see only successful mounts with .pid files.
As you are mounting multiple containers here each shall have a unique mount path and a local cache directory. Kindly confirm the same and ensure this is followed.

Is there any particular reason to mount all 130 containers ? As this will cause 130 instances of blobfuse running on the same VM/Node and may create some sort of resource crunch on the system.

michaelschmit commented 7 months ago

I'll answer your bullets and then add another comment about the debugging I have done.

Yes, I see blobfuse2 uses syslog. However, I believe that is only the blobfuse2 process that syncs the mount. I believe the initial mount operations only output to stdout/stderr. Please correct me if I am wrong.
Yes, I am seeing the config.yaml and .pid files in .blobfuse2 directory. I was just noting that for the containers that fail to mount, the config.yaml are equivalent to other working containers, but the .pid file for the ones that fail are empty. The ones that succeed have a pid populated. This seems to indicate that perhaps the pipeline to create a new process is failing.
The ones that fail to mount are getting a *.pid, but are empty.
Yes, we have a separate mount and cache directories. We are using the "mount all" command so the empty /mount directory gets created with directories for all the containers and the empty /data/cache directory gets created with directories for all the containers as well.
Our storage account is already setup to utilize a bunch of containers. With the research I have done, I don't see a limit to the number of mount points for an Azure VM or limits in blobfuse that would prevent this from working. If there was a way to use blobfuse to mount all the containers at a storage account level, that would be better, but I'm not aware that option exists.

michaelschmit commented 7 months ago

Here is more information about what I am seeing. When I kick off the "blobfuse2 mount all" operation the first ~126 succeed without issue but the last 5-6 fail with:

Failed to mount container xxxxxxxxx : Error:

This occurs whether or not I am mounting using a "su -l [user]" or "sudo -u [user]" or just running as the current user. With that said, I am going to change the title of this issue to reflect that.

Like I mentioned, when I diff the _config.yaml files the only differences are the container names. For the container that failed to mount, I get a .pid file but it is empty. If I manually try to mount a failed container with "blobfuse2 mount /mount/xxxxxxx --config-file=[home dir]/.blobfuse2/config_*.yaml it still fails. With the same empty error.

What I discovered just now, is if I unmount a previous mounted container with "blobfuse2 unmount /mount/xxxxxx", I am able to mount a previous failed container. This likely indicates some threshold issue, where removing one then allows another to succeed. I am going to do some more research to see if I am reaching a mount threshold in the Azure VM or something within blobfuse.

Another interesting data point is that on a previous instance where I was debugging with slightly less mounted containers. (this time ~128 vs ~131), 2 containers failed to mount but I was able to eventually get them to mount individually over many attempts.

michaelschmit commented 7 months ago

Since the error returned is empty, I have been trying to figure out where the failure is occurring. I am looking at the individual mount command (mountCmd) since it happens on subsequent single mount commands as well. I think this helps narrow the scope a little. I don't end up seeing the critical message "Starting Blobfuse2 Mount", so I have to assume for now it is not getting to that point. I have experimented with trying to disable monitoring, just in case that had something to do with it, using the config:

health_monitor:
    enable-monitoring: false

But perhaps that is disabled by default.

michaelschmit commented 7 months ago

OK, I see that I was wrong about the stdout/stderr statement. Here are the syslogs during the mount operation:

Mar 12 18:28:50 [host] blobfuse2[526721]: LOG_INFO [mount.go (415)]: mount: Mounting blobfuse2 on /mount/xxxxxx
Mar 12 18:28:50 [host] blobfuse2[526721]: LOG_DEBUG [mount.go (453)]: mount: foreground disabled, child = false
Mar 12 18:28:50 [host] blobfuse2[526721]: LOG_INFO [mount.go (471)]: mount: Child [526732] terminated from /mount/xxxxxx

michaelschmit commented 7 months ago

Had another instance where some of the mounts failed during the initial mount all, but then I was able to get one of them individually mounted. The rest failed no matter how many attempts were made.

michaelschmit commented 7 months ago

The difference I see between a failed mount and a successful one is the log line:

Failed: LOG_DEBUG [mount.go (453)]: mount: foreground disabled, child = false

Success: LOG_DEBUG [mount.go (453)]: mount: foreground disabled, child = true

michaelschmit commented 7 months ago

Seems like I may be running into a limitation in the go-daemon that is being imported. I've looked at ulimit -a but it looks like max user processes is 62840, so I am probably not running into that.

vibhansa-msft commented 7 months ago

Thanks for providing the detailed info.

For the containers that fail to mount .pid file will not be created as this file actually represents the process which has mounted the container and mount has already failed. For success mounts this file is used to identify which directory is mounted.
Can you enable log debug mode through your config file or cli so that we get more details into why some mount fails.
If observation is that total number of mounts are getting limited no matter in what order you mount then your analysis might be correct that we are running into some sort of golang/system level limitations in terms of daemons.
There is another way to move away from teh daemon mode where you can run blobfuse just like a regular foreground application but this means your shell will be hung up till the unmount happens so you need to fire each mount command in an independent shell. Unmount can be done through another shell.
Let me know if this works as a potential workaround for now.
I had a feature in my todo list where we allow to mount entire account instead of mounting each container with a seperate daemon but never saw any customer demanding that feature so its still in the todo list only :)

michaelschmit commented 7 months ago

Can you enable log debug mode through your config file or cli so that we get more details into why some mount fails.

The log lines I posted above is with this setting in the config_*.yaml:

logging:
    level: log_debug
    type: syslog

The only debug line that provides any clue is the LOG_DEBUG [mount.go (453)]: mount: foreground disabled, child = false

Seems like 125 containers appears to be the hard limit. If I observe otherwise I will make sure to update. I have scaled the VM up to see if it was a hardware limitation, but it doesn't appear so. I also temporarily disabled selinux, but that had no effect either.

I can try to kick off and background the processes myself. Need to think through that a bit on implementation.

vibhansa-msft commented 7 months ago

if you manually mount using a script, does it still create the same limitation of 125 containers? Just to validate the theory of having some OS/hardware level limitations.

michaelschmit commented 7 months ago

Another thing I tried today was moving the blobfuse2 testing to a physical server. Interestingly enough, I was only able to mount 116 containers on the physical system, which is a beefier box (72 logical cores and 192GB of memory) than what I am using in Azure. I strace'd a working mount and one that failed. I am currently trying to look through the diff for a smoking gun, but haven't come up with anything yet.

I've tried running the mount from a script and that had no effect.

michaelschmit commented 7 months ago

It would be helpful if you could try to reproduce it from your side as well. If indeed you run into a ~100 container limit then perhaps that should be added to the limitation section of the README.md. I don't think it is unusual to have a large number of containers because from an Azure blob storage standpoint, if you want to delete data, you can delete an entire container at once or you can delete a single blob at a time (since directories don't really exist in blob storage). Deleting a container performs a lot faster than deleting hundreds or thousands of blobs.

michaelschmit commented 7 months ago

OK, after digging in the strace I found the culprit: <... inotify_init1 resumed>) = -1 EMFILE (Too many open files)

The setting that is hanging things up is: cat /proc/sys/fs/inotify/max_user_instances

Increasing this value via echo 256 | sudo tee /proc/sys/fs/inotify/max_user_instances or via /etc/sysctl.conf allows all the containers to mount.

michaelschmit commented 7 months ago

With that said, the currently implementation is probably not long term scalable for us (with the daemon overhead for every mount), but this at least is a work around in the short/intermediate term to get us by.

vibhansa-msft commented 7 months ago

Thanks for sharing this. This is great insight and I really appreciate you digging deep to figure this out. I believe if it's about 'inotify' then this limit might be coming as we register for file-change-notification on our config file so that user can change some of the settings dynamically in config file and blobfuse can reconfigure itself on the fly (not all configs can be modified dynamically).

As I mentioned earlier mounting entire storage account in one daemon is the long-term solution which is already in our todo list. Let me bring this up to our PM and see if we can priorities that item.

vibhansa-msft commented 7 months ago

For now blobfuse is working as expected and mounting too many containers is an issue I agree. I will close this item here and we will check on mounting entire account in one shot seperatly. Feel free to reopen if there is anything else you need from blobfuse end to fix.

For the documentation part, your feedback is well received, and we will update our README for the same.

vibhansa-msft commented 7 months ago

@michaelschmit Let me now if the linked PR provides sufficient info on this or not.

michaelschmit commented 7 months ago

The PR seems sufficient for documentation.

If it is the inotify on the config, another option would be a setting in the config to toggle the dynamic loading functionality. It is pretty quick and easy to unmount and mount the container to pick up config changes. But don't feel pressured to execute on that. If you don't think that is useful we can always modify the source ourselves in the future.

vibhansa-msft commented 7 months ago

Unmount and mount are not acceptable to many of our customers as unmount means wipe out the local cache. There are reasons why we chose to load the config dynamically for small changes in config while changes like storage account or container still needs a remount. Once we have the mount of an storage account as a feature it will save lot of resources as well on a given VM/node.

Azure / azure-storage-fuse