Looks like in the .blobfuse2 directory for the user, the containers that fail do not get a pid in the mount*.pid files
Hi @michaelschmit, thanks for reaching out to the Blobfuse team.
Is there any particular reason to mount all 130 containers? This will cause 130 instances of blobfuse to run on the same VM/node and may create a resource crunch on the system.
I'll answer your bullets and then add another comment about the debugging I have done.
Yes, I see blobfuse2 uses syslog. However, I believe that is only the blobfuse2 process that syncs the mount. I believe the initial mount operations only output to stdout/stderr. Please correct me if I am wrong.
Yes, I am seeing the config.yaml and .pid files in the .blobfuse2 directory. I was just noting that for the containers that fail to mount, the config.yaml files are equivalent to those of the working containers, but the .pid files for the ones that fail are empty. The ones that succeed have a PID populated. This seems to indicate that the pipeline to create a new process is failing.
The ones that fail to mount do get a *.pid file, but it is empty.
Yes, we have separate mount and cache directories. We are using the "mount all" command, so the empty /mount directory gets created with directories for all the containers, and the empty /data/cache directory gets created with directories for all the containers as well (the config we pass is sketched after these answers).
Our storage account is already set up to use a large number of containers. From the research I have done, I don't see a limit on the number of mount points for an Azure VM, or any limit in blobfuse, that would prevent this from working. If there were a way to use blobfuse to mount all the containers at the storage-account level, that would be better, but I'm not aware that option exists.
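For reference, the shared config we pass to "mount all" is roughly shaped like the sketch below. The account name, key, and cache path are placeholders, and the component list just follows the blobfuse2 sample config, so the exact keys may differ by version:

# Rough shape of the shared config used with "blobfuse2 mount all"
# (account name, key, and paths below are placeholders):
cat > /mount/config.yaml <<'EOF'
logging:
  level: log_debug
  type: syslog

components:
  - libfuse
  - file_cache
  - attr_cache
  - azstorage

file_cache:
  path: /data/cache

azstorage:
  type: block
  account-name: <storage-account>
  account-key: <storage-key>
  mode: key
EOF

blobfuse2 mount all /mount/blobfuse --config-file=/mount/config.yaml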
Here is more information about what I am seeing. When I kick off the "blobfuse2 mount all" operation the first ~126 succeed without issue but the last 5-6 fail with:
Failed to mount container xxxxxxxxx : Error:
This occurs whether I am mounting with "su -l [user]", with "sudo -u [user]", or just running as the current user. With that said, I am going to change the title of this issue to reflect that.
Like I mentioned, when I diff the config_*.yaml files the only differences are the container names. For the container that failed to mount, I get a .pid file, but it is empty. If I manually try to mount a failed container with "blobfuse2 mount /mount/xxxxxxx --config-file=[home dir]/.blobfuse2/config_*.yaml", it still fails with the same empty error.
What I discovered just now is that if I unmount a previously mounted container with "blobfuse2 unmount /mount/xxxxxx", I am able to mount a previously failed container. This likely indicates some threshold issue, where removing one mount then allows another to succeed. I am going to do some more research to see if I am reaching a mount threshold in the Azure VM or something within blobfuse.
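In case it helps anyone reproduce this, these are the kinds of checks I am running to see whether I'm near a mount or file-handle ceiling (standard tools, nothing blobfuse-specific; the mount-max sysctl only exists on newer kernels):

# Count current blobfuse2 mounts and total mounts on the system:
mount | grep -c blobfuse2
wc -l < /proc/mounts

# System-wide ceilings that could plausibly be involved:
sysctl fs.file-max
sysctl fs.mount-max 2>/dev/null   # not present on older kernels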
Another interesting data point: in a previous instance where I was debugging with slightly fewer mounted containers (~128 vs ~131), 2 containers failed to mount, but I was eventually able to get them to mount individually over many attempts.
Since the error returned is empty, I have been trying to figure out where the failure is occurring. I am looking at the individual mount command (mountCmd), since the failure happens on subsequent single mount commands as well; I think that narrows the scope a little. I never see the critical message "Starting Blobfuse2 Mount", so I have to assume for now that it is not getting to that point. I have also experimented with disabling monitoring, just in case that had something to do with it, using the config:
health_monitor:
enable-monitoring: false
But perhaps that is disabled by default.
OK, I see that I was wrong about the stdout/stderr statement. Here are the syslogs during the mount operation:
Mar 12 18:28:50 [host] blobfuse2[526721]: LOG_INFO [mount.go (415)]: mount: Mounting blobfuse2 on /mount/xxxxxx
Mar 12 18:28:50 [host] blobfuse2[526721]: LOG_DEBUG [mount.go (453)]: mount: foreground disabled, child = false
Mar 12 18:28:50 [host] blobfuse2[526721]: LOG_INFO [mount.go (471)]: mount: Child [526732] terminated from /mount/xxxxxx
Had another instance where some of the mounts failed during the initial mount all, but then I was able to get one of them individually mounted. The rest failed no matter how many attempts were made.
The difference I see between a failed mount and a successful one is the log line:
Failed:
LOG_DEBUG [mount.go (453)]: mount: foreground disabled, child = false
Success:
LOG_DEBUG [mount.go (453)]: mount: foreground disabled, child = true
Seems like I may be running into a limitation in the go-daemon library that is being imported. I've looked at ulimit -a, but max user processes is 62840, so I am probably not running into that.
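For completeness, the check I ran was roughly this (standard commands, nothing blobfuse-specific):

# Per-user process ceiling vs. processes the mounting user is actually running:
ulimit -u
ps --no-headers -u "$USER" | wc -l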
Thanks for providing the detailed info.
Can you enable debug-level logging through your config file or the CLI, so that we get more details on why some mounts fail?
The log lines I posted above are with this setting in the config_*.yaml:
logging:
level: log_debug
type: syslog
The only debug line that provides any clue is the LOG_DEBUG [mount.go (453)]: mount: foreground disabled, child = false
It seems like 125 containers is the hard limit; if I observe otherwise I will make sure to update. I scaled the VM up to see if it was a hardware limitation, but it doesn't appear to be. I also temporarily disabled SELinux, but that had no effect either.
I can try to kick off and background the processes myself. Need to think through that a bit on implementation.
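The rough idea would be a per-container loop instead of "mount all", something like the sketch below; the container names and config paths are placeholders, and each container would need its own config (or the per-container config_*.yaml files that "mount all" generates):

# Sketch of mounting containers one at a time instead of "mount all"
# (container names and config-file paths are placeholders):
for c in container1 container2 container3; do
    blobfuse2 mount "/mount/blobfuse/$c" \
        --config-file="$HOME/.blobfuse2/config_$c.yaml"
done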
If you manually mount using a script, does it still hit the same 125-container limitation? Just to validate the theory that there is some OS/hardware-level limitation.
Another thing I tried today was moving the blobfuse2 testing to a physical server. Interestingly enough, I was only able to mount 116 containers on the physical system, which is a beefier box (72 logical cores and 192GB of memory) than what I am using in Azure. I strace'd a working mount and one that failed. I am currently trying to look through the diff for a smoking gun, but haven't come up with anything yet.
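For reference, the traces were captured roughly like this (container names and paths are placeholders):

# Trace a mount that works and one that fails, then diff the syscalls
# (container names and config paths are placeholders):
strace -f -o /tmp/mount_ok.trace blobfuse2 mount /mount/blobfuse/goodcontainer \
    --config-file="$HOME/.blobfuse2/config_good.yaml"
strace -f -o /tmp/mount_fail.trace blobfuse2 mount /mount/blobfuse/badcontainer \
    --config-file="$HOME/.blobfuse2/config_bad.yaml"
diff /tmp/mount_ok.trace /tmp/mount_fail.trace | less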
I've tried running the mount from a script and that had no effect.
It would be helpful if you could try to reproduce this on your side as well. If you do run into a ~100-container limit, then perhaps that should be added to the limitations section of the README.md. I don't think it is unusual to have a large number of containers: from an Azure Blob Storage standpoint, if you want to delete data, you can delete an entire container at once or delete one blob at a time (since directories don't really exist in blob storage), and deleting a container performs a lot faster than deleting hundreds or thousands of blobs.
OK, after digging into the strace I found the culprit: <... inotify_init1 resumed>) = -1 EMFILE (Too many open files)
The setting that is hanging things up is /proc/sys/fs/inotify/max_user_instances. Increasing this value, either via "echo 256 | sudo tee /proc/sys/fs/inotify/max_user_instances" or via /etc/sysctl.conf, allows all the containers to mount.
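For anyone who hits the same wall, this is roughly how I verified the limit and made the change persistent; the value 256 is just what comfortably covers ~130 mounts, so pick whatever fits your container count:

# Current limit (defaults to 128 on most distros) and system-wide usage:
cat /proc/sys/fs/inotify/max_user_instances
find /proc/*/fd -lname 'anon_inode:inotify' 2>/dev/null | wc -l

# Raise it persistently via a sysctl drop-in, then reload:
echo 'fs.inotify.max_user_instances = 256' | sudo tee /etc/sysctl.d/99-inotify.conf
sudo sysctl --system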
With that said, the current implementation is probably not scalable for us long term (given the daemon overhead for every mount), but this at least gives us a workaround in the short/intermediate term.
Thanks for sharing this. This is great insight, and I really appreciate you digging deep to figure it out. If it's about 'inotify', then I believe this limit comes from the fact that we register for file-change notifications on the config file, so that users can change some settings dynamically and blobfuse can reconfigure itself on the fly (not all configs can be modified dynamically).
As I mentioned earlier, mounting the entire storage account in one daemon is the long-term solution, and it is already on our todo list. Let me bring this up with our PM and see if we can prioritize that item.
For now blobfuse is working as expected, and I agree that mounting too many containers is an issue. I will close this item here, and we will track mounting an entire account in one shot separately. Feel free to reopen if there is anything else you need from the blobfuse side.
For the documentation part, your feedback is well received, and we will update our README for the same.
@michaelschmit Let me know if the linked PR provides sufficient info on this or not.
The PR seems sufficient for documentation.
If it is the inotify watch on the config, another option would be a config setting to toggle the dynamic reload functionality. It is pretty quick and easy to unmount and remount the container to pick up config changes. But don't feel pressured to implement that; if you don't think it is useful, we can always modify the source ourselves in the future.
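To be concrete, I was picturing something along these lines in the config; the section and key names here are purely hypothetical and do not exist in blobfuse2 today:

# hypothetical option, not an existing blobfuse2 setting:
config-watch:
  enabled: false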
Unmount and mount are not acceptable to many of our customers, as unmounting means wiping out the local cache. There are reasons why we chose to load the config dynamically for small changes, while changes like the storage account or container still need a remount. Once mounting an entire storage account is a feature, it will also save a lot of resources on a given VM/node.
Which version of blobfuse was used?
blobfuse2-2.2.1-1.x86_64
Which OS distribution and version are you using?
RHEL 8.9
If relevant, please share your mount command.
su -l [user] -c "blobfuse2 mount all /mount/blobfuse --config-file=/mount/config.yaml"
What was the issue encountered?
A handful of the ~130 containers fail to mount with an empty "ERROR: " message. Also, the permissions on those particular directories are different from those of the others that mounted. It will consistently fail to mount the same number of containers, but it won't be the same containers each time.
Have you found a mitigation/solution?
No
Please share logs if available.
Please ask for particular logs if needed. /var/log/blobfuse2.log doesn't seem to contain anything useful and the error message is empty as indicated above.