cxl-micron-reskit / famfs

This is the user space repo for famfs, the fabric-attached memory file system
Apache License 2.0

Can famfs support configuring a smaller page size? #65

Closed leeq2016 closed 2 months ago

leeq2016 commented 3 months ago

When I mount famfs on an 80G pmem device, it can only store about 30,000 files (these files are much smaller than 80G). Is this because every file must occupy at least one 2M page? Can famfs be configured for 4K page alignment, to avoid wasting memory when storing a large number of small files?

Error log:

--------> mkfs:
Capacity:
  Device capacity: 77.50G
  Bitmap capacity: 77.25G
  Sum of file sizes: 0.00G
  Allocated space: 0.00G
  Free space: 77.25G
Famfs log: 0 of 472597 entries used // I modified my local log size here.

---------> Number of files:
[root@famfs mnt]# find /mnt/famfs -type f|wc -l
37513
[root@famfs mnt]# du -sh ./famfs/
5.0G ./famfs/

---------> Failed to copy new file to famfs:
[root@famfs mnt]# famfs cp /home/test ./famfs/
bitmap_alloc_contiguous: alloc failed
famfs_file_alloc: Out of space!
famfs_mkfile: famfs_file_alloc(/mnt/famfs/test, size=6) failed
__famfs_cp: failed in famfs_mkfile
famfs_cp_multi: aborting copy due to error

jagalactic commented 3 months ago

This is definitely possible, but I'm not sure it makes sense. At one point early in the development of famfs, 2MiB alignment was broken which forced vma faults to be handled in 4K resolution. This caused analytics jobs under Ray to spend 66% of cpu time on vma lock contention. Failing to use 2MiB pages causes huge overheads when multiple processes are hammering on the same range of memory.

Because of this, and because the point of famfs is to provide a high performance interface to memory, we force huge page alignment and force allocations to be a multiple of the huge page size. We're not aware of any specific use cases that actually make sense using small files. If you want small files, why not just use ramfs or tmpfs? (ok, those can't share disaggregated memory, if that's important).

If you wanted to experiment with making this change, you would need to modify the code to change FAMFS_ALLOC_UNIT from 2MiB to 4KiB. That might expose other problems, or it might just work. If we added mainline support for allocation at 4K resolution, it would have to be an option, meaning FAMFS_ALLOC_UNIT could no longer be a #define - because 2MiB resolution is an absolute requirement for all the legitimate use cases that we are currently aware of.
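To make the space cost concrete, here is a rough back-of-the-envelope sketch in Python. It assumes every file consumes a whole number of allocation units, and that the 77.25G bitmap capacity from the mkfs output above is exact (the displayed value is probably rounded). The helper names are hypothetical and are not part of famfs.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

def round_up(size: int, alloc_unit: int) -> int:
    """Round size up to the next multiple of alloc_unit (ceiling division)."""
    return -(-size // alloc_unit) * alloc_unit

def max_small_files(capacity: int, alloc_unit: int) -> int:
    """Upper bound on file count if every file fits in one allocation unit."""
    return capacity // alloc_unit

# Assumption: bitmap capacity taken from the mkfs output quoted in the issue.
capacity = int(77.25 * GIB)

print(round_up(6, 2 * MIB))                # 2097152: a 6-byte file takes 2 MiB
print(max_small_files(capacity, 2 * MIB))  # 39552
print(max_small_files(capacity, 4 * KIB))  # 20250624
```

Under these assumptions the 2MiB unit caps the file system at about 39,552 files. That is consistent with the report of running out of space at 37,513 files, since any file larger than 2MiB takes more than one unit.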

If you do have a use case in mind (other than testing) that needs 4K pages, please tell us about it.

Thanks

leeq2016 commented 3 months ago

This is definitely possible, but I'm not sure it makes sense. At one point early in the development of famfs, 2MiB alignment was broken which forced vma faults to be handled in 4K resolution. This caused analytics jobs under Ray to spend 66% of cpu time on vma lock contention. Failing to use 2MiB pages causes huge overheads when multiple processes are hammering on the same range of memory.

Because of this, and because the point of famfs is to provide a high performance interface to memory, we force huge page alignment and force allocations to be a multiple of the huge page size. We're not aware of any specific use cases that actually make sense using small files. If you want small files, why not just use ramfs or tmpfs? (ok, those can't share disaggregated memory, if that's important).

If you wanted to experiment with making this change, you would need to modify the code to change FAMFS_ALLOC_UNIT from 2MiB to 4KiB. That might expose other problems, or it might just work. If we added mainline support for allocation at 4K resolution, it would have to be an option, meaning FAMFS_ALLOC_UNIT could no longer be a #define - because 2MiB resolution is an absolute requirement for all the legitimate use cases that we are currently aware of.

If you do have a use case in mind (other than testing) that needs 4K pages, please tell us about it.

Thanks

Thank you for your answer :)

Based on your explanation, assuming that my usage scenario is just multiple clients accessing famfs in a read-only manner, will using 4K pages cause additional overhead?

jagalactic commented 3 months ago

There is nothing specific to famfs that makes 4K pages a problem. The issue is that virtual-to-physical address mappings are cached in TLBs and page tables, but those get flushed every time a process is scheduled out. On a page table miss, the vma services a page fault, and asks famfs where the memory is for a specific file and offset. In the famfs case, the page always exists, so the fault is always resolved without blocking - but a lock is taken in the mm subsystem (not by famfs, by linux), and that lock can be a performance problem.

If we allow 4K mappings, there are 512x as many of these minor page faults. This particularly becomes a bottleneck if many processes are accessing the same set of files concurrently. If your use case does not have concurrent file access from many processes, you might not observe any performance impact. If there is an impact, it would be the same with pretty much any file system.
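The 512x figure is simply the ratio of the two page sizes. A minimal Python sketch, assuming the worst case of one minor fault per page the first time a process touches it (no fault-around or prefetching), shows the count for a 1 GiB file. The function name is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

def faults_to_touch(file_size: int, page_size: int) -> int:
    """Worst-case minor faults for one process to touch every page once."""
    return -(-file_size // page_size)  # ceiling division

huge = faults_to_touch(1 * GIB, 2 * MIB)   # 512 faults with 2 MiB pages
small = faults_to_touch(1 * GIB, 4 * KIB)  # 262144 faults with 4 KiB pages

print(huge, small, small // huge)  # 512 262144 512
```

Each of those faults takes the mm lock described above, so the extra faults matter most when many processes map the same files.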

If you do try recompiling with FAMFS_ALLOC_UNIT set to 4K, please report back and let me know 1) whether that's easy to get working, and 2) whether you observed any performance problems while doing that.

Regards, John

leeq2016 commented 3 months ago

Some problems I encountered when changing FAMFS_ALLOC_UNIT to 4K:

  1. famfs_file_init_dax needs its alignment size changed to 4K.
  2. When I copy a large number of files, the performance of bitmap_alloc_contiguous becomes a very significant bottleneck. I have opened a PR with an optimization: https://github.com/cxl-micron-reskit/famfs/pull/69
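For readers unfamiliar with why a contiguous bitmap allocator slows down at 4K granularity, here is a small Python sketch. It is not the famfs implementation and not necessarily what the PR above does. It only contrasts a first-fit scan that always starts at bit 0 with one that remembers where the last allocation ended. The class, the `hint` field, and the method names are all hypothetical.

```python
class Bitmap:
    def __init__(self, nbits: int):
        self.bits = bytearray(nbits)  # one byte per bit, for clarity only
        self.hint = 0                 # hypothetical: where the next search starts

    def _scan(self, begin: int, n: int) -> int:
        """First-fit search for n free bits starting at begin; -1 if none."""
        run = 0
        for i in range(begin, len(self.bits)):
            if self.bits[i]:
                run = 0
                continue
            run += 1
            if run == n:
                start = i - n + 1
                for j in range(start, i + 1):
                    self.bits[j] = 1  # mark the run as allocated
                return start
        return -1

    def alloc_naive(self, n: int) -> int:
        """Always rescans from bit 0, so a full prefix is walked every time."""
        return self._scan(0, n)

    def alloc_hinted(self, n: int) -> int:
        """Starts at the hint, then falls back to a full scan from bit 0."""
        start = self._scan(self.hint, n)
        if start < 0 and self.hint > 0:
            start = self._scan(0, n)
        if start >= 0:
            self.hint = start + n
        return start

bm = Bitmap(8)
print(bm.alloc_hinted(3), bm.alloc_hinted(3), bm.alloc_hinted(2))  # 0 3 6
```

With a 4K unit, the bitmap in this thread would have about 20 million bits, so rescanning the allocated prefix for each of several hundred thousand files grows roughly quadratically in the naive version.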

After I solved these two points, famfs seemed to be able to copy more than 300,000 files correctly, and the space usage was consistent with ext4. However, my use case has not run yet, maybe because famfs currently does not support symbolic links :<

jagalactic commented 3 months ago

So changing FAMFS_ALLOC_UNIT wasn't too bad - glad to hear that! I'm thinking about how we might make that a parameter at famfs mkfs time.

I'm curious how you got 300K files when the default metadata log size only has space for 25573 entries. Did you increase the log size?

Also, I actually took out symlink support in v2 of the famfs kernel patch set - it was in v1. Sorry - I wasn't aware of a use case for symlinks in famfs. You could patch it back in if you're motivated to try this. Look for the famfs_symlink() function in fs/famfs/famfs_inode.c in the famfs_v1 branch, and add it in the famfs_v2 branch (including in inode_operations).

leeq2016 commented 3 months ago

So changing FAMFS_ALLOC_UNIT wasn't too bad - glad to hear that! I'm thinking about how we might make that a parameter at famfs mkfs time.

I'm curious how you got 300K files when the default metadata log size only has space for 25573 entries. Did you increase the log size?

Yes, I have set the log size to 256M and FAMFS_MAX_PATHLEN to 320 (in my use case the longest path reached 310+), so I can create about 400K files.
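As a rough cross-check of these numbers, the Python sketch below assumes that the mkfs output at the top of the thread ("0 of 472597 entries used") came from this same 256M, 320-byte-path configuration. That is an assumption and is not stated in the thread. The sketch also ignores the exact log header layout.

```python
MIB = 1024 * 1024

log_size = 256 * MIB        # log size reported in this comment
entries_reported = 472597   # from the mkfs output quoted in the issue

# Implied bytes per log entry, if both figures describe the same file system.
implied_entry_size = log_size // entries_reported
leftover = log_size - entries_reported * implied_entry_size

print(implied_entry_size)  # 568
print(leftover)            # 360 bytes, plausibly header space (a guess)
```

If that assumption holds, each log entry takes about 568 bytes, and "about 400K files" sits comfortably below the 472,597-entry limit.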

Also, I actually took out symlink support in v2 of the famfs kernel patch set - it was in v1. Sorry - I wasn't aware of a use case for symlinks in famfs. You could patch it back in if you're motivated to try this. Look for the famfs_symlink() function in fs/famfs/famfs_inode.c in the famfs_v1 branch, and add it in the famfs_v2 branch (including in inode_operations).

I have patched famfs_symlink() into famfs.ko and added support for symbolic links to famfs cp and famfs logplay. It runs OK. :-)

jagalactic commented 3 months ago

I can add symlink support back into the v3 kernel patch set (no date for that yet, but it will be coming)

If you are interested in sending your FAMFS_ALLOC_UNIT patch, I'm interested in generalizing it into a famfs mkfs option. No guarantee as to how quickly, but it seems like a useful addition.

Thanks!

leeq2016 commented 3 months ago

I can add symlink support back into the v3 kernel patch set (no date for that yet, but it will be coming)

If you are interested in sending your FAMFS_ALLOC_UNIT patch, I'm interested in generalizing it into a famfs mkfs option. No guarantee as to how quickly, but it seems like a useful addition.

Thanks!

Do you mean the symlink patch? Very few modifications are needed to support FAMFS_ALLOC_UNIT=4096, but it took me some effort to support symlinks in famfs cp.