Closed fengyuleidian0615 closed 5 years ago
Hi @Andiry
Just wondering, could you please help to review this change to add huge block allocation?
Thanks!
Thank you for posting the patch! Yes, I will review it ASAP. For the past two weeks I was busy with my thesis and defense.
ok, got it :)
This requires the request to specify a 512-page allocation explicitly. Does the application ask for a 512-page allocation if it wants a huge page mmap?
The user does not explicitly request a huge allocation; this is done automatically in the following scenario.
For the non-mmap case, the allocated block size depends on how much data the user wants to write, so if the user requests a 2MB write, we first try to allocate a 2MB-aligned block.
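A minimal sketch of that size-driven decision. The helper name and the alignment check are illustrative, not NOVA's actual code; it only shows when a 2MB-first attempt makes sense for a write:

```c
#include <assert.h>
#include <stddef.h>

#define SZ_2M (2UL << 20)

/* Decide whether a write of `len` bytes at file offset `pos` should
 * first attempt a 2MB-aligned allocation (illustrative heuristic:
 * only worthwhile when the write covers a full, aligned 2MB unit). */
static int try_huge_first(size_t pos, size_t len)
{
    return (pos % SZ_2M == 0) && (len >= SZ_2M);
}
```

If the aligned attempt fails, the write path would simply fall back to ordinary 4KB allocation.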
As for the deadlock issue, I will think more about it, thanks for the pointer. Please don't hesitate to share your thoughts anyway :)
I do have a thought about the allocation. This patch performs an O(n) search for 2MB ranges. Considering that rb_next() is O(log n), the actual complexity is O(n log n). In theory it should not be too bad, since we merge the nodes and keep the red-black tree compact, but I would like to have an O(1) or O(log n) allocator, as allocation is performed frequently in NOVA.
Here is the idea: During initialization, break the NVM range into 2MB blocks and manage them with a linked list (or a red-black tree).
For a 2MB allocation, simply grab a 2MB block and return it. For a 4KB allocation, we can either allocate a 2MB block and break it into 512 4KB pages, or allocate from existing sharded 4KB pages. We can use a red-black tree to manage the sharded pages within each broken 2MB block.
For deallocation, a 2MB block is simply added back to the linked list. Small pages are merged with other existing free 4KB pages, and if they form a complete 2MB block, it is added back to the linked list. How does it sound?
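A user-space sketch of the allocator shape described above. All names here (struct huge_block, free_2mb, partial) are hypothetical, not NOVA code, and the 4KB-merge path on free is omitted for brevity; it only demonstrates the O(1) 2MB pop and the break-into-512-pages path:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGES_PER_2MB 512

/* One 2MB region; once broken, small_free counts free 4KB pages in it. */
struct huge_block {
    uint64_t base_pfn;          /* first 4KB page frame of the region    */
    int small_free;             /* free 4KB pages (512 == fully free)    */
    struct huge_block *next;    /* singly linked free/partial list       */
};

static struct huge_block *free_2mb; /* fully free 2MB blocks            */
static struct huge_block *partial;  /* broken blocks with free 4KB pages */

static void init_range(uint64_t start_pfn, int nr_2mb)
{
    for (int i = 0; i < nr_2mb; i++) {
        struct huge_block *b = malloc(sizeof(*b));
        b->base_pfn = start_pfn + (uint64_t)i * PAGES_PER_2MB;
        b->small_free = PAGES_PER_2MB;
        b->next = free_2mb;
        free_2mb = b;
    }
}

/* 2MB allocation: O(1) pop from the free list. */
static struct huge_block *alloc_2mb(void)
{
    struct huge_block *b = free_2mb;
    if (b)
        free_2mb = b->next;
    return b;
}

/* 4KB allocation: prefer a partially broken block, else break a new one. */
static uint64_t alloc_4kb(void)
{
    struct huge_block *b = partial;
    if (!b) {
        b = alloc_2mb();
        if (!b)
            return 0;           /* out of space */
        b->next = partial;
        partial = b;
    }
    b->small_free--;
    uint64_t pfn = b->base_pfn + b->small_free; /* simplified placement */
    if (b->small_free == 0)
        partial = b->next;      /* fully used: drop from partial list   */
    return pfn;
}
```

A real version would track per-block free pages in a red-black tree (as suggested) so a freed 4KB page can find its siblings and re-form a 2MB block.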
Thanks for sharing your comments and deep thoughts, Andiry! I have updated the patch per your suggestions; please help to review.
Before heading in the direction this patch took, I honestly considered several alternatives, and the complexity grew as my thoughts drifted. My first thought was the naive one: make the allocator 2MB-aware only. That is easy to implement, but it wastes far too much space in the small-file case. The next idea was to set aside dedicated space for 2MB blocks, managed by an rb-tree or a linked list as you suggested, on a per-CPU basis. However, it was not clear to me how much space to reserve for the 2MB pool (or whether to add a knob/module parameter for it), and more importantly, the block node state would need to be saved gracefully, which is a bit intrusive at the moment. Honestly, out of respect for the original design, I chose the more natural, minimal way to fit into the current design; that is how this patch was born.
Yes, ideally a more robust, full-fledged allocator is needed to serve different scenarios for a future persistent memory filesystem. I thought about a simple but scalable buddy-system style allocator too; that could be fun for NOVA. At the moment I am just ramping up on your work bit by bit. I will put more effort into the existing design, and I look forward to your suggestions.
Sure, I agree with you that we should improve NOVA piece by piece. This is a great start and I really appreciate your help. I will review your patches, perhaps by this weekend.
v2:
@Andiry Thanks for taking the time to review! Indeed. I pushed an updated version based on your comments, refined the code comments, and uploaded the new test log.
Please check it out!
v3:
@Andiry Please review the updated version as per your suggestions. I will send you the test case by mail.
v4:
Please review, thanks!
v5:
This is a follow-up of https://github.com/NVSL/linux-nova/pull/64.
To leverage huge page mapping when doing mmap, both the virtual address range and the physical block must be aligned at a huge page boundary.
This patch aims to support 2M allocation first, as 1G huge page mapping is not yet supported at the fs/dax level.
The approach is straightforward: try a huge allocation first, and fall back to an unaligned allocation if unlucky.
Note: We don't need to hard-code NOVA_DEFAULT_BLOCK_TYPE to 2M when initializing the inode structure; the page fault handler itself will try a huge mapping first, then fall back to a PTE mapping if the huge mapping is not possible.
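The try-huge-then-fallback allocation pattern can be sketched as below. This is a toy bump allocator over a pfn range, not NOVA's free-list code; the function names (alloc_aligned_2mb, alloc_unaligned, alloc_blocks) are illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define PAGES_PER_2MB 512UL

/* Toy bump allocator over a flat pfn range; purely illustrative. */
static uint64_t cursor = 10;        /* deliberately unaligned start */
static uint64_t range_end = 4096;

/* Try to carve a 2MB-aligned 512-page run; return 0 on failure. */
static uint64_t alloc_aligned_2mb(void)
{
    uint64_t pfn = (cursor + PAGES_PER_2MB - 1) &
                   ~(uint64_t)(PAGES_PER_2MB - 1);
    if (pfn + PAGES_PER_2MB > range_end)
        return 0;
    cursor = pfn + PAGES_PER_2MB;
    return pfn;
}

/* Unaligned fallback: hand out whatever comes next. */
static uint64_t alloc_unaligned(uint64_t nr)
{
    uint64_t pfn = cursor;
    if (pfn + nr > range_end)
        return 0;
    cursor += nr;
    return pfn;
}

/* The pattern from the patch description: attempt the huge-aligned
 * path first, fall back to an unaligned allocation if it fails. */
static uint64_t alloc_blocks(uint64_t nr, int *huge)
{
    if (nr >= PAGES_PER_2MB) {
        uint64_t pfn = alloc_aligned_2mb();
        if (pfn) {
            *huge = 1;
            return pfn;
        }
    }
    *huge = 0;
    return alloc_unaligned(nr);
}
```

When the aligned carve succeeds, the returned range satisfies the physical-alignment half of the huge-mapping requirement; the fs/dax fault path then checks the virtual side.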
Test fsdax trace log:
fs-write-12957 [063] .... 3780.549974: dax_pmd_fault: dev 259:0 ino 0xd0 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x7fa7bf800000 vm_start 0x7fa7bf800000 vm_end 0x7fa7bfc00000 pgoff 0x0 max_pgoff 0x3ff
fs-write-12957 [063] .... 3780.553204: dax_pmd_insert_mapping: dev 259:0 ino 0xd0 shared write address 0x7fa7bf800000 length 0x200000 pfn 0x13722600 DEV|MAP radix_entry 0x11ec8000e
fs-write-12957 [063] .... 3780.553212: dax_pmd_fault_done: dev 259:0 ino 0xd0 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x7fa7bf800000 vm_start 0x7fa7bf800000 vm_end 0x7fa7bfc00000 pgoff 0x0 max_pgoff 0x3ff NOPAGE
Signed-off-by: Fan Du fan.du@intel.com