NVSL / linux-nova

NOVA is a log-structured file system designed for byte-addressable non-volatile memories, developed at the University of California, San Diego.
http://nvsl.ucsd.edu/index.php?path=projects/nova

fs,nova: Allow huge block(2M) allocation #67

Closed fengyuleidian0615 closed 5 years ago

fengyuleidian0615 commented 6 years ago

This is a follow-up of https://github.com/NVSL/linux-nova/pull/64.

To leverage huge page mapping with mmap, both the virtual address range and the physical block must be aligned to a huge page boundary.

This patch supports 2M allocation first, since 1G huge page mapping is not yet supported at the fs/dax level.

The approach is straightforward: try a huge (aligned) allocation first, and fall back to an unaligned allocation if that fails.

Note: We don't need to hard-code NOVA_DEFAULT_BLOCK_TYPE to 2M when initializing the inode structure. The page fault handler itself tries a huge mapping first, then falls back to a PTE mapping if the huge mapping cannot be used.

Test fsdax trace log:

fs-write-12957 [063] .... 3780.549974: dax_pmd_fault: dev 259:0 ino 0xd0 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x7fa7bf800000 vm_start 0x7fa7bf800000 vm_end 0x7fa7bfc00000 pgoff 0x0 max_pgoff 0x3ff
fs-write-12957 [063] .... 3780.553204: dax_pmd_insert_mapping: dev 259:0 ino 0xd0 shared write address 0x7fa7bf800000 length 0x200000 pfn 0x13722600 DEV|MAP radix_entry 0x11ec8000e
fs-write-12957 [063] .... 3780.553212: dax_pmd_fault_done: dev 259:0 ino 0xd0 shared WRITE|ALLOW_RETRY|KILLABLE|USER address 0x7fa7bf800000 vm_start 0x7fa7bf800000 vm_end 0x7fa7bfc00000 pgoff 0x0 max_pgoff 0x3ff NOPAGE
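For readers who want to check the alignment requirement against the trace, here is a minimal userland sketch. It is not code from the patch; the macro and helper names (`HUGE_SIZE_2M`, `pmd_mappable`, `align_up_huge`) are made up for illustration, and it assumes 4KB base blocks.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative constants, not NOVA's own definitions. */
#define BLOCK_SIZE_4K   4096ULL
#define HUGE_SIZE_2M    (2ULL * 1024 * 1024)
#define BLOCKS_PER_HUGE (HUGE_SIZE_2M / BLOCK_SIZE_4K)   /* 512 */

/* A 2MB mapping needs both the virtual and the physical address
 * to sit on a 2MB boundary. */
static bool pmd_mappable(uint64_t vaddr, uint64_t paddr)
{
    return ((vaddr | paddr) & (HUGE_SIZE_2M - 1)) == 0;
}

/* Round a 4KB block number up to the next 2MB-aligned block number. */
static uint64_t align_up_huge(uint64_t blocknr)
{
    return (blocknr + BLOCKS_PER_HUGE - 1) & ~(BLOCKS_PER_HUGE - 1);
}
```

With the values from the trace, `pmd_mappable(0x7fa7bf800000, 0x13722600ULL << 12)` holds: the faulting address is 2MB aligned, and the pfn is a multiple of 512.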

Signed-off-by: Fan Du <fan.du@intel.com>

fengyuleidian0615 commented 6 years ago

Hi @Andiry

Just wondering, could you please help to review this change to add huge block allocation?

Thanks!

Andiry commented 6 years ago

Thank you for posting the patch! Yes, I will review it ASAP. For the past two weeks I was busy with my thesis and defense.

fengyuleidian0615 commented 6 years ago

ok, got it :)

fengyuleidian0615 commented 6 years ago

> This requires the request to specify a 512-page allocation explicitly. Does the application ask for a 512-page allocation if it wants a huge page mmap?

The user does not explicitly request a huge allocation; it happens automatically in the following scenario.

  1. The application mmaps a file.
  2. It then writes to the mapped virtual address.
  3. A page fault is triggered, and the generic page fault handler tries a PMD fault.
  4. In the PMD fault path, we need a PMD-sized block to back the virtual address.
  5. We allocate a PMD-sized (2MB) block; this is the code path this fix targets.
  6. If lucky, we get a 2MB huge mapping; otherwise we fall back to a PTE fault.
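The decision in steps 3 to 6 can be sketched roughly as follows. This is a simplified userland model, not the kernel's fault handler: the real fs/dax code is more involved, and `choose_fault`, `SKETCH_PMD_SIZE`, and the `got_aligned_2m_block` flag are hypothetical names standing in for "the file system managed to hand back a 2MB-aligned block".

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PMD_SIZE (2ULL * 1024 * 1024)   /* 2MB, illustrative name */

enum fault_kind { FAULT_PMD, FAULT_PTE };

/* Decide whether a fault at addr can be served with one 2MB mapping. */
static enum fault_kind choose_fault(uint64_t addr,
                                    uint64_t vm_start, uint64_t vm_end,
                                    bool got_aligned_2m_block)
{
    /* Start of the 2MB-aligned window that contains the faulting address. */
    uint64_t pmd_addr = addr & ~(SKETCH_PMD_SIZE - 1);

    /* The whole 2MB window must lie inside the mapped region. */
    if (pmd_addr < vm_start || pmd_addr + SKETCH_PMD_SIZE > vm_end)
        return FAULT_PTE;

    /* Without an aligned 2MB block behind it, fall back to 4KB mappings. */
    if (!got_aligned_2m_block)
        return FAULT_PTE;

    return FAULT_PMD;
}
```

Feeding in the trace values (address and vm_start 0x7fa7bf800000, vm_end 0x7fa7bfc00000) gives `FAULT_PMD` when an aligned block is available and `FAULT_PTE` when it is not.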

For the non-mmap case, the allocated block size depends on the amount of data the user wants to write, so if the user requests a 2MB write, we try to allocate a 2MB-aligned block first.

As for the deadlock issue, I will think more about it, thanks for the pointer. Please don't hesitate to share your thoughts anyway :)

Andiry commented 6 years ago

I do have a thought about the allocation. This patch performs an O(n) search for 2MB ranges. Considering that rb_next() is O(log n), the actual complexity is O(n log n). In theory it should not be too bad, since we merge the nodes and keep the red-black tree compact, but I would like to have an O(1) or O(log n) allocator, as allocation is performed frequently in NOVA.
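To make the cost concrete, here is a rough sketch of the kind of linear search being discussed. It is not the patch's code: a sorted array of free ranges stands in for the red-black tree walk, and `struct free_range`, `HUGE_BLOCKS`, and `find_aligned_huge` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define HUGE_BLOCKS 512ULL   /* 4KB blocks per 2MB, illustrative */

/* One free extent, in 4KB block numbers, both ends inclusive. */
struct free_range {
    uint64_t low;
    uint64_t high;
};

/* Walk the free ranges in order and return the first 2MB-aligned block
 * number that has 512 free blocks behind it, or -1 if none exists.
 * Every range may have to be visited, hence the O(n) walk. */
static int64_t find_aligned_huge(const struct free_range *r, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* First 2MB boundary at or after the start of this range. */
        uint64_t start = (r[i].low + HUGE_BLOCKS - 1) & ~(HUGE_BLOCKS - 1);

        if (start + HUGE_BLOCKS - 1 <= r[i].high)
            return (int64_t)start;
    }
    return -1;   /* caller falls back to an unaligned allocation */
}
```

For free ranges [10, 400] and [500, 1100], the first range holds no aligned 2MB run, and the second yields block 512.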

Here is the idea: during initialization, break the NVM range into 2MB blocks and manage them with a linked list (or a red-black tree).

For a 2MB page, simply grab a 2MB block and return it. For a 4KB page, we can allocate a 2MB block and break it into 512 4KB pages, or we can allocate from existing shard 4KB pages. We can use a red-black tree to manage the shard pages in each broken 2MB block.

For deallocation, a 2MB block is simply added back to the linked list. Small pages merge with other existing free 4KB pages, and if they form a full 2MB page, it is added back to the linked list. How does that sound?
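A toy version of this two-level scheme could look like the sketch below. It is only one reading of the idea under simplifying assumptions: a fixed pool of four 2MB blocks, an index stack in place of the linked list, and a per-block usage array in place of the red-black tree of shard pages. All names (`struct pool`, `alloc_huge`, `alloc_small`, and so on) are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

#define NR_HUGE        4     /* toy pool size */
#define PAGES_PER_HUGE 512   /* 4KB pages per 2MB block */

enum blk_state { BLK_FREE, BLK_HUGE, BLK_BROKEN };

struct huge_blk {
    enum blk_state state;
    int used;                          /* 4KB pages in use when broken */
    bool page_used[PAGES_PER_HUGE];
};

struct pool {
    struct huge_blk blk[NR_HUGE];
    int free_stack[NR_HUGE];           /* indices of free 2MB blocks */
    int nr_free;
};

static void pool_init(struct pool *p)
{
    memset(p, 0, sizeof(*p));          /* every block starts as BLK_FREE */
    for (int i = 0; i < NR_HUGE; i++)
        p->free_stack[p->nr_free++] = i;
}

/* O(1): pop a whole 2MB block. Returns its index, or -1 if none is left. */
static int alloc_huge(struct pool *p)
{
    if (p->nr_free == 0)
        return -1;
    int i = p->free_stack[--p->nr_free];
    p->blk[i].state = BLK_HUGE;
    return i;
}

/* O(1): push a whole 2MB block back. */
static void free_huge(struct pool *p, int i)
{
    p->blk[i].state = BLK_FREE;
    p->free_stack[p->nr_free++] = i;
}

/* Take a 4KB page from a broken block, breaking a fresh one if needed.
 * Returns a pool-wide 4KB page number, or -1 when the pool is exhausted.
 * The linear scans here are where a tree or list would go in practice. */
static long alloc_small(struct pool *p)
{
    int b = -1;
    for (int i = 0; i < NR_HUGE; i++) {
        if (p->blk[i].state == BLK_BROKEN && p->blk[i].used < PAGES_PER_HUGE) {
            b = i;
            break;
        }
    }
    if (b < 0) {
        b = alloc_huge(p);
        if (b < 0)
            return -1;
        p->blk[b].state = BLK_BROKEN;
    }
    for (int j = 0; j < PAGES_PER_HUGE; j++) {
        if (!p->blk[b].page_used[j]) {
            p->blk[b].page_used[j] = true;
            p->blk[b].used++;
            return (long)b * PAGES_PER_HUGE + j;
        }
    }
    return -1;   /* not reached: the chosen block always has a free page */
}

/* Free a 4KB page; once all 512 pages are free again, the block
 * re-forms a 2MB page and goes back to the free stack. */
static void free_small(struct pool *p, long page)
{
    int b = (int)(page / PAGES_PER_HUGE);
    p->blk[b].page_used[page % PAGES_PER_HUGE] = false;
    if (--p->blk[b].used == 0)
        free_huge(p, b);
}
```

The 2MB path is O(1) in this sketch. The 4KB path still scans linearly, which is the part the proposed red-black tree (or a list of broken blocks) would speed up.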

fengyuleidian0615 commented 6 years ago

Thanks for sharing your previous comments and deep thoughts, Andiry! I have updated the patch per your request; please help review it.

Before settling on the direction this patch takes, I honestly considered several alternatives, and the complexity grew the further my thoughts drifted. My first thought was the naive one of making the allocator 2MB-aware only. That is easy to implement, but wastes far too much space in the small-file case. Then why not set aside space dedicated to 2MB blocks, managed by an rb tree or a linked list as you suggested, on a per-CPU basis? It is not clear to me how much space is sensible to set aside for 2MB blocks, or whether to build a knob or module parameter for it. More importantly, the block node state would have to be saved gracefully, which is a bit intrusive for the moment. Honestly speaking, out of respect for the original design, I chose the more natural, gentle way, the minimal and cleaner way to fit into the current design. That is how this patch was born.

Yes, ideally a more robust, full-fledged allocator is needed to serve the different scenarios of a future persistent memory file system. I also thought about a simple but scalable buddy-system style allocator, which might be fun for NOVA. At the moment I'm just ramping up on your work bit by bit. I will put more effort into the existing design, and I look forward to your suggestions.

Andiry commented 6 years ago

Sure, I agree with you that we should improve NOVA piece by piece. This is a great start and I really appreciate your help. I will review your patches, perhaps by this weekend.

fengyuleidian0615 commented 6 years ago

v2:

@Andiry Thanks for taking the time to review! I have pushed an updated version based on your comments, refined the code comments, and updated the test log.

Please check it out!

fengyuleidian0615 commented 6 years ago

v3:

@Andiry Please review the updated version as per your suggestions. I will send you the test case by mail.

fengyuleidian0615 commented 6 years ago

v4:

Please review, thank you.

fengyuleidian0615 commented 6 years ago

v5: