amazonlinux / amazon-linux-2023

Amazon Linux 2023
https://aws.amazon.com/linux/amazon-linux-2023/

[Bug] - Md bitmap writes to unallocated pages #797

Open meir6264 opened 2 months ago

meir6264 commented 2 months ago

Md bitmap writes to unallocated pages

A bug was discovered in the md driver: a bitmap write exceeds the page limit. The patch was already approved and merged upstream in md for 6.11 (commit ab99a87542f194f28e2364a42afbf9fb48b1c724). Can you please apply it to Amazon Linux 2023 as soon as possible? This bug may cause a crash.

Patch Description: __write_sb_page() rounds up the io size to the optimal io size if it doesn't exceed the data offset, but it doesn't check whether the final size exceeds the bitmap length.

For example:
    page count      - 1
    page size       - 4K
    data offset     - 1M
    optimal io size - 256K

The final io size would be 256K (64 pages), but md_bitmap_storage_alloc() allocated only 1 page, so the IO would write 1 valid page and 63 pages that happen to be allocated afterwards. This leaks memory to the raid device superblock.
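The arithmetic in the example can be checked with a small model. This is only a sketch, not the kernel code: the helper names `unchecked_io_size` and `clamped_io_size` are hypothetical, the write is assumed to start at offset 0 of the region in front of the data offset, and the clamp merely illustrates the missing "does the final size exceed the bitmap length" check described above.

```python
PAGE_SIZE = 4 * 1024  # 4K pages, as in the example above


def round_up(value: int, multiple: int) -> int:
    """Round value up to the next multiple of `multiple`."""
    return -(-value // multiple) * multiple


def unchecked_io_size(io_size: int, opt_io_size: int, data_offset: int) -> int:
    """Model of the buggy behaviour: round up to the optimal io size
    whenever the result still fits in front of the data offset.
    Assumes the write starts at offset 0 (a simplification)."""
    rounded = round_up(io_size, opt_io_size)
    return rounded if rounded <= data_offset else io_size


def clamped_io_size(io_size: int, opt_io_size: int, data_offset: int,
                    bitmap_bytes: int) -> int:
    """Same rounding, but never larger than the allocated bitmap.
    Illustrative only; not a copy of the upstream fix."""
    return min(unchecked_io_size(io_size, opt_io_size, data_offset), bitmap_bytes)


allocated_pages = 1            # md_bitmap_storage_alloc() allocated 1 page
opt_io_size = 256 * 1024       # 256K
data_offset = 1024 * 1024      # 1M

size = unchecked_io_size(PAGE_SIZE, opt_io_size, data_offset)
pages_written = size // PAGE_SIZE
stray_pages = pages_written - allocated_pages

print(f"io size: {size} bytes, pages written: {pages_written}, "
      f"pages beyond the bitmap: {stray_pages}")
print("clamped io size:",
      clamped_io_size(PAGE_SIZE, opt_io_size, data_offset,
                      allocated_pages * PAGE_SIZE))
```

With the values from the example, the unchecked size covers 64 pages, 63 of which lie outside the bitmap allocation, while the clamped size stays at a single page.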

This issue caused a data transfer failure in nvme-tcp. The network driver checks the first page of an IO with sendpage_ok(), which returns true if the page isn't a slab page and its refcount is >= 1. If the page is !sendpage_ok(), the network driver disables MSG_SPLICE_PAGES.

As of now the network layer assumes all the pages of the IO are sendpage_ok() when MSG_SPLICE_PAGES is on.

The bitmap pages aren't slab pages, so the first page of the IO is sendpage_ok(), but the additional pages that happen to be allocated after the bitmap pages might be !sendpage_ok(). That causes skb_splice_from_iter() to stop the data transfer; in the case below it hangs 'mdadm --create'.
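The mismatch between checking only the first page and later checking every page can be modelled in a few lines. This is a toy model, not kernel code: the `Page` class and the helpers `first_page_ok` and `first_rejected` are hypothetical, the pfn and slab values are taken from the log in this report, and a refcount of 1 is assumed for every page.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    pfn: int
    is_slab: bool
    refcount: int = 1  # assumed; the log only shows page_count for 0x53ef


def sendpage_ok(page: Page) -> bool:
    """Model of the rule described above: not a slab page and refcount >= 1."""
    return (not page.is_slab) and page.refcount >= 1


def first_page_ok(pages: List[Page]) -> bool:
    """What the network driver is described as doing: check only the first page."""
    return sendpage_ok(pages[0])


def first_rejected(pages: List[Page]) -> Optional[Page]:
    """What the splice path effectively does: stop at the first bad page."""
    for page in pages:
        if not sendpage_ok(page):
            return page
    return None


# The first page is the real bitmap page; the rest just happen to follow it.
io_pages = [
    Page(0x53EE, is_slab=False),
    Page(0x53EF, is_slab=True),
    Page(0x53F0, is_slab=False),
    Page(0x53F1, is_slab=False),
    Page(0x53F2, is_slab=False),
    Page(0x53F3, is_slab=True),
]

print("first page passes:", first_page_ok(io_pages))
bad = first_rejected(io_pages)
print("first rejected pfn:", hex(bad.pfn) if bad else None)
```

In this model the first-page check passes, so splicing stays enabled, yet the second page of the IO is a slab page and is rejected, which matches the sequence the author logged.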

The bug is reproducible. To reproduce it, we need nvme-over-tcp controllers with an optimal IO size bigger than PAGE_SIZE. Creating a raid with a bitmap over those devices reproduces the bug.

In order to simulate a large optimal IO size you can use dm-stripe with a single device. A script to reproduce the issue on top of brd devices using dm-stripe is attached below (it will be added to blktest).

I have added some logs to test the theory:

    ...
    md: created bitmap (1 pages) for device md127
    write_sb_page before md_super_write offset: 16, size: 262144. pfn: 0x53ee
    === write_sb_page before md_super_write. logging pages ===
    pfn: 0x53ee, slab: 0 <-- the only page that allocated for the bitmap
    pfn: 0x53ef, slab: 1
    pfn: 0x53f0, slab: 0
    pfn: 0x53f1, slab: 0
    pfn: 0x53f2, slab: 0
    pfn: 0x53f3, slab: 1
    ...
    nvme_tcp: sendpage_ok - pfn: 0x53ee, len: 262144, offset: 0
    skbuff: before sendpage_ok() - pfn: 0x53ee
    skbuff: before sendpage_ok() - pfn: 0x53ef
    WARNING at net/core/skbuff.c:6848 skb_splice_from_iter+0x142/0x450
    skbuff: !sendpage_ok - pfn: 0x53ef. is_slab: 1, page_count: 1

bjoernd commented 1 month ago

I tried running the reproducer from that upstream commit in AL2023. I am unable to reproduce any warning or bug. Have you actually experienced this issue in AL2023 with kernel 6.1?

meir6264 commented 1 month ago

Hi, I'm using 6.1.109-118.189.amzn2023 and I've noticed that it's missing this important fix. I've seen some problems that might be caused by writes to these unallocated pages (although I haven't proven it yet). Is there any reason not to patch it?

bjoernd commented 1 month ago

Sure, I was just inquiring about the urgency of the issue to help our kernel team prioritize this against other bug reports.