CTCaer / switch-l4t-kernel-4.9

only 39bit of virtual address space available #6

Closed theofficialgman closed 4 months ago

theofficialgman commented 4 months ago

Using 39 bits of virtual address space causes issues, primarily when emulating x64 software (such as Wine with box64), so it would be better to build the kernel with 48 bits of virtual address space. The following config options achieve that (not all of them need to be set explicitly, since they depend on one another).

# guarantee 4K pages, this is the default
CONFIG_ARM64_4K_PAGES=y
# enable 48bits virtual address space size
CONFIG_ARM64_VA_BITS_48=y
# set default virtual address space size to 48bits (default when CONFIG_ARM64_VA_BITS_48=y)
CONFIG_ARM64_VA_BITS=48

I have tested these options and they solve the issue I was seeing without causing any noticeable regressions (RAM usage, TLB miss latency, GPU and CPU performance).
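
For reference, a minimal sketch of what the two settings mean in terms of address-space size and page-table depth. It assumes 8-byte descriptors (so each 4 KiB table resolves 9 bits); the function names are illustrative and not taken from the kernel.

```python
PAGE_SHIFT = 12  # 4 KiB pages, as with CONFIG_ARM64_4K_PAGES=y


def va_space_bytes(va_bits):
    """Size in bytes of a virtual address space that is va_bits wide."""
    return 1 << va_bits


def translation_levels(va_bits, page_shift=PAGE_SHIFT):
    """Number of page-table levels needed to translate va_bits.

    Each table fills one page of 8-byte descriptors, so one level
    resolves (page_shift - 3) bits of the address.
    """
    bits_per_level = page_shift - 3
    bits_to_resolve = va_bits - page_shift
    return -(-bits_to_resolve // bits_per_level)  # ceiling division


for va_bits in (39, 48):
    gib = va_space_bytes(va_bits) // 2**30
    print(f"{va_bits}-bit VA: {gib} GiB, {translation_levels(va_bits)} levels")
```

With 4 KiB pages this gives 3 levels for 39 bits and 4 levels for 48 bits, which is the difference the rest of the thread argues about.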

CTCaer commented 4 months ago

Certainly not. If x64 Windows software emulation wants to stop being emulation software, it will not impose that burden on the device. The implications of using 4 levels of translation on an extremely aged CPU with tiny caches and low RAM are quite high. Especially when it's "just for fun/niche" and a less than subpar gaming experience because of that CPU.

No non-server aarch64 device uses that, or is even going to with a device curated kernel config. So someone can't just impose that change on millions of users, especially when it's easier to break sw made for the common used cfg than the other way around (it's very bad for server sw to assume max VA and break).

Every mmap implementation is designed to always give you any available mapping if the requested one is not possible. That's also how you find out the max VA supported (and not by randomly looking in procfs). It's not normal for sw to randomly decide one day to enforce a random new VA limit and want hundreds of TBs of VA instead of the half TB it had.

So if you want that fixed, that's the wrong fix and here's the wrong repo.

And putting aside the perf implications, you can't even 4k map more than 0.2% of it (yeah that's not 2%). Whereas the 39bit one at least can be mapped with only 1GB of RAM wasted.
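
One plausible reading of those two figures, sketched as arithmetic. It assumes 4 KiB pages, 8-byte last-level entries, and counts only the last-level tables (the upper levels add a few MiB at most); the helper names are made up for illustration.

```python
PAGE_SIZE = 4096  # bytes per page
PTE_SIZE = 8      # bytes per last-level page-table entry


def leaf_table_bytes(mapped_bytes):
    """Bytes of last-level page tables needed to map mapped_bytes with 4 KiB pages."""
    return (mapped_bytes // PAGE_SIZE) * PTE_SIZE


def mappable_fraction(va_bits, table_budget_bytes):
    """Fraction of a va_bits address space that table_budget_bytes of leaf tables can map."""
    mappable_bytes = (table_budget_bytes // PTE_SIZE) * PAGE_SIZE
    return min(1.0, mappable_bytes / (1 << va_bits))


one_gib = 1 << 30
print(leaf_table_bytes(1 << 39) // one_gib, "GiB of tables for all of 39-bit")
print(leaf_table_bytes(1 << 48) // one_gib, "GiB of tables for all of 48-bit")
print(f"{mappable_fraction(48, one_gib):.2%} of 48-bit covered by 1 GiB of tables")
```

Under these assumptions, the same 1 GiB of tables that covers the whole 39-bit space covers 1/512 of the 48-bit space, which is roughly the 0.2% quoted above.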

Any software should just check max and just go and normally use mmap with out of bounds or proper request and just use what it got back, since it's an available one.
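
The behaviour described here can be tried from userspace. The sketch below assumes a 64-bit Linux system and calls libc's `mmap` through `ctypes` (Python's own `mmap` module takes no address hint). It passes an address hint without `MAP_FIXED` and reports where the mapping actually landed; `map_with_hint` is a hypothetical helper name.

```python
import ctypes
import mmap

# Assumption: 64-bit Linux, where CDLL(None) exposes libc's mmap/munmap.
libc = ctypes.CDLL(None, use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]
libc.munmap.restype = ctypes.c_int
libc.munmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t]

MAP_FAILED = ctypes.c_void_p(-1).value  # (void *)-1 as an unsigned integer


def map_with_hint(hint, length=mmap.PAGESIZE):
    """Request an anonymous mapping at `hint` (no MAP_FIXED) and return the address obtained."""
    addr = libc.mmap(hint, length,
                     mmap.PROT_READ | mmap.PROT_WRITE,
                     mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS, -1, 0)
    if addr is None or addr == MAP_FAILED:
        raise OSError(ctypes.get_errno(), "mmap failed")
    libc.munmap(addr, length)  # this is only a probe, so release it again
    return addr


hint = 1 << 47  # inside a 48-bit address space, outside a 39-bit one
got = map_with_hint(hint)
print(f"asked for {hint:#x}, got {got:#x} ({got.bit_length()} bits)")
```

If the hint cannot be honoured, the call still returns a usable address. The bit length of the result is only a lower bound on the supported VA size, not a measurement of it.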

theofficialgman commented 4 months ago

> common used cfg

48bits is the common used config

Snippet from my Discord message:

Also, just an FYI: the upstream Linux kernel on arm64 uses 52-bit VA as the default, with a runtime fallback to 48 bits if the hardware doesn't support it. That change happened in February (https://github.com/torvalds/linux/commit/5d101654226d64ac0a6928019fbf476b46e9d14b), but the example defconfig has always had 48 bits enabled since 2016 (kernel 4.7). So it's not abnormal to have this, it is actually normal.

theofficialgman commented 4 months ago

also this:

tlb from lmbench (https://manpages.ubuntu.com/manpages/noble/en/man8/lmbench.8.html) reports TLB miss latency that is within margin of error between the kernels (both ~16ns with /usr/lib/lmbench/bin/tlb -c), maybe marginally higher with 48-bit VA (like 5%). It's hard to tell because the output has a variance of about +/- 1ns.

I've tested games and benchmarks and see no difference in performance at all. I'm sure there is some microbenchmark that will show a difference, but at a high level I am not seeing anything. If you have any microbenchmarks where you see a quantifiable difference, I would be interested in the results.

theofficialgman commented 4 months ago

> And putting aside the perf implications, you can't even 4k map more than 0.2% of it (yeah that's not 2%). Whereas the 39bit one at least can be mapped with only 1GB of RAM wasted.

That's true.

> Any software should just check max and just go and normally use mmap with out of bounds or proper request and just use what it got back, since it's an available one.

I agree, but it appears that at least "some" programs (both windows and linux) require 48bit address space, which is why Wine (non-fatally) spams the log when it is unable to detect 48 bits of available address space.

CTCaer commented 4 months ago

Don't spam, this is not a chat.

> 48bits is the common used config

That falls into "tell me you never did a bring up (or checked any aarch64 kernel) without telling me", btw. You can't just make that mistake when you see a single menu cfg or default cfg for hundreds of aarch64 SoCs. No one is gonna use that, the same way they didn't use 48 from the default cfg before. Mainline provides "working" defaults. Most of them are just plainly bad or not working at all, and are aimed at qemu/fpga/sim bring ups.

> lmbench

That it even showed a 5% difference, given how it tests that, makes it even worse.

I actually wrote the previous message in that detail to avoid indulging the "I found a new shiny magic thing that somehow no one tried and used before and did some random tests on averaging metrics and everything is ok, so now everyone must use it".

If you want to learn why that increases memory accesses by ~25-33% and the implications it has on instruction/data fetching when it misses, you can read one of the thousand papers explaining that, plus the core differences between a classic amd64 TLB and a Cortex-A57 r1p1 in aarch64 mode. Also why it's commonly suggested to always use 2MB aligned mappings on an amd64 CPU for performance critical apps (that includes games, ofc, which even do 1GB mappings to avoid that issue even more). And last but not least, what extra implications that has for normally backed pages and how RAM size affects it.
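
A rough way to arrive at a range like ~25-33%, offered as an interpretation and not as the author's derivation: count worst-case memory accesses on a TLB miss, assuming one read per translation level and no hits in any walk cache, with and without the final data access.

```python
def accesses_on_tlb_miss(levels, include_data_access=True):
    """Worst-case memory accesses for one TLB miss: one read per level, plus the data access."""
    return levels + (1 if include_data_access else 0)


def relative_increase(old, new):
    """Relative growth from old to new, as a fraction."""
    return (new - old) / old


walk_only = relative_increase(accesses_on_tlb_miss(3, False),
                              accesses_on_tlb_miss(4, False))
walk_plus_data = relative_increase(accesses_on_tlb_miss(3),
                                   accesses_on_tlb_miss(4))
print(f"walk only: +{walk_only:.0%}, walk plus data access: +{walk_plus_data:.0%}")
```

Going from 3 to 4 levels is +33% for the walk alone and +25% once the data access is counted. Real hardware caches intermediate walk results, so the measured cost depends on the workload and on the cache sizes.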

> at least "some" programs (both windows and linux) require 48bit address space

None do. Only badly coded/hardcoded ones that check the range for no reason or use wrong masking. As for translation layers and emulation, that's why they exist: send a mapping back and done. You don't even need to remap. Just set the address region end in these correctly, either dynamically or hardcoded, and see that "magically" everything works even if it whined before. Not even a native 48bit VA will give you the VA you requested if it's already mapped or cut off by another imposed limit, you just get an available VA back.
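
As an illustration of setting the address region end correctly, a translation layer could clamp the region it hands to the guest to the VA size it detected, instead of assuming 48 bits. `clamp_region` and its parameters are hypothetical and not taken from Wine or box64.

```python
def clamp_region(start, size, va_bits):
    """Clamp [start, start + size) so that it ends at or below the top of a va_bits space.

    Returns the (start, size) that actually fits.
    """
    limit = 1 << va_bits
    if start >= limit:
        raise ValueError("region starts above the available address space")
    end = min(start + size, limit)
    return start, end - start


# A layer that wanted the whole 48-bit range, running on a 39-bit kernel:
start, size = clamp_region(0x10000, 1 << 48, va_bits=39)
print(f"usable region: {start:#x} .. {start + size:#x}")
```

The `va_bits` value could be hardcoded per platform or derived at runtime from the addresses mmap returns.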

To sum up, your switch is not your pc and neither is an X1 elite or M1 (where that doesn't matter ofc because of 16KB pages) or whatever.

There are some nice use cases for extra VA without backing it with RAM, but this is not one of them. And even those use cases do not need to do that.