
Use as SWAP #3

snshn opened this issue 9 years ago

snshn commented 9 years ago

I was wondering whether it would be possible to host a swap partition within vramfs, or somehow patch vramfs to make it work as a swap partition?

My drive is encrypted, therefore I don't use swap partitions... but if this thing could give me 3GB or so of a swap-like fs, we could be onto something...

Do you think it could work without FUSE, natively?

Oh, and great idea behind vramfs, really neat!

Overv commented 9 years ago

It's possible to implement a block device with OpenCL backing it. It could probably be developed pretty quickly with something like BUSE.

ptman commented 9 years ago

If you can provide a block device, then you can also build RAID-0 on top of the block devices.
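
For illustration, striping two such VRAM-backed block devices into RAID-0 would be the usual mdadm invocation (a sketch; the /dev/nbd* names are placeholders for whatever devices such a driver would expose):

# Combine two VRAM-backed block devices into a single striped array.
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nbd0 /dev/nbd1
mkfs.ext4 /dev/md0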

Overv commented 9 years ago

@ptman That is a great point. I'm going to look into writing a kernel module to do this tomorrow. I've tried BUSE, but it seems to be bottlenecking because it's based on the network block device interface.

snshn commented 9 years ago

A kernel module and some kind of analogue of swapon/swapoff would make this thing look very serious.

Both FUSE and BUSE would definitely only slow things down.

Good luck @Overv, thanks for sharing!

Overv commented 9 years ago

I've done some preliminary testing with BUSE and trivial OpenCL code. The read speed is 1.1 GB/s and the write speed 1.5 GB/s with ext4. Writing my own kernel module is going to take more time, and it'll still require a userspace daemon to interact with OpenCL.
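
For context, throughput numbers like these can be measured along these lines (a sketch; it assumes the BUSE-backed device shows up as /dev/nbd0 and uses an arbitrary mount point):

# Format and mount the userspace-backed block device, then measure
# sequential write and read throughput with O_DIRECT to bypass the page cache.
mkfs.ext4 /dev/nbd0
mount /dev/nbd0 /mnt/vramblk
dd if=/dev/zero of=/mnt/vramblk/testfile bs=1M count=1024 oflag=direct
dd if=/mnt/vramblk/testfile of=/dev/null bs=1M iflag=direct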

snshn commented 9 years ago

Wow, very good news, @Overv!

I think the daemon is necessary just to provide the proper RAID support across multiple vramfs-based block devices and to control the amount of memory dedicated per adapter... I believe a package named vramfs-tools containing vramfsd and vramfsctl could fit the purpose...

Wondering what @torvalds will think of this project, maybe it'll end up being included in the tree like tmpfs... 4GB of VRAM on my Linux laptop feels like such a waste... bet I'm not the only one who feels that way.

Thanks for your work, once again!

agrover commented 9 years ago

If you want a userspace-backed block (SCSI) device, I would encourage you to look at TCMU, which was just added to Linux 3.18. It's part of the LIO kernel target. Using it along with the loopback fabric and https://github.com/agrover/tcmu-runner may fill in some missing pieces. tcmu-runner handles the "you need a daemon" part, so the work would just consist of a VRAM-backed plugin for servicing SCSI commands like READ and WRITE. Then you'd have the basic block device, for swap or a filesystem or whatever.

(tcmu-runner is still alpha, but I think it would save you from writing kernel code and a daemon from scratch. Feedback welcome.)

bisqwit commented 4 years ago

While it is technically possible to create a file on VRAMFS and use it as swap, this is risky: What happens if VRAMFS itself, or one of the GPU libraries, gets swapped? This can happen in a low-memory situation, i.e. exactly the situation that swap is designed to help with. The kernel cannot possibly know that restoring data from the swap depends on the data that is… swapped in the swap. This is not an issue for kernel-space filesystem/storage drivers because the kernel's own RAM never gets swapped, but it is a conundrum for user-space stuff.

j123b567 commented 4 years ago

For a kernel-space driver, it would be nice to use TTM/GEM directly to allocate video RAM buffers.

bisqwit commented 4 years ago

What are TTM/GEM?

Note that the slram/phram/mtdblock thing can only access at most like 256 MB of the memory, the size of the memory window (I guess) of the PCI device.

j123b567 commented 4 years ago

I don't know much, but they are interfaces for accessing GPU memory inside the kernel. So they can see all of the GPU memory, not only the part that is directly mapped and accessible. https://www.kernel.org/doc/html/latest/gpu/drm-mm.html

My situation: a dedicated NVIDIA GPU with 4GB of RAM and the nouveau driver without OpenCL support. This memory is not mapped into the address space, so I can't use it via slram/phram.

dhalsimax commented 3 years ago

It's possible to implement a block device with OpenCL backing it. It could probably be developed pretty quickly with something like BUSE.

The easy way to accomplish this is to use vramfs as is: make a file on the vramfs disk, then set up a loop device on that file, format the loop device with mkswap, and then swapon it. With this method everything seemed to work when I tried it. Anyway, the big issue with using FUSE or BUSE is that both run in user space, and user space is swappable. I have not tried it, but suppose the memory of the vramfs process itself gets swapped out by the kernel; how would the kernel be able to recover from a page fault when it needs that very process to reload the data in the first place? I am curious what would happen then.
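
A sketch of that workflow (the mount point, file path, and sizes here are illustrative):

# Mount vramfs, create a fully allocated backing file, wrap it in a loop
# device, then format and enable it as swap.
mkdir -p /mnt/vram
vramfs /mnt/vram 3G &
dd if=/dev/zero of=/mnt/vram/swapfile bs=1M count=3072
LOOP=$(losetup --find --show /mnt/vram/swapfile)
mkswap "$LOOP"
swapon "$LOOP"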

Edit: sorry, I hadn't read the earlier comments; bisqwit already explained this... Anyway, I tried using it as swap and after a while the system froze and needed a hard reboot (power off and on, sob)...

LHLaurini commented 3 years ago

What happens if VRAMFS itself, or one of the GPU libraries, gets swapped?

Couldn't mlockall be used to prevent vramfs from getting swapped?

montvid commented 3 years ago

Wonderful idea! I am running an old headless server with a 1 GB DDR3 AMD card supporting OpenCL 1.1. I can use all the video RAM since I only use SSH. Unfortunately vramfs does not let me create swap-file-based swap; I get "swapon: /mnt/vram/swapfile: swapon failed: Invalid argument". Can it be fixed? I see OpenCL 1.2 is merged into Mesa 20.3, so good times ahead for this project.

wonghang commented 3 years ago

It doesn't work for me, even though I tried to mlockall() the pages of the userspace program. I think the NVIDIA driver allocates some memory that still gets swapped. At some point, the computer gets into a deadlock when memory is low.

I also tried the BUSE / nbd approach. It didn't work for me either.

I think we need to get into the NVIDIA driver, carefully develop a block device kernel driver, and call these undocumented APIs:

cat /proc/kallsyms | grep rm_gpu_ops | sort -k 3
0000000000000000 t rm_gpu_ops_address_space_create  [nvidia]
0000000000000000 t rm_gpu_ops_address_space_destroy [nvidia]
0000000000000000 t rm_gpu_ops_bind_channel_resources    [nvidia]
0000000000000000 t rm_gpu_ops_channel_allocate  [nvidia]
0000000000000000 t rm_gpu_ops_channel_destroy   [nvidia]
0000000000000000 t rm_gpu_ops_create_session    [nvidia]
0000000000000000 t rm_gpu_ops_destroy_access_cntr_info  [nvidia]
0000000000000000 t rm_gpu_ops_destroy_fault_info    [nvidia]
0000000000000000 t rm_gpu_ops_destroy_session   [nvidia]
0000000000000000 t rm_gpu_ops_device_create [nvidia]
0000000000000000 t rm_gpu_ops_device_destroy    [nvidia]
0000000000000000 t rm_gpu_ops_disable_access_cntr   [nvidia]
0000000000000000 t rm_gpu_ops_dup_address_space [nvidia]
0000000000000000 t rm_gpu_ops_dup_allocation    [nvidia]
0000000000000000 t rm_gpu_ops_dup_memory    [nvidia]
0000000000000000 t rm_gpu_ops_enable_access_cntr    [nvidia]
0000000000000000 t rm_gpu_ops_free_duped_handle [nvidia]
0000000000000000 t rm_gpu_ops_get_channel_resource_ptes [nvidia]
0000000000000000 t rm_gpu_ops_get_ecc_info  [nvidia]
0000000000000000 t rm_gpu_ops_get_external_alloc_ptes   [nvidia]
0000000000000000 t rm_gpu_ops_get_fb_info   [nvidia]
0000000000000000 t rm_gpu_ops_get_gpu_info  [nvidia]
0000000000000000 t rm_gpu_ops_get_non_replayable_faults [nvidia]
0000000000000000 t rm_gpu_ops_get_p2p_caps  [nvidia]
0000000000000000 t rm_gpu_ops_get_pma_object    [nvidia]
0000000000000000 t rm_gpu_ops_has_pending_non_replayable_faults [nvidia]
0000000000000000 t rm_gpu_ops_init_access_cntr_info [nvidia]
0000000000000000 t rm_gpu_ops_init_fault_info   [nvidia]
0000000000000000 t rm_gpu_ops_memory_alloc_fb   [nvidia]
0000000000000000 t rm_gpu_ops_memory_alloc_sys  [nvidia]
0000000000000000 t rm_gpu_ops_memory_cpu_map    [nvidia]
0000000000000000 t rm_gpu_ops_memory_cpu_ummap  [nvidia]
0000000000000000 t rm_gpu_ops_memory_free   [nvidia]
0000000000000000 t rm_gpu_ops_own_page_fault_intr   [nvidia]
0000000000000000 t rm_gpu_ops_p2p_object_create [nvidia]
0000000000000000 t rm_gpu_ops_p2p_object_destroy    [nvidia]
0000000000000000 t rm_gpu_ops_pma_alloc_pages   [nvidia]
0000000000000000 t rm_gpu_ops_pma_free_pages    [nvidia]
0000000000000000 t rm_gpu_ops_pma_pin_pages [nvidia]
0000000000000000 t rm_gpu_ops_pma_register_callbacks    [nvidia]
0000000000000000 t rm_gpu_ops_pma_unpin_pages   [nvidia]
0000000000000000 t rm_gpu_ops_pma_unregister_callbacks  [nvidia]
0000000000000000 t rm_gpu_ops_query_caps    [nvidia]
0000000000000000 t rm_gpu_ops_query_ces_caps    [nvidia]
0000000000000000 t rm_gpu_ops_release_channel   [nvidia]
0000000000000000 t rm_gpu_ops_release_channel_resources [nvidia]
0000000000000000 t rm_gpu_ops_report_non_replayable_fault   [nvidia]
0000000000000000 t rm_gpu_ops_retain_channel    [nvidia]
0000000000000000 t rm_gpu_ops_retain_channel_resources  [nvidia]
0000000000000000 t rm_gpu_ops_service_device_interrupts_rm  [nvidia]
0000000000000000 t rm_gpu_ops_set_page_directory    [nvidia]
0000000000000000 t rm_gpu_ops_stop_channel  [nvidia]
0000000000000000 t rm_gpu_ops_unset_page_directory  [nvidia]

to create a GPU session and allocate GPU memory in order to make a GPU swap truly possible.

azureblue commented 2 years ago

Hi guys, any update on this? Has anyone been able to reliably use VRAM as swap?

bisqwit commented 2 years ago

It only works if the following two conditions are met: 1) The GPU driver code/data is never put in swap. 2) The vramfs driver code/data is never put in swap. If you can somehow guarantee both, then using VRAM as swap will work.

montvid commented 2 years ago

Did not work for me the one time I tried it. Seems the project is abandoned...

wonghang commented 2 years ago

FUSE should be able to avoid swapping itself, but when I attempted to add mlockall() to the vramfs code, it didn't work either. It appears that the GPU driver (NVIDIA) and the CUDA libraries were swapped out.

In the NVIDIA driver, there are some undocumented functions (prefixed with rm_; run cat /proc/kallsyms | grep nvidia to see them) for accessing GPU memory. I think they are part of GPUDirect RDMA (https://docs.nvidia.com/cuda/gpudirect-rdma/index.html). If we can somehow hook into them and write a kernel driver to handle the paging, it may be possible to use the GPU as swap.

Atrate commented 2 years ago

It is possible to achieve this, see https://wiki.archlinux.org/title/Swap_on_video_RAM , section FUSE.

The vramfs driver code/data is never put in swap.

This can be achieved with https://wiki.archlinux.org/title/Swap_on_video_RAM#Complete_system_freeze_under_high_memory_pressure

I tested it under high memory pressure (stress -m 10 --vm-bytes 3G --vm-hang 10 on a 32G system) and it didn't fall over, but only after applying the aforementioned fix.
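
For reference, a rough sketch of that workaround: run vramfs from a systemd unit that is not allowed to use swap. The unit name, mount point, and size below are made up for illustration; see the wiki page for the authoritative version.

# Install a unit that keeps the vramfs process out of swap via cgroup limits.
cat > /etc/systemd/system/vramfs.service <<'EOF'
[Unit]
Description=VRAM-backed FUSE filesystem

[Service]
Type=simple
ExecStartPre=/usr/bin/mkdir -p /mnt/vram
ExecStart=/usr/bin/vramfs /mnt/vram 3G
MemorySwapMax=0
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now vramfs.service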

bisqwit commented 2 years ago

This looks like a proper solution indeed.

Atrate commented 1 year ago

I've tried implementing mlockall. If you want to, you can test whether it works for you and fixes deadlocks without needing to use a systemd service.

https://github.com/Overv/vramfs/pull/32

twobombs commented 1 year ago

I would like to add to this discussion that the addition of vramfs as a block device would help using vramfs as a dedicated L2ARC ZFS buffer.

We are using very big dedicated NVMe swap RAID arrays for quantum computing and need something faster than 8-16 NVMe sticks in RAID to collect the I/O in a buffer that is not in main memory.

We make use of a lot of (virtual) memory, so an L2ARC buffer in VRAM would be awesome; the GPUs would get a new lease on life, since we moved to CPU-only calculation because of the huge memory requirements for storing the eigenvector (think 8/16TB).

Atrate commented 1 year ago

I would like to add to this discussion that the addition of vramfs as a block device would help using vramfs as a dedicated L2ARC ZFS buffer.

We are using very big dedicated NVMe swap RAID arrays for quantum computing and need something faster than 8-16 NVMe sticks in RAID to collect the I/O in a buffer that is not in main memory.

We make use of a lot of (virtual) memory, so an L2ARC buffer in VRAM would be awesome; the GPUs would get a new lease on life, since we moved to CPU-only calculation because of the huge memory requirements for storing the eigenvector (think 8/16TB).

@twobombs

You can make a loop device with losetup, but NVMe RAID will probably be faster than VRAM swap; the performance is still somewhat lacking in certain areas.

twobombs commented 1 year ago

@Atrate thank you very much for the loop solution. I will look into this and into whether ZFS will allow a loop device as cache. The swap I/O usage pattern is random read/write, not streaming. A PCIe VRAM device might offer better speeds while at the same time making the workload on the NVMe RAID devices more 'stream'-lined when changes are committed to the array.

twobombs commented 1 year ago

I went a step further and added VRAM cache files for ZFS-based swap. It is fairly hilarious to see the I/O come through in NVTOP.

(Screenshot from 2023-02-01: I/O visible in NVTOP.)
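
A minimal sketch of that setup, assuming an existing pool (here called tank, a placeholder) and a file on the vramfs mount:

# Back a loop device with a file on vramfs and add it to the pool as an
# L2ARC (cache) device; zpool accepts any block device here, loop devices included.
dd if=/dev/zero of=/mnt/vram/l2arc.img bs=1M count=3072
LOOP=$(losetup --find --show /mnt/vram/l2arc.img)
zpool add tank cache "$LOOP"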

aedalzotto commented 1 year ago

It is possible to achieve this, see https://wiki.archlinux.org/title/Swap_on_video_RAM , section FUSE.

The vramfs driver code/data is never put in swap.

This can be achieved with https://wiki.archlinux.org/title/Swap_on_video_RAM#Complete_system_freeze_under_high_memory_pressure

I tested it under high memory pressure (stress -m 10 --vm-bytes 3G --vm-hang 10 on a 32G system) and it didn't fall over, but only after applying the aforementioned fix.

The solution seems to work for me, but when I increase swappiness from 10 to 180, it simply freezes. The same happens without increasing swappiness when running mprime.

I am running vramfs as a service, as the workaround cited above suggests. The only thing I think I am doing differently is using a loopback device, since my swapfile is being created with holes.

Does anyone have an idea of what is happening?

UPDATE: I went through the journal from the last boot, and it reported the following error: kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] ERROR [CRTC:82:crtc-0] hw_done or flip_done timed out

Atrate commented 1 year ago

In reply to: https://github.com/Overv/vramfs/issues/3#issuecomment-1663130681

As suggested by fanzhuyifan and others above, I think that may be due to other GPU-management processes/libraries getting swapped out. Maybe a fix is possible with a lot of systemd unit editing, but that would require tracking down every single library and process required for the operation of a dGPU, and that seems like a chore.

fanzhuyifan commented 1 year ago

According to the documentation of mlockall,

mlockall() locks all pages mapped into the address space of the calling process. This includes the pages of the code, data, and stack segment, as well as shared libraries, user space kernel data, shared memory, and memory-mapped files. All mapped pages are guaranteed to be resident in RAM when the call returns successfully; the pages are guaranteed to stay in RAM until later unlocked.

So shared libraries directly used by vramfs being swapped out should not be the reason for the system freezes.

Edit: Examining the resident size and virtual memory size of the vramfs process, I think the issue is that vramfs asks for additional memory to serve reads/writes.

Atrate commented 1 year ago

Edit: Examining the resident size and virtual memory size of the vramfs process, I think the issue is that vramfs asks for additional memory to serve reads/writes.

Is it? mlockall is called with the MCL_CURRENT | MCL_FUTURE flags, so it should also prevent all future allocations of memory from being swapped, unless I misunderstood the documentation.

Code in vramfs: https://github.com/Overv/vramfs/blob/829b1f2c259da2eb63ed3d4ddef0eeddb08b99e4/src/vramfs.cpp#L534

Documentation:

       MCL_CURRENT
              Lock all pages which are currently mapped into the address
              space of the process.

       MCL_FUTURE
              Lock all pages which will become mapped into the address
              space of the process in the future.  These could be, for
              instance, new pages required by a growing heap and stack
              as well as new memory-mapped files or shared memory
              regions.

fanzhuyifan commented 1 year ago

Edit: Examining the resident size and virtual memory size of the vramfs process, I think the issue is that vramfs asks for additional memory to serve reads/writes.

Is it? mlockall is called with the MCL_CURRENT | MCL_FUTURE flags, so it should also prevent all future allocations of memory from being swapped, unless I misunderstood the documentation.

Here are the steps to prove my point (on Linux):

  1. Start vramfs, say creating a filesystem with size 2000MB, and find the PID of the process.
  2. Run cat /proc/PID/status | grep Vm to find the memory information. On a particular run on my computer I got

VmPeak:  7060808 kB
VmSize:  7060808 kB
VmLck:   6990308 kB
VmPin:         0 kB
VmHWM:    275588 kB
VmRSS:    275588 kB
VmData:   144976 kB
VmStk:       164 kB
VmExe:       132 kB
VmLib:     14156 kB
VmPTE:       628 kB
VmSwap:        0 kB

  3. Write random data to a file on the vramfs, and check memory usage again. First run dd if=/dev/random of=/tmp/vram/swapfile bs=1M count=1000, and then I got:

VmPeak:  7585096 kB
VmSize:  7388488 kB
VmLck:   7317988 kB
VmPin:         0 kB
VmHWM:    286148 kB
VmRSS:    286148 kB
VmData:   156092 kB
VmStk:       164 kB
VmExe:       132 kB
VmLib:     14156 kB
VmPTE:       668 kB
VmSwap:        0 kB

Note that VmPeak, VmSize, VmLck, VmHWM, VmRSS, VmData and VmPTE all increased.

  4. Let's read that file and check memory usage again. First run sha256sum /tmp/vram/swapfile, and then I got:

VmPeak:  7585096 kB
VmSize:  7462220 kB
VmLck:   7391720 kB
VmPin:         0 kB
VmHWM:    296072 kB
VmRSS:    296072 kB
VmData:   165960 kB
VmStk:       164 kB
VmExe:       132 kB
VmLib:     14156 kB
VmPTE:       692 kB
VmSwap:        0 kB

VmSize, VmLck, VmHWM, VmRSS, VmData and VmPTE increased again (VmPeak stayed the same).

I believe this proves that vramfs asks for more memory when serving read and write requests. I am not saying the extra memory is swapped. I am just saying that sometimes it asks for extra memory to serve read and write requests. I suspect that this is the reason the computer freezes when using vramfs as swap, even with the mlockall call. In a system with high memory pressure, the OS tries to swap certain memory pages to vramfs. To serve this request, vramfs needs to perform some writes to the vram, and in the process asks for more memory. Since there is already no available memory, the system freezes.
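
One way to observe this (a sketch; it assumes a single vramfs process and the same /tmp/vram mount as above):

# In one terminal, watch the size fields of the vramfs process...
watch -n1 'grep -E "VmSize|VmLck|VmRSS|VmData" /proc/$(pidof vramfs)/status'
# ...while generating reads/writes against the mount in another terminal.
dd if=/dev/random of=/tmp/vram/testfile bs=1M count=1000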

jnturton commented 11 months ago

Since there is already no available memory, the system freezes.

Wouldn't we see OOM Killer entries in the kernel logs in this case?