crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0
19.45k stars 1.62k forks source link

`File.read`ing a file bigger than 2GB results in an `OverflowError` #14485

Open koute opened 6 months ago

koute commented 6 months ago

Unlike in Ruby it seems like File.reading a file that's at least 2GB in size doesn't work.

Consider the following:

$ dd if=/dev/zero of=/tmp/dummy bs=1M count=2048
$ echo 'File.read("/tmp/dummy")' > test.cr
$ crystal ./test.cr
Unhandled exception: Arithmetic overflow (OverflowError)
  from /usr/lib/crystal/file.cr:748:26 in 'read'
  from /usr/lib/crystal/file.cr:739:3 in 'read'
  from test.cr:1:1 in '__crystal_main'
  from /usr/lib/crystal/crystal/main.cr:129:5 in 'main_user_code'
  from /usr/lib/crystal/crystal/main.cr:115:7 in 'main'
  from /usr/lib/crystal/crystal/main.cr:141:3 in 'main'
  from /usr/lib/libc.so.6 in '??'
  from /usr/lib/libc.so.6 in '__libc_start_main'

If you change the count=2048 to count=2047 then it works.

Related: https://github.com/crystal-lang/crystal/issues/3209

Crystal 1.11.2 (2024-03-02)

LLVM: 17.0.6
Default target: x86_64-pc-linux-gnu
straight-shoota commented 6 months ago

The limitation is based on Int32 being the type to represent the size of a collection (String in this case), so it can only have Int32::MAX elements.

Now I'm wondering if it really makes sense to actually have such a large String instance. It might be quite inconvenient because char indices are non-linear. So that specific detail might be up for debate.

But other collections, particularly Slice, should certainly be able to hold more than 2GB.

@koute Do you have a real use case where you need this or is it just a theoretical discussion? As a workaround, you can open the file and read it in chunks, performing operations on each individual chunk of data.

Related: https://github.com/crystal-lang/crystal/issues/8111#issuecomment-656319108, https://forum.crystal-lang.org/t/is-increasing-array-index-size-in-the-future/6610/2

straight-shoota commented 6 months ago

Also related is #4011. It was originally about getting no error when accessing an out-of-range array index (which is fixed). Now it's only about improving the error message to point out the reason for overflow. I suppose this would be an improvement for this use case as well.

koute commented 6 months ago

@koute Do you have a real use case where you need this or is it just a theoretical discussion?

Yes, this is a real use case.

I was looking to switch from Ruby to Crystal when writing my scripts to speed them up, and this was the very first issue I've encountered, which really surprised me (sure, I expected Crystal to not have automatic bigint promotion like Ruby has, but I didn't expect it to use 32-bit integers for something like this, which seems baffling to me considering how tiny 2GB is nowadays).

I use File.read in Ruby on big files all the time to process them, especially in my quick & dirty scripts. I know this can be worked around by e.g. reading the file in chunks, and if this was not a quick & dirty script I would certainly do that, but for quick one-off scripts I just want to minimize the friction while writing them.

HertzDevil commented 6 months ago

Not sure about Ruby's situation, but contiguous allocations in that size range will perform poorly with the Boehm GC, especially on Windows (contrast with #14395), so I don't think the standard library is going to accommodate them in the near future.

Memory-mapped I/O is a notable exception where you could have huge contiguous memory ranges without any allocation. To create a read-only view on Windows:

lib LibC
  PAGE_READONLY = 0x02

  FILE_MAP_READ = 0x0004

  fun CreateFileMappingA(hFile : HANDLE, lpFileMappingAttributes : SECURITY_ATTRIBUTES*, flProtect : DWORD, dwMaximumSizeHigh : DWORD, dwMaximumSizeLow : DWORD, lpName : LPSTR) : HANDLE
  fun MapViewOfFile(hFileMappingObject : HANDLE, dwDesiredAccess : DWORD, dwFileOffsetHigh : DWORD, dwFileOffsetLow : DWORD, dwNumberOfBytesToMap : SizeT) : Void*
  fun UnmapViewOfFile(lpBaseAddress : Void*) : BOOL
end

File.open(...) do |file|
  handle = Crystal::System::FileDescriptor.windows_handle(file.fd)
  size = file.size
  mapping = LibC.CreateFileMappingA(handle, nil, LibC::PAGE_READONLY,
    LibC::DWORD.new!(size >> 32), LibC::DWORD.new!(size), nil)
  view = LibC.MapViewOfFile(mapping, LibC::FILE_MAP_READ, 0, 0, 0).as(UInt8*)

  # this should be okay even if `size > Int32::MAX`
  # bytes = Bytes.new(view, size, read_only: true)
  # io = IO::Memory.new(bytes, writeable: false)

  LibC.UnmapViewOfFile(view)
  LibC.CloseHandle(mapping)
end

or on Unix-like systems:

File.open(...) do |file|
  size = file.size
  view = LibC.mmap(nil, size, LibC::PROT_READ, LibC::MAP_PRIVATE, file.fd, 0).as(UInt8*)

  # this should be okay even if `size > Int32::MAX`
  # bytes = Bytes.new(view, size, read_only: true)
  # io = IO::Memory.new(bytes, writeable: false)

  LibC.munmap(view, size)
end

If Slice does support 64-bit sizes, then bytes could probably act as a drop-in replacement for File.read or, more precisely, File.open(..., &.getb_to_end). io would also work as long as you don't need any IO::FileDescriptor-specific functionality.

ysbaddaden commented 6 months ago

@HertzDevil Nice! I've been wondering about mmap for this use case, and this looks exciting.

crysbot commented 6 months ago

This issue has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/built-in-support-for-mmap/6772/1

crysbot commented 6 months ago

This issue has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/built-in-support-for-mmap/6772/2

crysbot commented 3 weeks ago

This issue has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/is-increasing-array-index-size-in-the-future/6610/20