Open koute opened 6 months ago
The limitation is based on Int32
being the type to represent the size of a collection (String
in this case), so it can only have Int32::MAX
elements.
Now I'm wondering if it really makes sense to actually have such a large String
instance. It might be quite inconvenient because char indices are non-linear. So that specific detail might be up for debate.
But other collections, particularly Slice
, should certainly be able to hold more than 2GB.
@koute Do you have a real use case where you need this or is it just a theoretical discussion? As a workaround, you can open the file and read it in chunks, performing operations on each individual chunk of data.
Related: https://github.com/crystal-lang/crystal/issues/8111#issuecomment-656319108, https://forum.crystal-lang.org/t/is-increasing-array-index-size-in-the-future/6610/2
Also related is #4011. It was originally about getting no error when accessing an out-of-range array index (which is fixed). Now it's only about improving the error message to point out the reason for overflow. I suppose this would be an improvement for this use case as well.
@koute Do you have a real use case where you need this or is it just a theoretical discussion?
Yes, this is a real use case.
I was looking to switch from Ruby to Crystal when writing my scripts to speed them up, and this was the very first issue I've encountered, which really surprised me (sure, I expected Crystal to not have automatic bigint promotion like Ruby has, but I didn't expect it to use 32-bit integers for something like this, which seems baffling to me considering how tiny 2GB is nowadays).
I use File.read
in Ruby on big files all the time to process them, especially in my quick & dirty scripts. I know this can be worked around by e.g. reading the file in chunks, and if this was not a quick & dirty script I would certainly do that, but for quick one-off scripts I just want to minimize the friction while writing them.
Not sure about Ruby's situation, but contiguous allocations in that size range will perform poorly with the Boehm GC, especially on Windows (contrast with #14395), so I don't think the standard library is going to accommodate them in the near future.
Memory-mapped I/O is a notable exception where you could have huge contiguous memory ranges without any allocation. To create a read-only view on Windows:
lib LibC
PAGE_READONLY = 0x02
FILE_MAP_READ = 0x0004
fun CreateFileMappingA(hFile : HANDLE, lpFileMappingAttributes : SECURITY_ATTRIBUTES*, flProtect : DWORD, dwMaximumSizeHigh : DWORD, dwMaximumSizeLow : DWORD, lpName : LPSTR) : HANDLE
fun MapViewOfFile(hFileMappingObject : HANDLE, dwDesiredAccess : DWORD, dwFileOffsetHigh : DWORD, dwFileOffsetLow : DWORD, dwNumberOfBytesToMap : SizeT) : Void*
fun UnmapViewOfFile(lpBaseAddress : Void*) : BOOL
end
File.open(...) do |file|
handle = Crystal::System::FileDescriptor.windows_handle(file.fd)
size = file.size
mapping = LibC.CreateFileMappingA(handle, nil, LibC::PAGE_READONLY,
LibC::DWORD.new!(size >> 32), LibC::DWORD.new!(size), nil)
view = LibC.MapViewOfFile(mapping, LibC::FILE_MAP_READ, 0, 0, 0).as(UInt8*)
# this should be okay even if `size > Int32::MAX`
# bytes = Bytes.new(view, size, read_only: true)
# io = IO::Memory.new(bytes, writeable: false)
LibC.UnmapViewOfFile(view)
LibC.CloseHandle(mapping)
end
or on Unix-like systems:
File.open(...) do |file|
size = file.size
view = LibC.mmap(nil, size, LibC::PROT_READ, LibC::MAP_PRIVATE, file.fd, 0).as(UInt8*)
# this should be okay even if `size > Int32::MAX`
# bytes = Bytes.new(view, size, read_only: true)
# io = IO::Memory.new(bytes, writeable: false)
LibC.munmap(view, size)
end
If Slice
does support 64-bit sizes, then bytes
could probably act as a drop-in replacement for File.read
or, more precisely, File.open(..., &.getb_to_end)
. io
would also work as long as you don't need any IO::FileDescriptor
-specific functionality.
@HertzDevil Nice! I've been wondering about mmap for this use case, and this looks exciting.
This issue has been mentioned on Crystal Forum. There might be relevant details there:
https://forum.crystal-lang.org/t/built-in-support-for-mmap/6772/1
This issue has been mentioned on Crystal Forum. There might be relevant details there:
https://forum.crystal-lang.org/t/built-in-support-for-mmap/6772/2
This issue has been mentioned on Crystal Forum. There might be relevant details there:
https://forum.crystal-lang.org/t/is-increasing-array-index-size-in-the-future/6610/20
Unlike in Ruby it seems like
File.read
ing a file that's at least 2GB in size doesn't work.Consider the following:
If you change the
count=2048
tocount=2047
then it works.Related: https://github.com/crystal-lang/crystal/issues/3209