files app: large number of files in directory issues

stdweird commented 2 months ago

Using ondemand 3.1.7, we have 2 kind of issues when using using the files app:

it's slow (or doesn't even work when number of files is very large; eg 50k files)
the ruby dashboard process allocates lots of memory and doesn't release it (eg 500MB+ for 50k files)

The memory issue seems a "feature" of jbuilder caching the json structures somehow. The slowness and amount of memory result (imho) from a way too verbose json being generated.

In particular, for every file, 3 url strings are generated, but it would be better if they are generated on the browser end. The structure is sum of prefix of directory part of the url and the file name; and the download url might have some fixed suffix as well.

I think i can modify the jbuilder code to create a lighter json; but i am stuck with the javascript/templating stuff that happens on the browser side. In particular, in _file_action_menu.html.erb, the file data is somehow passed as data, but i don't know where that comes from and/or how it is evaluated (or generated): can we access the files javascript variables inside the {{data.something}} templates; or can we manipulate whatever data is coming from before the templating happens? (why is it not pure javascript, at least that might be more consistent and easier to read ;)

anyway, help welcome

stdweird commented 2 months ago

i did some more measurements and not sure what to think of it

the size of the json is not the real issue: eg 10k files results with this very verbose json only a 4.5MB download (takes 10s to create and download the json). bigger issue are that eg a firefox tab to show this then takes 700MB of ram, and is slow to render. everything seems nice and linear

excpept for my real problem: the ruby dashboard on the ondemand server process grows from 150MB idle to 180MB. still nothing to worry about, but there is no way to release this. if i then also open another folder with 30k files, the dahsboard process jumps to 330MB, without releasing the memory. larger folders, more memory usage, more added on top and not release. adding files to a folder that was already opened increases the memory with number of extra files (eg adding 1k files in a 30k files folder). renaming a folder that was opened before triggers an increase as well, but not as much as compared from scratch.

@johrstrom wouldn't it make more sense to only return eg the first 1-2k files in the json, and then showing some warning (and maybe provide some url arg ?showall=1 so users can bypass it if they know what they are doing).

johrstrom commented 2 months ago

Yea seems like we need to paginate very large directories.

stdweird commented 2 months ago

more debugging revealed the memory issue: generating the large files json, causes ruby to create a lot of objects (i guess 50 to 80 per file). these are recycled by the ruby GC but not released from memory. calling GC.start and following GC.stat shows this. there is no real way around this; and eg a 1k files pagination will keep this all under control.

stdweird commented 2 months ago

@johrstrom reading a bit more on pagination in datatables, i assume there is no "easy" fix to this. from what i see

client side datatables with pagination (still fetching json for all files), i presume this might solve the issue that when viewing large directories the browser mem usage blows up. but it will not fix the server side memory issue.
server side pagination is possible in theory, but you will have to move lot of the client logic to the server (the reloadtable method you now use needs to be seriously redesigned).
some mix where you try to hack both together might be possible, but will be ugly. you would need to be able to pass the pagination info the the reloadtable method, and this would have to play nice with whatever happens when pagination. not sure you can use the "ajax" option to eg set the start/offset parameters to some this._page_start and then also call reloadtable. you would still need to return a list of all files, but most entries in the list can be null; only the expected values for pagination need to be filled in)

to work around the memory issue on the server side, i might have a solution; but also needs some extra code: if we construct a sinlge large text blob in eg csv format, and send that as files data (instead of list of dicts), ruby doesn't have to make a lot of separate objects. the csv format will also be more compact then the current json, so that is also a nice benefit. strangely enough you can't read csv text in datatables; but converting the csv to json in the browser is not that hard (searching online gave some not very long javascript examples).

CSC-swesters commented 2 months ago

@stdweird I'm following your debugging with interest, thanks for looking into this!

construct a sinlge large text blob in eg csv format, and send that as files data (instead of list of dicts)

I think there's a risk of painting ourselves into a corner here with file names containing all sorts of characters. Since Linux file names can contain any bytes except the NUL byte ('\0'), and the forward slash (/), the CSV library that is used must be able to escape characters that are part of file names, so that they aren't interpreted as field separators or quotation characters that are part of the CSV format, for example.

I wanted to raise this concern, to be sure that you've considered it.

For the record, files with invalid UTF-8 bytes in their names are currently being discarded by OOD on the server side. See PR #2626 so these are not shipped to the client side.

stdweird commented 2 months ago

@CSC-swesters i have not considered anything ;) to generate the csv format we indeed should be aware of unicode stuff and test it. but indeed we should be careful. if we put the filename last column the splitting should be more robust (IF we can get rid of the urls that also have the name in them). to be clear, to avoid the overload of object being created, we probably need to generate the string ourself; not use some ruby gem for it.

johrstrom commented 2 months ago

At some point we'd likely rewrite this to use turbo_streams instead of json responses.

these are recycled by the ruby GC but not released from memory.

Isn't that the difference between free and available memory in Linux? I.e., these addresses are available but not free?

stdweird commented 2 months ago

@CSC-swesters wrt the utf8 issues, this might be the problem that the pun code runs with LANG=C. see https://github.com/OSC/ondemand/issues/3644

stdweird commented 2 months ago

@johrstrom here is some GC.stat data. so it look like 70 pages per file. the heap_available_slots is what is growing, and that is by design in the ruby GC. pages flow between used, marked and free, but they are not released back to the OS (it was easiest for me to add the the GC.stat logging in a application form, so that's why i use it here)

opening a interactive apps form. this is before and after a GC.start (this is about 113MB RSS)


App 3289729 output: [2024-09-27 09:57:16 +0200 ] ERROR "GC.stat {:count=>47, :time=>233, :heap_allocated_pages=>744, :heap_sorted_length=>744, :heap_allocatable_pages=>
0, :heap_available_slots=>303999, :heap_live_slots=>303521, :heap_free_slots=>478, :heap_final_slots=>0, :heap_marked_slots=>245763, :heap_eden_pages=>744, :heap_tomb_p
ages=>0, :total_allocated_pages=>744, :total_freed_pages=>0, :total_allocated_objects=>1271979, :total_freed_objects=>968458, :malloc_increase_bytes=>143264, :malloc_in
crease_bytes_limit=>16777216, :minor_gc_count=>38, :major_gc_count=>9, :compact_count=>0, :read_barrier_faults=>0, :total_moved_objects=>0, :remembered_wb_unprotected_o
bjects=>991, :remembered_wb_unprotected_objects_limit=>1624, :old_objects=>243137, :old_objects_limit=>357136, :oldmalloc_increase_bytes=>12834032, :oldmalloc_increase_
bytes_limit=>16777216}"
App 3289729 output: [2024-09-27 09:57:16 +0200 ] ERROR "GC.stat {:count=>48, :time=>286, :heap_allocated_pages=>745, :heap_sorted_length=>1003, :heap_allocatable_pages=>258, :heap_available_slots=>304407, :heap_live_slots=>246068, :heap_free_slots=>58339, :heap_final_slots=>0, :heap_marked_slots=>246064, :heap_eden_pages=>745, :heap_tomb_pages=>0, :total_allocated_pages=>745, :total_freed_pages=>0, :total_allocated_objects=>1272102, :total_freed_objects=>1026034, :malloc_increase_bytes=>26608, :malloc_increase_bytes_limit=>16777216, :minor_gc_count=>38, :major_gc_count=>10, :compact_count=>0, :read_barrier_faults=>0, :total_moved_objects=>0, :remembered_wb_unprotected_objects=>784, :remembered_wb_unprotected_objects_limit=>1568, :old_objects=>244077, :old_objects_limit=>488154, :oldmalloc_increase_bytes=>26608, :oldmalloc_increase_bytes_limit=>16777216}"

2. opening folder of 10k files
3. 2nd opening a interactive apps form. this is before and after a GC.start. this is with 184MB RSS

App 3289729 output: [2024-09-27 09:58:15 +0200 ] ERROR "GC.stat {:count=>96, :time=>1118, :heap_allocated_pages=>2324, :heap_sorted_length=>2324, :heap_allocatable_pages=>0, :heap_available_slots=>949588, :heap_live_slots=>948724, :heap_free_slots=>864, :heap_final_slots=>0, :heap_marked_slots=>639940, :heap_eden_pages=>2324, :heap_tomb_pages=>0, :total_allocated_pages=>2324, :total_freed_pages=>0, :total_allocated_objects=>8922494, :total_freed_objects=>7973770, :malloc_increase_bytes=>695712, :malloc_increase_bytes_limit=>16777216, :minor_gc_count=>82, :major_gc_count=>14, :compact_count=>0, :read_barrier_faults=>0, :total_moved_objects=>0, :remembered_wb_unprotected_objects=>757, :remembered_wb_unprotected_objects_limit=>1514, :old_objects=>638860, :old_objects_limit=>859374, :oldmalloc_increase_bytes=>9324056, :oldmalloc_increase_bytes_limit=>19737900}" App 3289729 output: [2024-09-27 09:58:15 +0200 ] ERROR "GC.stat {:count=>97, :time=>1216, :heap_allocated_pages=>2324, :heap_sorted_length=>2324, :heap_allocatable_pages=>0, :heap_available_slots=>949588, :heap_live_slots=>339826, :heap_free_slots=>609762, :heap_final_slots=>0, :heap_marked_slots=>339816, :heap_eden_pages=>2324, :heap_tomb_pages=>0, :total_allocated_pages=>2324, :total_freed_pages=>0, :total_allocated_objects=>8922579, :total_freed_objects=>8582753, :malloc_increase_bytes=>18664, :malloc_increase_bytes_limit=>16777216, :minor_gc_count=>82, :major_gc_count=>15, :compact_count=>0, :read_barrier_faults=>0, :total_moved_objects=>0, :remembered_wb_unprotected_objects=>753, :remembered_wb_unprotected_objects_limit=>1506, :old_objects=>338211, :old_objects_limit=>676422, :oldmalloc_increase_bytes=>18664, :oldmalloc_increase_bytes_limit=>19350882}"


so a lot of pages being used, and not released ever (or at a very slow rate. and GC does not clean based on age, but based on iteration. the decrease rate is too slow anyway to trigger a meaningful decrease in RSS with series of `GC.start` calls)

if you wonder why, well every string is an object 9so at least on page), so all values of the `@files` hash, and the json are created objects. (the keys in the ood code are mostly symbols so that is ok). to build the json with 11 values per file, jbuilder iterates of `@files`. this is a posixpath ls call, and code is https://github.com/OSC/ondemand/blob/master/apps/dashboard/app/models/posix_file.rb#L100

path.each_child.map do |child_path|
  PosixFile.new(child_path)
end.select(&:valid?)
    .map(&:to_h)
    .sort_by { |p| p[:directory] ? 0 : 1 }



there is no `lazy` being used here, so if i understand this correctly it makes a new object for each chained method. it starts with list of PosixFile instances (not sure how many pages per instance), then selects these (i assume no new pages here for each file, then creates hash of each file (9 values), and then sorted. building the json from the sorted list of hashes probably reuses the unmodified values from the hash, but adds at least 4 new ones

anyway, my analysis is probably not 100% correct ;), but for the purpose of building the json, looping over the files from the `ls` is a lot of overhead imho. i'll see if i can add a ls_json that creates the required json data in posix_file based on the stat data directly, and that can be used in jbuild directly

CSC-swesters commented 2 months ago

@CSC-swesters wrt the utf8 issues, this might be the problem that the pun code runs with LANG=C. see #3644

That's not really going to help. If you have bytes which cannot be decoded as valid text, changing the LANG variable would probably not help much. UTF-8 should be used, but it's completely trivial to create a byte sequence that cannot be decoded as UTF-8 text, which OOD then needs to handle. See for example issue #2624

Slurm doesn't enforce UTF-8 in its job names, so the analogy to Linux file names is a good one, but the solution in #3644 is not the correct one. This here is the correct solution, which allows you to get some readable text representation out of a byte sequence with non-valid UTF-8 bytes in it.

stdweird commented 2 months ago

@CSC-swesters ah, i misread the "invalid utf-8" part. do these files have a valid encoding (at all), or are these just corrupt filenames.

CSC-swesters commented 2 months ago

do these files have a valid encoding (at all)

No, but they don't need to. Linux handles file names as bytes, so applications should do that too.

are these just corrupt filenames.

No, they're not corrupt, that's the thing I wanted to point out in my original comment :slightly_smiling_face: Just because we cannot read them, it doesn't mean that it isn't a valid file name for Linux, so it needs to be supported or at least not cause crashes and problems.

stdweird commented 2 months ago

we are deviating from the original issue, but if you don't use correct encoding, you will get garbage. so when ood converts the stat binary data to string or to json, it should do that correctly; and send the correct encoding with the json so the browser can also do something with it. i guess something goes wrong somewhere in that chain. maybe you can send the binary stat data straight away to client in some other protocol, but the original encoding needs to be known. if you use a locale in linux that is not utf8 (or even C), ood won't pick this up by accident.

CSC-swesters commented 2 months ago

we are deviating from the original issue,

Yes, sorry for that.

but if you don't use correct encoding, you will get garbage. so when ood converts the stat binary data to string or to json, it should do that correctly; and send the correct encoding with the json so the browser can also do something with it. i guess something goes wrong somewhere in that chain.

Either OOD would need to drop invalid UTF-8 entries on the server side, or we could base64 encode any problematic file names if we really want to send them to the client. This quickly becomes a user experience problem in that users won't know there's a file with an unprintable name if we simply drop it. And base64 is not very user-friendly, even if it's technically usable.

There's a separate issue open for discussing the user interface-side of "special" files (perhaps ones with unprintable file names could be added there?) over in #3026

stdweird commented 2 months ago

in python you can set fallback mechanism for string encoding, but this is better handled by separate code so it's clear that the reported name is not really the correct one. the json now ships a unique id based on the inode, so that could also be used to avoid overlap/conflict in naming by adding something in the client code, and then something like https://ruby-doc.org/3.3.5/String.html#method-i-unicode_normalize can be used to generate the best possible name. ood files app maybe should support setting the encoding in the client/browser side, so if people know that a directory has eg UTF64 names, they can try to force this via the client (so the stat data is encoded in the ruby server code with user selected UTF64, and the received json is also handled appropriately in javascript (i would guess by passing the correct encoding in the response header might be enough?))

CSC-swesters commented 2 months ago

there is no lazy being used here, so if i understand this correctly it makes a new object for each chained method. it starts with list of PosixFile instances (not sure how many pages per instance), then selects these (i assume no new pages here for each file, then creates hash of each file (9 values), and then sorted.

Good find, it sounds like this could help with memory management :+1: I suppose the map and select operations are possible to do as lazy operations, based on the Enumerator::Lazy documentation. Only for the sort_by() call would you need to resolve the whole chain of operations.

stdweird commented 2 months ago

@CSC-swesters i am not an expert (i read here and there that chaining should "just" pass things around, but indeed the sort requires the whole sequence (although i am not sure it's relevant for the json output, as it re-sorts in the webclient i thiink. should be easy to try.

OSC / ondemand

files app: large number of files in directory issues #3801