OSC / ondemand (https://openondemand.org/)

files app: large number of files in directory issues #3801

Open stdweird opened 1 week ago

stdweird commented 1 week ago

Using OnDemand 3.1.7, we have two kinds of issues when using the Files app: slowness and excessive memory usage.

The memory issue seems to be a "feature" of jbuilder caching the JSON structures somehow. The slowness and the amount of memory used result (imho) from the overly verbose JSON being generated.

In particular, three URL strings are generated for every file, but it would be better to generate them on the browser end. Each URL is just the directory prefix concatenated with the file name, and the download URL may add a fixed suffix as well.

I think I can modify the jbuilder code to create a lighter JSON, but I am stuck on the JavaScript/templating that happens on the browser side. In particular, in _file_action_menu.html.erb the file data is somehow passed as data, but I don't know where that comes from and/or how it is evaluated (or generated). Can we access the files' JavaScript variables inside the {{data.something}} templates, or can we manipulate whatever data comes in before the templating happens? (Why is it not pure JavaScript? At least that might be more consistent and easier to read. ;)
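
A rough sketch of what I mean on the jbuilder side (the helper and field names here are made up, not the actual OnDemand template):

```ruby
# Hypothetical slimmer files.json.jbuilder: emit the shared URL prefix
# once and only raw per-file fields, so the browser can assemble the
# three URLs itself. `files_url_prefix` is an assumed helper.
json.url_prefix files_url_prefix(@path)
json.files @files do |file|
  json.name file.name
  json.size file.size
  json.directory file.directory?
  json.modified_at file.modified_at
  # no per-file view/edit/download URLs emitted here
end
```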

anyway, help welcome

stdweird commented 6 days ago

I did some more measurements, and I'm not sure what to think of them.

The size of the JSON is not the real issue: e.g. 10k files result, with this very verbose JSON, in only a 4.5MB download (it takes 10s to create and download the JSON). The bigger issue is that a Firefox tab showing this then takes 700MB of RAM and is slow to render. Everything seems to scale nicely and linearly.

Except for my real problem: the Ruby dashboard process on the OnDemand server grows from 150MB idle to 180MB. That is still nothing to worry about, but there is no way to release this memory. If I then also open another folder with 30k files, the dashboard process jumps to 330MB, again without releasing the memory. Larger folders mean more memory usage, added on top and never released. Adding files to a folder that was already opened increases the memory with the number of extra files (e.g. adding 1k files to a 30k-file folder). Renaming a folder that was opened before triggers an increase as well, but not as much as opening it from scratch.

@johrstrom wouldn't it make more sense to return only e.g. the first 1-2k files in the JSON and then show some warning (and maybe provide a URL argument like ?showall=1 so users can bypass it if they know what they are doing)?
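
Roughly along these lines on the controller side (a sketch only; `MAX_FILES` and the parameter handling are made up, not existing OnDemand code):

```ruby
# Hypothetical cap on the number of entries returned per listing.
MAX_FILES = 2_000

def entries_to_render(entries)
  return entries if params[:showall] == "1" || entries.size <= MAX_FILES

  @truncated = true # the view could render a "listing truncated" warning
  entries.first(MAX_FILES)
end
```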

johrstrom commented 6 days ago

Yeah, seems like we need to paginate very large directories.

stdweird commented 19 hours ago

More debugging revealed the memory issue: generating the large files JSON causes Ruby to create a lot of objects (I guess 50 to 80 per file). These are recycled by the Ruby GC but not released from memory; calling GC.start and watching GC.stat shows this. There is no real way around this, and e.g. a 1k-file pagination will keep it all under control.
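
Roughly how I measured it (a sketch; `render_files_json` is a made-up stand-in for the actual jbuilder rendering of a directory):

```ruby
# Count how many objects one rendering allocates per file.
GC.start
before = GC.stat(:total_allocated_objects)
render_files_json(entries)
after = GC.stat(:total_allocated_objects)
puts "objects per file: #{(after - before) / entries.size}" # ~50-80 here
```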

stdweird commented 17 hours ago

@johrstrom reading a bit more on pagination in DataTables, I assume there is no "easy" fix for this, from what I can see.

To work around the memory issue on the server side, I might have a solution, but it also needs some extra code: if we construct a single large text blob in e.g. CSV format and send that as the files data (instead of a list of dicts), Ruby doesn't have to create a lot of separate objects. The CSV format will also be more compact than the current JSON, which is a nice benefit. Strangely enough, you can't read CSV text in DataTables directly, but converting the CSV to JSON in the browser is not that hard (searching online turned up some fairly short JavaScript examples).
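
A minimal sketch of the single-blob idea (the column layout is illustrative only, and quoting/escaping is deliberately ignored here):

```ruby
# Append into one growing String instead of building an array of
# per-file hashes, so Ruby allocates far fewer intermediate objects.
blob = +"" # unfrozen, mutable string literal
entries.each do |f|
  blob << f.size.to_s << "," << f.mtime.to_i.to_s << "," << f.name << "\n"
end
```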

CSC-swesters commented 17 hours ago

@stdweird I'm following your debugging with interest, thanks for looking into this!

construct a single large text blob in e.g. CSV format and send that as the files data (instead of a list of dicts)

I think there's a risk of painting ourselves into a corner here with file names containing all sorts of characters. Since Linux file names can contain any byte except the NUL byte ('\0') and the forward slash (/), the CSV library that is used must be able to escape characters that are part of file names, so that they aren't interpreted as field separators or quotation characters belonging to the CSV format, for example.

I wanted to raise this concern, to be sure that you've considered it.
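
For what it's worth, Ruby's stdlib CSV does handle this kind of quoting (though at the cost of extra per-field objects), e.g.:

```ruby
require "csv"

# Fields containing separators, quotes or newlines are quoted, with
# embedded quotes doubled:
CSV.generate_line([1024, %(a,"tricky\nname)])
# => "1024,\"a,\"\"tricky\nname\"\n"
```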

For the record, files with invalid UTF-8 bytes in their names are currently discarded by OOD on the server side (see PR #2626), so these are not shipped to the client side.

stdweird commented 16 hours ago

@CSC-swesters I have not considered anything yet ;) To generate the CSV format we should indeed be aware of Unicode issues and test for them; we should be careful. If we put the filename in the last column, the splitting should be more robust (IF we can get rid of the URLs that also contain the name). To be clear: to avoid the flood of objects being created, we probably need to generate the string ourselves, not use some Ruby gem for it.
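
A quick illustration of the last-column idea (the column layout is hypothetical): with a fixed number of leading fields, the client splits on a limited number of separators and takes the remainder verbatim, so commas and quotes in names need no quoting at all:

```ruby
# Split only on the first two commas; everything after them is the raw
# file name. Newlines in names would still need separate handling,
# since rows are split on "\n".
line = %(1024,1715000000,name, with commas and "quotes")
size, mtime, name = line.split(",", 3)
name # => "name, with commas and \"quotes\""
```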

johrstrom commented 11 hours ago

At some point we'd likely rewrite this to use turbo_streams instead of JSON responses.

These are recycled by the Ruby GC but not released from memory.

Isn't that the difference between free and available memory in Linux? I.e., these addresses are available but not free?
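
Something like the Ruby heap analogue of that (a rough sketch, nothing OnDemand-specific):

```ruby
# After a GC run the slots are free for reuse inside this process...
GC.start
puts "free slots:      #{GC.stat(:heap_free_slots)}"
# ...but the heap pages backing them usually stay resident, which is
# why the dashboard's RSS does not shrink.
puts "allocated pages: #{GC.stat(:heap_allocated_pages)}"
```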