OmegaPhil opened this issue 11 years ago
1199773 files were being hashed.
Could it be the file list? Glancing at the code, it looks like that's all read into memory and then just sits there. Might have to find a way to work on a fixed number of files at a time to keep memory use consistently low.
A really dumb calculation: 1199773 paths * 200 characters each (400 bytes per path in a really bad UTF-8 case) = ~457.7MB - which is a lot, but Python is doing far worse. I'm vaguely aware Python uses a different internal string encoding, which might be worse than UTF-8?
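For illustration (nothing to do with the actual project code), Python strings and lists carry per-object overhead on top of the raw character data, which is easy to see with `sys.getsizeof`:

```python
import sys

# A 200-character ASCII path-like string.
path = "/some/deeply/nested/directory/" + "x" * 170

# The raw character data is ~200 bytes, but the str object also carries
# a header (refcount, type pointer, length, cached hash, flags).
print(len(path), "characters")
print(sys.getsizeof(path), "bytes as a Python str object")

# Every slot in a list is an additional pointer (8 bytes on 64-bit builds),
# and lists over-allocate as they grow.
print(sys.getsizeof([path]), "bytes for a one-element list, excluding the string")
```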
Do you run into this kind of problem a lot, for reference? Ironically my smallest server has 3.32GB RAM... the RAM usage remains a joke, but I don't have a compelling case for fighting it at the moment.
> Do you run into this kind of problem a lot, for reference?
Nah. I doubt I have enough files for it to be a real problem. Could be if I had a bunch of storage hooked up to a Raspberry Pi or something I guess.
> I'm vaguely aware Python uses a different internal string encoding, which might be worse than UTF-8?
Not sure about Python internals. I know I've seen PHP pull similar shenanigans, with arrays of smallish strings taking up several times more memory than back-of-envelope calculations would suggest, so my first thought was "I bet the whole file list is read into an array at once" - and it looks like it is. I'm sure there are good/unavoidable reasons for wasting so much memory; it's probably just a dynamic/scripting language thing.
(After some googling) it looks like object size pools, general string/list overhead, and the GC only freeing memory internally rather than back to the OS (high-water-mark allocation - I've seen this with Ruby too, IIRC) could account for what's going on, if the file list is in fact the primary cause: Python memory use
So potentially extremely inefficient memory usage of a basic list... something to look into if I work on this anyway.
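If anyone does dig into it, a rough way to check how much of that is the file list itself (purely illustrative - the 1199773 count and 200-character average are just the figures from above, not measured values):

```python
import sys

N = 1199773      # file count from the report above
AVG_LEN = 200    # assumed average path length

# Build N distinct fake paths, each padded to AVG_LEN characters.
paths = [("/fake/path/%d" % i).ljust(AVG_LEN, "x") for i in range(N)]

# Size of the list object (its pointer array) plus every string object in it.
total = sys.getsizeof(paths) + sum(sys.getsizeof(p) for p in paths)
print("approx. %.1f MB for the list alone" % (total / (1024.0 * 1024.0)))

# Note this still misses allocator overhead and fragmentation, so the
# process RSS would be higher again.
```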
The only way to lower that significantly is probably to read in a fixed number of entries (maybe even just one), work on those, then read the next batch - treat it like a buffer, something like the sketch below. That'd put an upper bound on the memory used by the filenames list. Unfortunately I don't see a way to do that without restructuring the code quite a bit.
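A hypothetical sketch of the buffer idea - none of these function names come from the project. It just walks the tree lazily and hands the hashing loop a fixed-size batch of paths, so the full list never exists in memory:

```python
import hashlib
import os

def iter_files(root):
    """Yield file paths one at a time instead of building the full list."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)

def batches(iterable, size):
    """Group any iterator into lists of at most `size` items."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

def hash_tree(root, batch_size=10000):
    for batch in batches(iter_files(root), batch_size):
        # Only `batch_size` paths are ever held in memory here.
        for path in batch:
            h = hashlib.sha1()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            print("%s  %s" % (h.hexdigest(), path))
```

With a plain generator the batching layer isn't strictly necessary, but it keeps a hard upper bound on the path buffer while still leaving room for per-batch work like sorting or progress reporting.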
Had a very large hashing task on a server with 1GB RAM - Python ate ~700MB, and probably some swap too, since swap usage went from very little to ~400MB.
All it's got to do is generate a list of things to hash and then hash them - I doubt the list can justify so much wasted memory.
It would be useful to review how memory is used to see if it is doing something obviously wrong, but such a large task is extremely rare, so I don't think this issue is very important.