RolnickLab / ami-ml

Software, algorithms and documentation related to the Automated Monitoring of Insects using deep learning and other machine learning methods.
MIT License

Reduce memory usage for loading DwC-A file #37

Status: Open. Opened by adityajain07 1 month ago

adityajain07 commented 1 month ago

Suggestion by the IDT team:

You could use a "streaming" approach: use an iterator to read lines from the CSV gradually and hand them out to pool.imap_unordered() (https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.imap_unordered) as they come. At no point do you need all of the data in memory.
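A minimal sketch of that suggestion, assuming the media URLs live in a column named `identifier` in a comma-separated file (both assumptions: real DwC-A media tables may use other column names and are often tab-separated):

```python
import csv
import multiprocessing
import urllib.request

# Hypothetical worker: the "identifier" column and the output filename
# scheme are illustrative, not the repository's actual code.
def download_row(row):
    url = row["identifier"]
    filename = url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, filename)
    return url

def iter_rows(path):
    """Yield one row (as a dict) at a time; the file is never fully in memory."""
    with open(path, newline="", encoding="utf-8") as f:
        yield from csv.DictReader(f)

if __name__ == "__main__":
    # imap_unordered pulls rows from the iterator lazily instead of
    # materializing a list; a small pool is plenty for an I/O-bound job.
    with multiprocessing.Pool(processes=4) as pool:
        for url in pool.imap_unordered(download_row, iter_rows("multimedia.csv"), chunksize=64):
            print("done:", url)
```

Here `chunksize` batches rows to cut inter-process overhead; since each queued row is tiny, memory stays bounded even for millions of records.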

Another suggestion, on not requesting multiple CPUs:

A download job is network-I/O-bound: it is limited entirely by the Internet connection to the outside world. That connection is so slow that even one CPU core is massively more than enough for your needs, yet you are asking for 64, which means 63.5+ of those cores are wasted. Furthermore, with the streaming I/O approach suggested above, you should not need more than 10G of RAM, so your request for 300G of RAM is also >95% waste. You simply do not need to load all of these URLs into RAM, still less shove them into Pandas.

adityajain07 commented 1 month ago

Another comment:

CPUs != processes != parallel != faster. You have to know where your probable bottleneck is:

- You will almost certainly not get more than about 100MB/s down through our internet pipe. Our filesystems are much faster than that, so they are not the bottleneck.
- One CPU core can handle a download program and the disk I/O that 100MB/s of through-traffic generates.
- Downloading images one by one is not automatically bad. It might be bad if there is dead time between downloads (from writing out the file, selecting the next URL, or any other reason). It might therefore be worthwhile to have a handful of downloads going in parallel to saturate the network connection.
- A handful, here, is determined by that dead time, which is a property of your download code's (in)efficiency. It is not determined principally by CPU core count; in fact, it is almost completely independent of it, and it is definitely not 64. It is extremely likely that 4-8 parallel downloads on 1 CPU core will saturate the download bandwidth entirely (see the sketch below).
- One CPU core can handle almost any number of processes so long as those processes are mostly sleeping, waiting for I/O, and using negligible CPU% - which is likely the case for you.
- If you have tied the number of downloads to the number of cores, that is a mistake. Remove that tie. This has nothing to do with CPU or GPU usage efficiency: a download job is principally about moving data, and just about any single CPU core ought to be adequate for all but the highest-performance downloads on the highest-performance networks and networked filesystems.
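One way to act on this is a small thread pool whose size is fixed at "a handful" rather than tied to the core count. The sketch below assumes a hypothetical `fetch` helper and a `workers=8` default; it is illustrative, not the repository's actual code:

```python
import concurrent.futures
import urllib.request
from pathlib import Path

# Hypothetical helper: fetch one URL and write it under out_dir.
def fetch(url: str, out_dir: Path = Path("images")) -> str:
    out_dir.mkdir(exist_ok=True)
    dest = out_dir / url.rsplit("/", 1)[-1]
    with urllib.request.urlopen(url, timeout=30) as resp, open(dest, "wb") as f:
        f.write(resp.read())
    return url

def download_all(urls, workers=8):
    # 4-8 threads usually hide the per-download dead time and saturate a
    # ~100MB/s link; the threads sleep on network I/O, so one CPU core is
    # enough regardless of how many cores the job was allocated.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, u) for u in urls]
        for future in concurrent.futures.as_completed(futures):
            print("done:", future.result())  # re-raises any download error
```

Threads rather than processes fit here because the work is almost entirely waiting on the network, so the GIL is not a constraint and a single core time-slices between them easily.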

adityajain07 commented 1 month ago

Re-emphasizing: having more than one download process might help you reach the maximum bandwidth of probably 100MB/s, but adding more processes than the minimum required will only slow down every other process's download. Furthermore, since most of the time these processes will be sleeping, waiting for data to come in or out, they will not be using the CPU, which can then be time-sliced between all of them. That is why you only need one CPU core, might only need a small handful of download processes, and definitely do not need 64 cores.