HumanCellAtlas / dcp-cli

DEPRECATED - HCA Data Coordination Platform Command Line Interface
https://hca.readthedocs.io/
MIT License

hca upload files using a lot of cpu/memory #358

Open malloryfreeberg opened 5 years ago

malloryfreeberg commented 5 years ago

I was using hca upload files * to upload about 80GB of fastq files (16 files) from my local machine to an upload area. During the transfer, I experienced significant slowdown of everything else running on my machine. I don't remember experiencing this slowdown before, although I haven't had to transfer files from a local source in a while. It looks like my machine was maxed out on CPU usage (screenshots below). Wondering if this is normal or expected behavior? It doesn't seem ideal...

During transfer: [screenshot: Screen Shot 2019-06-10 at 09 54 25]

After transfer: [screenshot: Screen Shot 2019-06-10 at 13 45 17]

sampierson commented 5 years ago

I'm guessing this is because the CLI is now doing client-side checksumming? @maniarathi do you stream the file while checksumming, or attempt to read it into memory? Hmm, it looks like the file is streamed in chunks of get_s3_multipart_chunk_size. I wonder how much the memory balloons. Someone should attempt to reproduce this; unfortunately it can't be me, as I'm on a low-bandwidth link.
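For reference, a minimal sketch of what chunked, streaming checksumming looks like. Only get_s3_multipart_chunk_size is a name from the CLI; checksum_file and the single sha256 digest are illustrative assumptions (the real uploader may compute several digests per file):

```python
import hashlib

def checksum_file(path, chunk_size):
    """Hypothetical sketch: stream a file through a hash in fixed-size chunks.

    Memory stays at roughly one chunk (e.g. 64MB), but the hashing itself
    is CPU-bound, so each file in flight keeps a core busy.
    """
    digest = hashlib.sha256()  # assumption: the real CLI may compute several digests
    with open(path, "rb") as fh:
        while True:
            # chunk_size would come from the CLI's get_s3_multipart_chunk_size
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()
```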

maniarathi commented 5 years ago

So I did actually test the memory footprint of this a while back, and the memory was 64MB, which is what's expected given that it streams the file in chunks of that size.

sampierson commented 5 years ago

@malloryfreeberg how much memory was consumed? Alas, your Activity Monitor screenshots don't show that.

sampierson commented 5 years ago

As for CPU, I expect that simultaneous checksumming of several files will be quite CPU intensive. Does it limit parallelization? It looks like it does, based on the number of cores you have: DEFAULT_THREAD_COUNT = multiprocessing.cpu_count() * 2. On my machine cpu_count() returns 8, so it is trying to checksum all 16 files simultaneously. That's a bad thing.
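Spelling out the consequence (the DEFAULT_THREAD_COUNT line is quoted from the CLI; the comment below is just the arithmetic):

```python
import multiprocessing

DEFAULT_THREAD_COUNT = multiprocessing.cpu_count() * 2
# cpu_count() == 8  ->  16 worker threads, so a 16-file upload checksums
# every file at the same time and pegs all 8 cores.
```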

There are several ways to fix this:

1. Parallelize less aggressively (reduce the default thread count).
2. Provide a command line option to limit parallelism further.
3. Calculate checksums in-line while uploading, which would limit parallelism based on your available bandwidth.

I realize #3 doesn't work well with the current architecture, as client-side and server-side checksums are compared before the upload starts. I wish there were a more efficient way to decide whether or not to upload. We should probably do #1 and #2.
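A minimal sketch of what #1 and #2 could look like, assuming a ThreadPoolExecutor-style upload loop; the --threads flag and the upload_files/checksum_and_upload names here are hypothetical, not the CLI's real API:

```python
import argparse
import multiprocessing
from concurrent.futures import ThreadPoolExecutor

# 1) Parallelize less aggressively: don't scale past the core count.
DEFAULT_THREAD_COUNT = min(multiprocessing.cpu_count(), 8)

# 2) Let the user limit parallelism further from the command line.
parser = argparse.ArgumentParser()
parser.add_argument("files", nargs="+")
parser.add_argument("--threads", type=int, default=DEFAULT_THREAD_COUNT,
                    help="maximum number of files to checksum/upload in parallel")

def checksum_and_upload(path):
    ...  # checksum the file in chunks (as above), then upload it

def upload_files(paths, thread_count):
    # At most thread_count files are checksummed and uploaded at once.
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        futures = [pool.submit(checksum_and_upload, path) for path in paths]
        return [f.result() for f in futures]

if __name__ == "__main__":
    args = parser.parse_args()
    upload_files(args.files, thread_count=args.threads)
```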

malloryfreeberg commented 5 years ago

@sampierson @maniarathi I unfortunately did not grab memory usage during this time. I can reproduce, but I'll have to download the files to my local machine again :( Stay tuned!

sampierson commented 5 years ago

@malloryfreeberg Don't bother. I think we know what the culprit is: the problem is CPU, not memory.