JudeWells / chainsaw

MIT License

Zip extraction performance #15

Closed alex-hh closed 1 year ago

alex-hh commented 1 year ago

Would it be faster to operate directly on the proteome files, since in that case we could simply use unzip? If so, by how much?

The partitioned zip index files contain 10,000 AF structures each (there are 18,897 partitions). At one prediction per second, that is about 167 minutes of prediction per partition, so spending 30 minutes on unzipping is a substantial fraction of the overall cost.
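To make that estimate concrete, here is a rough sketch of the arithmetic. The one-prediction-per-second rate and the 30-minute unzip time are the assumptions stated above, not measured prediction timings, and `partition_cost` is just an illustrative helper name.

```python
# Assumptions taken from the comment above: 10,000 structures per partition,
# one prediction per second, roughly 30 minutes to unzip one partition.
STRUCTURES_PER_PARTITION = 10_000
SECONDS_PER_PREDICTION = 1.0
UNZIP_MINUTES = 30.0


def partition_cost(n_structures, sec_per_prediction, unzip_minutes):
    """Return (prediction minutes, unzip share of total wall time)."""
    predict_minutes = n_structures * sec_per_prediction / 60
    unzip_share = unzip_minutes / (predict_minutes + unzip_minutes)
    return predict_minutes, unzip_share


predict_minutes, unzip_share = partition_cost(
    STRUCTURES_PER_PARTITION, SECONDS_PER_PREDICTION, UNZIP_MINUTES
)
print(f"prediction: {predict_minutes:.1f} min per partition")
print(f"unzip share of total wall time: {unzip_share:.1%}")
```

Under these assumptions, unzipping accounts for roughly 15% of the per-partition wall time. The share grows if predictions run faster than one per second.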

Are these partitioned index files processed in any important way? (E.g. I guess they might contain unique chains only?)

alex-hh commented 1 year ago

/usr/bin/time output for zip_extract on first partition:

1195.45user 10.63system 31:52.98elapsed 63%CPU (0avgtext+0avgdata 26264maxresident)k 4226304inputs+242960outputs (1major+1424318minor)pagefaults 0swaps

ChatGPT interpretation:

Sure, I'd be happy to explain. When you call /usr/bin/time, it provides you with some statistics about the program you just ran. Here's what each piece of your output means:

1195.45user: This is the amount of CPU time spent in user-mode within the application. This time includes executing the code of the program itself. This is reported in seconds.

10.63system: This is the amount of CPU time that the system spent on behalf of the application (such as executing system calls). This is also reported in seconds.

31:52.98elapsed: This is the real time or 'wall clock time' that the program took to run, from start to finish. This includes both user time, system time and time spent waiting (like waiting for disk I/O or network responses). The format is minutes:seconds.hundredths-of-seconds.

63%CPU: This is the percentage of the CPU's time that the process used, computed as (user time + system time) / (elapsed real time). Here that is (1195.45 + 10.63) / 1912.98 seconds, or about 63%. This could be over 100% for multithreaded programs running on multiprocessor systems.

(0avgtext+0avgdata 26264maxresident)k: This represents the memory usage. In this case, 0avgtext and 0avgdata represent the average size of the text and data segments in Kilobytes. 26264maxresidentk indicates the maximum resident set size of the process during its lifetime, measured in Kilobytes.

4226304inputs+242960outputs: This shows the number of disk operations. Inputs are block input operations (like reading from a disk) and outputs are block output operations (like writing to a disk). The numbers are in units of disk blocks.

(1major+1424318minor)pagefaults: Page faults occur when a process tries to access a part of its virtual memory space that is not currently mapped. Major page faults require reading the page from disk (which is slow), for example from a file or from swap. Minor page faults are resolved without disk access, for example when the page is already in memory and only needs to be mapped into the process. In this case, it shows 1 major and 1424318 minor page faults.

0swaps: The number of times the process was swapped out of main memory. If this number is high, it might indicate that your system is low on memory and had to move this process to the swap file on disk, which can significantly slow down your process. In this case, the process wasn't swapped out at all. Note that on Linux this counter is not maintained by the kernel, so it is reported as 0 regardless.
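As a sanity check on the 63%CPU figure, a small sketch that recomputes it from the user, system and elapsed fields of the output above. The helper name `cpu_percent` is made up for illustration, and the parsing assumes the default GNU time summary format shown in this comment.

```python
import re

# First line of the /usr/bin/time output quoted above.
TIME_OUTPUT = "1195.45user 10.63system 31:52.98elapsed 63%CPU"


def cpu_percent(time_line):
    """Recompute %CPU as (user + system) / elapsed; also return elapsed seconds."""
    user = float(re.search(r"([\d.]+)user", time_line).group(1))
    system = float(re.search(r"([\d.]+)system", time_line).group(1))
    # Elapsed is m:ss.cc here; folding the parts also handles an h:mm:ss form.
    elapsed_field = re.search(r"([\d:.]+)elapsed", time_line).group(1)
    elapsed = 0.0
    for part in elapsed_field.split(":"):
        elapsed = elapsed * 60 + float(part)
    return 100 * (user + system) / elapsed, elapsed


percent, elapsed_seconds = cpu_percent(TIME_OUTPUT)
print(f"elapsed: {elapsed_seconds:.2f} s, CPU: {percent:.0f}%")
```

The remaining share of wall time (about 37%) was not spent on the CPU, which would be consistent with the process waiting on disk I/O, although the output alone does not confirm the cause.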

alex-hh commented 1 year ago

@JudeWells any thoughts on this?

alex-hh commented 1 year ago

There are far more proteome files (1,015,797) than partition files. However, the overhead of starting a script is smaller than the overhead of extracting named files from the zip files.
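For comparison, a minimal sketch of extracting a list of named members from a single archive with Python's standard `zipfile` module, opening the archive once so that its index is read a single time. The function name `extract_named` and its arguments are hypothetical, and this is not necessarily how zip_extract works in this repo.

```python
import zipfile
from pathlib import Path


def extract_named(zip_path, names, out_dir):
    """Extract only the requested members from one archive, opened once."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        available = set(archive.namelist())
        for name in names:
            if name in available:  # silently skip names missing from this archive
                extracted.append(archive.extract(name, path=out_dir))
    return extracted
```

Timing a call such as `extract_named("partition.zip", wanted_names, "out/")` against a plain `unzip` of a proteome file would give a direct answer to the question at the top of this thread. The file names here are placeholders.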

alex-hh commented 1 year ago

If we have thousands of CPUs, this becomes pretty insignificant:

~10,000 hours total (~20,000 partitions x ~0.5 hours of extraction each) = 10 hours on each of 1,000 CPUs
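The same back-of-the-envelope calculation as a short sketch, using the rounded figures above (about 20,000 partitions, about half an hour of extraction each). It assumes the work splits evenly across CPUs with no shared I/O bottleneck, which is an optimistic simplification.

```python
# Rounded figures from this thread: 18,897 partitions rounded up to 20,000,
# and the ~32 minute extraction time rounded to 0.5 hours.
N_PARTITIONS = 20_000
HOURS_PER_PARTITION = 0.5
N_CPUS = 1_000


def extraction_hours(n_partitions, hours_per_partition, n_cpus):
    """Return (total CPU-hours, hours per CPU assuming an even split)."""
    total = n_partitions * hours_per_partition
    return total, total / n_cpus


total_hours, hours_per_cpu = extraction_hours(
    N_PARTITIONS, HOURS_PER_PARTITION, N_CPUS
)
print(f"total: {total_hours:.0f} h, per CPU: {hours_per_cpu:.0f} h")
```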