Hi @orbeckst,

I've created separate git repositories for each cluster since things were getting a little too messy pushing and pulling from 4 different sources. I've linked the code I used to benchmark on Comet and an example notebook of how I took the averages.

If you get a chance, can you please scan over it? I've been over it many times and don't think there are any errors, but a second set of eyes would give me peace of mind. We can also go over it on Wednesday, so no worries either way.

Here's the actual benchmark function: https://github.com/edisj/Comet/blob/6662ba7a50d01018d58b3b979f3f45e48829e545/benchmarks/1-full_IO/scripts/full_IO_bench.py#L23-L122

I can't figure out how to link specific lines in the Jupyter notebook, so here's a link to the whole notebook: https://github.com/edisj/Comet/blob/main/benchmarks/example_analysis.ipynb

Can you please look at the functions reduce_to_means() and all_process_dataframe()?

reduce_to_means() loads the raw data arrays into the _dict dictionary, then goes through each repeat and takes the average across all ranks. It then takes the mean and standard deviation across the repeats.
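For reference, the two-stage reduction described above can be sketched roughly as follows. This is not the notebook's code: the function name, the (n_ranks, n_timings) array shape, and the example numbers are assumptions made only for illustration.

```python
import numpy as np


def reduce_to_means_sketch(repeats):
    """Hypothetical helper: `repeats` is a list of arrays shaped (n_ranks, n_timings)."""
    # Stage 1: average over ranks within each repeat -> shape (n_repeats, n_timings)
    per_repeat = np.array([np.mean(r, axis=0) for r in repeats])
    # Stage 2: mean and standard deviation across repeats
    # (np.std defaults to the population standard deviation, ddof=0)
    return per_repeat.mean(axis=0), per_repeat.std(axis=0)


# Two made-up repeats, each with 2 ranks and 2 timed quantities
repeats = [
    np.array([[1.0, 10.0], [3.0, 14.0]]),
    np.array([[2.0, 11.0], [4.0, 17.0]]),
]
means, stds = reduce_to_means_sketch(repeats)
```

Averaging over ranks first and only then taking statistics over repeats means the reported standard deviation reflects run-to-run variation, not rank-to-rank variation.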

all_process_dataframe() initializes an (N_process x timings) matrix, and fills in each row by using the reduce_to_means() function to get the averaged times for each N process run.
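A minimal sketch of that table-building step, assuming the averaged timings for each process count are already available as 1D arrays. The function name, the timing column names, and the numbers are hypothetical and not taken from the notebook.

```python
import numpy as np
import pandas as pd


def all_process_dataframe_sketch(mean_times_by_n, timing_names):
    """Hypothetical helper: `mean_times_by_n` maps N_processes -> 1D array of mean timings."""
    n_values = sorted(mean_times_by_n)
    # One row per process count, one column per timed quantity
    matrix = np.zeros((len(n_values), len(timing_names)))
    for row, n in enumerate(n_values):
        matrix[row] = mean_times_by_n[n]
    df = pd.DataFrame(matrix, columns=timing_names)
    # Keep the process count as an ordinary integer column
    df.insert(0, "N_processes", n_values)
    return df


df = all_process_dataframe_sketch(
    {4: np.array([0.5, 2.0]), 1: np.array([1.5, 8.0])},
    ["open_traj", "read_frames"],
)
```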

Once I have the data averaged in a nice table, I plot the timings by extracting the columns I'm interested in.

Thank you! Edis

README

add a README saying how to install (in particular, which MDA repo and branch to pull, notes on how you built your environment)
add a note as to which version of MDAnalysis to use, in particular which branch from your own repo (because you use code with timing information inside the reader)

full_IO_bench.py

add a note as to which version of MDAnalysis to use, in particular which branch from your own repo (because you use code with timing information inside the reader) – comment in the README and next to https://github.com/edisj/Comet/blob/6662ba7a50d01018d58b3b979f3f45e48829e545/benchmarks/1-full_IO/scripts/full_IO_bench.py#L1
also time https://github.com/edisj/Comet/blob/6662ba7a50d01018d58b3b979f3f45e48829e545/benchmarks/1-full_IO/scripts/full_IO_bench.py#L76 (closing the trajectory)
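One possible way to time the close step, sketched with a hypothetical `timed` context manager. The `FakeTrajectory` class is only a stand-in so the example runs on its own; it is not the MDAnalysis reader used in the benchmark.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


class FakeTrajectory:
    """Stand-in for the real reader; exists only to make the sketch runnable."""

    closed = False

    def close(self):
        self.closed = True


timings = {}
trajectory = FakeTrajectory()
with timed("close_trajectory", timings):
    trajectory.close()
```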

notebook

add some section headings and minimal text
I have no good sense of what your data structures look like. My general impression is that organizing everything as pd.DataFrame (or xarray) as early as possible and then operating on the df (using groupby etc.) will make it more readable and extensible.
As a general rule: once you have data in tidy data format (see the Tidy Data paper), all these operations are much easier. Basically, the rules are
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.

There's more in the paper and you can also read the Wikipedia entry on Tidy Data. The time you spend thinking about your data organization and production in the beginning is easily recouped later when you don't have to reformat your data. Think about data structures first!
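As a rough illustration of why the tidy layout pays off, here is a sketch with a made-up table that has one row per (n_processes, repeat, rank) observation. The column names and values are assumptions, not the benchmark's actual data.

```python
import pandas as pd

# Hypothetical tidy table: each row is one observation, each column one variable
records = [
    {"n_processes": 2, "repeat": 0, "rank": 0, "io_time": 1.0},
    {"n_processes": 2, "repeat": 0, "rank": 1, "io_time": 3.0},
    {"n_processes": 2, "repeat": 1, "rank": 0, "io_time": 2.0},
    {"n_processes": 2, "repeat": 1, "rank": 1, "io_time": 4.0},
]
tidy = pd.DataFrame.from_records(records)

# Average over ranks within each repeat ...
per_repeat = (
    tidy.groupby(["n_processes", "repeat"])["io_time"].mean().reset_index()
)
# ... then summarize across repeats (pandas std is the sample std, ddof=1)
summary = per_repeat.groupby("n_processes")["io_time"].agg(["mean", "std"])
```

With the data in this shape, adding another cluster or another timed quantity is a matter of adding rows or columns instead of writing new reduction code.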
There's no guarantee that the order of values in for i, array in enumerate(_dict.values()): matches the repeat number. A dict only guarantees insertion order (and only from Python 3.7 on; before that you need an OrderedDict), so the index i depends on the order in which the entries were added. As far as I can tell, you might be assigning repeats in arbitrary order. This raises the question, why are you building a dict in the first place? Why not just a list where the index is already sufficient to identify the repeat? (If you don't want to work with DataFrames at this stage.)
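A small sketch of the difference, using a made-up dict keyed by repeat number (the keys and values are hypothetical). In Python 3.7+ iterating over .values() follows insertion order, which need not be the repeat order.

```python
# Hypothetical raw timings keyed by repeat number, inserted out of order
raw = {2: [3.0, 3.0], 0: [1.0, 1.0], 1: [2.0, 2.0]}


def mean(values):
    return sum(values) / len(values)


# enumerate(raw.values()) would pair index 0 with repeat 2 here
by_insertion = [mean(a) for a in raw.values()]

# Sorting the keys makes the repeat order explicit
by_repeat = [mean(raw[k]) for k in sorted(raw)]

# A plain list sidesteps the question: the index is the repeat number
repeats = [raw[k] for k in sorted(raw)]
```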
all_process_dataframe has a lot of code duplication for the different which_hpc cases. I'd try to only have one code path and parametrize it. You can then have a dispatch table dict {'Redo1': {'cores': [1], ....}, ..., 'Comet': {'cores': [1,2,4,6,8,12,16,20,24,28,32]}} that defines how you will perform each analysis. Using dispatch tables instead of multiple if/elif tends to lead to more maintainable and structured code.
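A minimal sketch of such a dispatch table. Only the 'cores' entries come from the comment above; the Redo1 entry is left with just the one value given there, and the names `HPC_SETTINGS` and `cores_for` are hypothetical.

```python
# Hypothetical dispatch table: one entry per cluster, one code path for all
HPC_SETTINGS = {
    "Redo1": {"cores": [1]},  # remaining settings were elided in the review
    "Comet": {"cores": [1, 2, 4, 6, 8, 12, 16, 20, 24, 28, 32]},
}


def cores_for(which_hpc):
    """Look up the core counts for a cluster instead of branching with if/elif."""
    try:
        return HPC_SETTINGS[which_hpc]["cores"]
    except KeyError:
        known = sorted(HPC_SETTINGS)
        raise ValueError(f"unknown cluster {which_hpc!r}; expected one of {known}") from None


comet_cores = cores_for("Comet")
```

Supporting a new cluster then only requires a new entry in the table, and a typo in which_hpc fails loudly instead of silently skipping every branch.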
The df from all_process_dataframe looks a bit odd: Why is N_processes not a normal column label? Did you introduce multi-level indices by accident or on purpose? And why is the column not integer? Why do you need to take [0] of the df?
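As an illustration only: the frame below is made up to mimic a two-level column index with N_processes stored as strings in the row index, since the notebook's actual df layout is not shown here. It sketches one way to end up with a flat table and an integer N_processes column.

```python
import pandas as pd

# Hypothetical frame with multi-level columns and a string-valued index
df = pd.DataFrame(
    [[1.5, 0.1], [0.8, 0.05]],
    index=pd.Index(["1", "4"], name="N_processes"),
    columns=pd.MultiIndex.from_tuples([("io_time", "mean"), ("io_time", "std")]),
)

flat = df.copy()
# Collapse the two column levels into single labels such as "io_time_mean"
flat.columns = ["_".join(col) for col in flat.columns]
# Turn the index into an ordinary column and make it integer
flat = flat.reset_index()
flat["N_processes"] = flat["N_processes"].astype(int)
```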

Code proofread #1
