DARMA-tasking / LB-analysis-framework

Analysis framework for exploring, testing, and comparing load balancing strategies

Number of input files detected incorrectly #426

Closed: nlslatt closed this issue 1 year ago

nlslatt commented 1 year ago

When I try to load balance a dataset with 40 ranks (with files named data.$i.json, where i runs from 0 through 39 with no zero padding), it only outputs 20 JSON files. The logger output shows that only the first 20 input files were loaded.

When I try to load balance a dataset with 128 ranks, it only loads the first 64 files and then outputs 64 files.

With an older version of LBAF that accepts n_ranks in the YAML file, all files are correctly loaded and output, so the naming convention should not be the issue.

These datasets used 2 ranks per compute node, which matches the factor of two in both cases (20 of 40 files, 64 of 128). I do not know whether LBAF could be reading the JSON header and using the number of compute nodes instead of the number of files present; the arithmetic is sketched below.
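For illustration only, a minimal sketch of that hypothesis (hypothetical arithmetic, not LBAF code):

```python
# Hypothetical illustration, not LBAF code: if n_ranks were taken from a
# per-node count in the JSON header rather than from the file count, a
# dataset written with 2 ranks per compute node would load half the files.
ranks_per_node = 2
for n_files in (40, 128):
    num_nodes = n_files // ranks_per_node  # what the header might store
    print(f"{n_files} files on disk -> only {num_nodes} loaded")  # 20, 64
```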

I am not able to reproduce this incorrect behavior on the user-defined-memory-toy-problem.

tlamonthezie commented 1 year ago


This might be related to the n_ranks auto-detection explained here: https://github.com/DARMA-tasking/LB-analysis-framework/issues/353#issuecomment-1570176987. Is there a json_data[metadata][shared_node][num_nodes] value in the data files? Is the number of ranks per compute node stored under another key in the data file? If so, should we multiply num_nodes by that value? Could you please provide one data file so we can check?
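For reference, a minimal sketch of what such an auto-detection path could look like. The key path follows json_data[metadata][shared_node][num_nodes] as quoted above, but the function name and file handling are assumptions, not actual LBAF internals:

```python
import json

def detect_n_ranks_from_header(path: str) -> int:
    """Hypothetical sketch of the suspected auto-detection path."""
    with open(path, encoding="utf-8") as f:
        json_data = json.load(f)
    # If this field holds the number of compute nodes rather than the
    # number of ranks, a 2-ranks-per-node dataset would report half the
    # actual rank count, matching the 20/40 and 64/128 symptom above.
    return json_data["metadata"]["shared_node"]["num_nodes"]
```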

tlamonthezie commented 1 year ago

We need to NOT read json_data[metadata][shared_node][num_nodes]; we should just read the file names to get n_ranks.
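A minimal sketch of filename-based detection, assuming the data.$i.json naming convention from the report; the function and its sanity checks are illustrative, not the actual fix:

```python
import os
import re

def detect_n_ranks_from_filenames(data_dir: str, stem: str = "data") -> int:
    """Count rank data files named <stem>.<i>.json (illustrative only)."""
    pattern = re.compile(rf"^{re.escape(stem)}\.(\d+)\.json$")
    ranks = [int(m.group(1))
             for name in os.listdir(data_dir)
             if (m := pattern.match(name))]
    if not ranks:
        raise FileNotFoundError(f"no {stem}.<i>.json files in {data_dir}")
    # Expect contiguous indices 0..N-1; flag gaps rather than guessing.
    n_ranks = max(ranks) + 1
    if sorted(ranks) != list(range(n_ranks)):
        raise ValueError("rank file indices are not contiguous from 0")
    return n_ranks
```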