This branch adds a file-based routine to help users find the most I/O-intensive files. It lives in file_stats.py, in the PyDarshan CLI tools.
This branch also adds test_file_stats.py to the PyDarshan tests to cover file_stats.py.
The tool combines the data from multiple log files into a single DataFrame, groups the data by “id”, sorts it in descending order by the column the user names, and keeps the first n records (n is number_of_rows from user input). It returns a DataFrame with the n most I/O-intensive files.
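The combine, group, sort, and truncate steps can be sketched with plain pandas. This is a minimal illustration, not the code in file_stats.py: the helper name `top_files`, the toy input tables, and the choice to sum the counters within each “id” group are all assumptions made for the example.

```python
import pandas as pd


def top_files(frames, order_by_colname="POSIX_BYTES_READ", number_of_rows=10):
    """Hypothetical helper: rank files across several per-log record tables."""
    # Stack the per-log tables into one DataFrame.
    combined = pd.concat(frames, ignore_index=True)
    # Collapse records that share a file "id"; summing is an assumption here.
    grouped = combined.groupby("id", as_index=False).sum(numeric_only=True)
    # Sort by the requested column, largest first, and keep the first n rows.
    ranked = grouped.sort_values(order_by_colname, ascending=False)
    return ranked.head(number_of_rows)


# Toy stand-ins for the record tables of two log files.
log_a = pd.DataFrame(
    {"id": [1, 2], "POSIX_BYTES_READ": [100, 50], "POSIX_BYTES_WRITTEN": [0, 10]}
)
log_b = pd.DataFrame(
    {"id": [2, 3], "POSIX_BYTES_READ": [200, 10], "POSIX_BYTES_WRITTEN": [5, 0]}
)

result = top_files([log_a, log_b], number_of_rows=2)
print(result)
```

In this toy input, file 2 appears in both logs, so its counters are added together before ranking.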
User input consists of log_path, module, order_by_colname, and number_of_rows. The command-line arguments are named arguments.
log_path should be a list of files or a shell glob.
The default values for module, order_by_colname, and number_of_rows are “POSIX”, “POSIX_BYTES_READ”, and 10, respectively. The tool uses these defaults for any value the user does not supply.
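For reference, the inputs and defaults described above could be declared with argparse roughly as follows. This is a sketch under assumptions: the short flags -m, -o, and -n are taken from the example commands in this description, while the long option names, the positional form of log_path, and the function name `build_parser` are guesses for illustration and may not match file_stats.py.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the inputs described in this PR."""
    parser = argparse.ArgumentParser(prog="file_stats")
    # One or more log files; a shell glob expands to such a list.
    parser.add_argument("log_path", nargs="+")
    # Long option names below are assumptions; defaults come from the description.
    parser.add_argument("-m", "--module", default="POSIX")
    parser.add_argument("-o", "--order_by_colname", default="POSIX_BYTES_READ")
    parser.add_argument("-n", "--number_of_rows", type=int, default=10)
    return parser


# Only the log path is given, so every other value falls back to its default.
args = build_parser().parse_args(["worker.darshan"])
print(args.module, args.order_by_colname, args.number_of_rows)
```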
The tool checks whether the module is in the list of modules. If not, it prints an error message and exits immediately.
order_by_colname should be “{mod}_BYTES_READ” or “{mod}_BYTES_WRITTEN”.
The tool also checks that the order_by_colname the user inputs is consistent with the module. For example, if the module is POSIX and order_by_colname is STDIO_BYTES_READ, there will be an error: “Column name should be ‘{mod}_BYTES_READ’ or ‘{mod}_BYTES_WRITTEN’”.
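The two checks can be illustrated with a small standalone function. This is only a sketch: the function name `validate_inputs`, the `available_modules` argument, and the wording of the module error are hypothetical, and the real tool prints and exits rather than returning a message.

```python
def validate_inputs(module, order_by_colname, available_modules):
    """Hypothetical check: return an error message, or None if inputs are valid."""
    # The module must be one of the known modules.
    if module not in available_modules:
        # Message wording is an assumption for this sketch.
        return f"Module {module} is not in the list of modules"
    # The sort column must be one of the two byte counters of that module.
    allowed = (f"{module}_BYTES_READ", f"{module}_BYTES_WRITTEN")
    if order_by_colname not in allowed:
        return (
            f"Column name should be '{module}_BYTES_READ' "
            f"or '{module}_BYTES_WRITTEN'"
        )
    return None


modules = ["POSIX", "STDIO"]
ok = validate_inputs("STDIO", "STDIO_BYTES_READ", modules)
mismatch = validate_inputs("POSIX", "STDIO_BYTES_READ", modules)
unknown = validate_inputs("MPI-IO", "MPI-IO_BYTES_READ", modules)
print(ok, mismatch, unknown, sep="\n")
```

A caller following the behavior described above would print any returned message and exit immediately.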
Example usage:
$ python -m darshan file_stats darshan_logs/nonmpi_workflow/worker.darshan -m STDIO -o STDIO_BYTES_READ -n 5
$ python -m darshan file_stats darshan_logs/nonmpi_workflow/worker.darshan
$ python -m darshan file_stats darshan_logs/nonmpi_workflow/worker_1.darshan darshan_logs/nonmpi_workflow/worker_3.darshan -m STDIO -o STDIO_BYTES_READ -n 5