DARMA-tasking / LB-analysis-framework

Analysis framework for exploring, testing, and comparing load balancing strategies

Run new performance profiler to see if there are places we can improve it #441

Closed: lifflander closed this issue 10 months ago

lifflander commented 11 months ago

https://github.com/plasma-umass/scalene

cwschilly commented 11 months ago

To use Scalene:

python3 -m pip install -U scalene

And then just

scalene <file>.py

For example, from the project directory:

scalene src/lbaf/Applications/LBAF_app.py -c config/challenging-toy-hundreds-tasks.yaml
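
If you want to keep the report around instead of only viewing it live, Scalene can write it to a file; if I remember the flags right (double-check scalene --help), something like:

scalene --html --outfile profile.html src/lbaf/Applications/LBAF_app.py -c config/challenging-toy-hundreds-tasks.yaml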

Note: Scalene offers "AI-powered optimization" with OpenAI. I only have access to GPT-3.5, so running with GPT-4 might yield different (better) suggestions.

cwschilly commented 11 months ago

challenging-toy-hundreds-tasks.yaml

Original line:

import lbaf.IO.lbsStatistics as lbstats

Suggested optimization:

# Since the lbaf module has already been imported, we can directly use the lbstats module from it without any import statement.
# Start of code
from lbaf.IO import lbsStatistics as lbstats
# End of code
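
(Side note: this particular suggestion is a no-op. Both import forms bind the same cached module object, so neither should be faster:)

import lbaf.IO.lbsStatistics as a
from lbaf.IO import lbsStatistics as b

assert a is b  # both names point at the same entry in sys.modules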

Original line (LBAF_app.py:514):

rebalanced_phase = runtime.execute(

Scalene shows that 71% of the time on this line is spent in Python (as opposed to Native or System). Unfortunately, the optimization just says:

# Proposed optimization:
# The given code does not contain any actual code to optimize. It seems to be incomplete 
# and missing the necessary logic. Please provide the complete code so that I can assist you with the optimization.

The above line was flagged in other config files, with more helpful optimizations.


Original line (LBAF_app.py:567):

visualizer = Visualizer(

Suggested optimization:

# Proposed optimization:
# The original code can be optimized by removing unnecessary function calls and accessing dictionary values directly. We can also eliminate the need for the `append` function by using a list comprehension to create the `qoi_request` list.
# Start of code
qoi_request = [
    self.__parameters.rank_qoi,
    self.__parameters.work_model.get("parameters").get("upper_bounds", {}).get(self.__parameters.rank_qoi),
    self.__parameters.object_qoi
]
# Instantiate and execute visualizer
# End of code
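
One caveat with this suggestion: the chained .get() calls assume the "parameters" key exists; if it does not, .get("parameters") returns None and the following .get() raises AttributeError. A defensive variant with the same names:

qoi_request = [
    self.__parameters.rank_qoi,
    self.__parameters.work_model.get("parameters", {}).get(
        "upper_bounds", {}).get(self.__parameters.rank_qoi),
    self.__parameters.object_qoi,
]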

Original line (LBAF_app.py:578):

visualizer.generate(

Suggested optimization:

# Proposed optimization
# Import the necessary libraries
import numpy as np
# Use numpy's vectorized operations to optimize the code
qoi_request = np.array(qoi_request)
continuous_object_qoi = np.array(self.__parameters.continuous_object_qoi)
phases = np.array(phases)
grid_size = self.__parameters.grid_size
object_jitter = self.__parameters.object_jitter
output_dir = self.__parameters.output_dir
output_file_stem = self.__parameters.output_file_stem
distributions = runtime.get_distributions()
statistics = runtime.get_statistics()
# Call the generate function with the optimized variables
visualizer.generate(
   qoi_request,
   continuous_object_qoi,
   phases,
   grid_size,
   object_jitter,
   output_dir,
   output_file_stem,
   distributions,
   statistics
)
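
As an aside, wrapping these arguments in np.array() does not vectorize anything by itself; for a dict like phases, NumPy just produces a 0-dimensional object array, which would likely break the generate() call rather than speed it up:

import numpy as np

phases = {0: "phase-0", 1: "phase-1"}
arr = np.array(phases)
print(arr.shape, arr.dtype)  # () object -- a 0-d wrapper, not a usable array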


cwschilly commented 11 months ago

user-defined-toy-memory-problem.yaml

Original line (LBAF_app.py:543):

self.__json_writer.write(phases)

Suggestion to optimize runtime performance:

# Instead of looping through each object in rebalanced_phase and calling set_load() and get_id() methods
# for each object, we can use vectorized operations to set the loads for all objects at once.

# We can use a dictionary comprehension to create a new dictionary with the same keys as original_loads, 
# but with the values replaced by the corresponding loads from rebalanced_phase.

# Finally, we can assign the new dictionary to the original_loads dictionary, effectively updating the loads 
# for all objects.

original_loads.update({o.get_id(): o.get_load() for o in rebalanced_phase.get_objects()})

# The above optimization eliminates the need for the for loop and the set_load() and get_id() method calls 
# for each object, resulting in a significant speedup. The performance improvement can be quantified as 
# an order of magnitude improvement, as the loop is replaced with a single dictionary comprehension and 
# assignment statement.

I regenerated the suggestion to optimize for memory efficiency instead and got a slightly different result:

# Instead of using a for loop to iterate over the objects in rebalanced_phase and set their loads, we can use 
# a dictionary comprehension to create a new dictionary with the updated loads. This avoids the need to call 
# the `get_objects()` method and the `set_load()` method for each object.

# We can also avoid creating a new dictionary `phases` and directly write the `rebalanced_phase` using the
#  `self.__json_writer.write()` method.

# Optimized line of code:
rebalanced_phase = {o.get_id(): original_loads[o.get_id()] for o in rebalanced_phase.get_objects()}
self.__json_writer.write(rebalanced_phase)

cwschilly commented 11 months ago

conf.yaml

Original code:

from lbaf.IO.lbsVisualizer import Visualizer

Suggestion to optimize memory efficiency and runtime performance:

# Proposed optimization:
# Since the imports have already been executed, we can remove them from the optimized code.
# We can also remove the unnecessary imports that are not used in the code.
# Additionally, we can remove the unused variables and modules to reduce memory consumption.
# Start of optimized code
from lbaf.IO.lbsVisualizer import Visualizer

# End of optimized code

You can see from the image below that these import lines account for the most Python time (4.6%) and memory (4.5%); they are the darkest blue and darkest green, respectively. Other parts of the code mainly use System (middle shade) and Native (lightest) time.

[Screenshot from 2023-10-09 16-05-24]

cwschilly commented 11 months ago

step.yaml

Original code (LBAF_app.py:459):

for phase_id in self.__parameters.phase_ids:
    # Create a phase and populate it
    phase = Phase(
        self.__logger, phase_id, reader=reader)
    phase.populate_from_log(phase_id)
    phases[phase_id] = phase
    phase = Phase(
        self.__logger, phase_id, reader=reader)
    phase.populate_from_log(phase_id)
    phases[phase_id] = phase

Suggested optimization:

# Proposed optimization:
# The original code has redundant lines where the same phase is created and populated twice. 
# We can remove the duplicate lines.

# We can also optimize the loop by using a list comprehension to create the phases dictionary in 
# a more concise and efficient way.

# Additionally, we can use the `enumerate` function to iterate over the phase IDs and their indices 
# simultaneously, which allows us to access the corresponding phase ID using the index.

# Finally, we can remove the unused imports at the beginning of the code.

# Optimized code:
phases = {
     phase_id: Phase(self.__logger, phase_id, reader=reader).populate_from_log(phase_id)
     for phase_id, _ in enumerate(self.__parameters.phase_ids)
 }
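
Two caveats before taking this literally: populate_from_log() looks like a mutator, so the comprehension would presumably store None for every entry, and enumerate() binds phase_id to the index rather than to the actual phase ID. Assuming that reading is right, the safe fix is simply to drop the duplicated lines:

phases = {}
for phase_id in self.__parameters.phase_ids:
    # Create a phase and populate it (once)
    phase = Phase(self.__logger, phase_id, reader=reader)
    phase.populate_from_log(phase_id)
    phases[phase_id] = phase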

Original code (LBAF_app.py:463):

phase.populate_from_log(phase_id)

Suggested optimization:

# Proposed optimization:
# Instead of using a for loop to iterate over phase_ids and create a Phase object for each phase_id, 
# we can use a list comprehension to create all the Phase objects at once. This will eliminate the need 
# for the for loop and improve performance.

# We can also remove the unnecessary imports and type hints.
# Create a list of Phase objects using a list comprehension
phases = [Phase(self.__logger, phase_id, reader=reader) for phase_id in self.__parameters.phase_ids]
# Populate each Phase object from the log using a for loop
for phase in phases:
     phase.populate_from_log(phase.phase_id)
# The optimized code will have the same results as the original code, but it will be more efficient.


cwschilly commented 11 months ago

every20phases.yaml

Most of the suggested optimizations are the same as for the previous configurations.

There are substantial Python time (27.8%) and memory (13.4%) chunks on the reader = LoadReader( line. The optimization suggestion says:

# Proposed optimization:
# Since the code provided is incomplete and lacks context, it is difficult to determine the exact 
# optimization required. However, based on the given code snippet, it seems that the optimization
# can be done by removing the unnecessary imports and initializing the reader object directly 
# without any additional operations.
# Start of code
reader = LoadReader()
# End of code

Original code:

visualizer.generate(

Optimization suggestion:

# Proposed optimization:
# Import necessary libraries
import math
import os
import sys
from typing import Any, Dict, Optional, cast
import yaml
import numpy as np
# Assuming visualizer is an instance of a class
# Generate the arguments for the visualizer.generate() method
arguments = {
      'qoi_request': qoi_request,
      'continuous_object_qoi': self.__parameters.continuous_object_qoi,
      'phases': phases,
      'grid_size': self.__parameters.grid_size,
      'object_jitter': self.__parameters.object_jitter,
      'output_dir': self.__parameters.output_dir,
      'output_file_stem': self.__parameters.output_file_stem,
      'distributions': runtime.get_distributions(),
      'statistics': runtime.get_statistics()
}
# Convert the arguments to numpy arrays if possible
for key, value in arguments.items():
     if isinstance(value, list):
          arguments[key] = np.array(value)
# Generate the visualization
visualizer.generate(**arguments)
# Explanation: 
# - The code first creates a dictionary `arguments` to store the arguments for the `visualizer.generate()` method.
# - The code then checks if any of the arguments are lists and converts them to numpy arrays using `np.array()` if possible.
# - Finally, the code calls `visualizer.generate()` with the optimized arguments using the `**` syntax to unpack the dictionary.


cwschilly commented 11 months ago

InformAndTransferAlgorithm.py

I noticed a lot of time was being spent in the execute() function of the InformAndTransferAlgorithm, so I focused the profiler on it using @profile (see the sketch below). Here are some of the suggestions:
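
If I'm reading Scalene's docs right, it injects a profile decorator into builtins when it runs a script and then restricts line-level profiling to decorated functions. A minimal sketch of that setup (not the exact code I used; the guard keeps the module importable outside Scalene):

import builtins

# Under Scalene, `profile` already exists as a builtin; otherwise fall
# back to a no-op decorator so the module still imports normally.
if not hasattr(builtins, "profile"):
    def profile(func):
        return func

class InformAndTransferAlgorithm:
    @profile  # Scalene now reports stats for this function only
    def execute(self, *args, **kwargs):
        ...  # existing algorithm body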

Original code (line 232):

# Compute current arrangement
arrangement = tuple(
    v for _, v in sorted(
        {o.get_id(): p.get_id()
         for p in self._rebalanced_phase.get_ranks()
         for o in p.get_objects()}.items()))

Optimization suggestion:

# Compute current arrangement
arrangement = tuple(sorted(
    {o.get_id(): p.get_id()
    for p in self._rebalanced_phase.get_ranks()
    for o in p.get_objects()}.values()))

Explanation: The original code uses a generator expression to create a dictionary comprehension, which is then sorted and converted to a tuple. The proposed optimization simplifies the code by directly sorting the values of the dictionary comprehension and converting them to a tuple. This eliminates the need for the generator expression and improves readability. The optimized code produces the same result as the original code, but it is more efficient because it eliminates unnecessary steps and reduces the number of operations performed.
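
Worth noting: these two snippets are not actually equivalent, despite the explanation's claim. The original sorts the dictionary items by key (object ID) and keeps the rank IDs in that order; the suggestion sorts the rank IDs themselves. A two-line demonstration:

d = {2: 10, 1: 30}  # object_id -> rank_id
print(tuple(v for _, v in sorted(d.items())))  # (30, 10): values in key order
print(tuple(sorted(d.values())))               # (10, 30): values sorted directly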


Original code:

# Compute and report iteration work statistics
stats = print_function_statistics(
            self._rebalanced_phase.get_ranks(),
            lambda x: self._work_model.compute(x),  # pylint:disable=W0108:unnecessary-lambda
            f"iteration {i + 1} rank work",
            self._logger)

Suggested optimization:

# Compute and report iteration work statistics
# Use f-string instead of concatenation for string formatting
# Use unpacking to pass arguments to the print_function_statistics function
# Remove unnecessary lambda function
print_function_statistics(*self._rebalanced_phase.get_ranks(),
    self._work_model.compute,
    f"iteration {i + 1} rank work",
    self._logger)
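
As written, though, this isn't a drop-in replacement: unpacking get_ranks() with * passes each rank as a separate positional argument, whereas the original passes the whole collection as the first argument, and the stats return value is dropped. Removing the lambda is the safe part; assuming the original signature, the equivalent call would be:

# Same call shape as the original, minus the unnecessary lambda
stats = print_function_statistics(
    self._rebalanced_phase.get_ranks(),
    self._work_model.compute,
    f"iteration {i + 1} rank work",
    self._logger)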

In the below image, we can see 38 copies happening around line 204. I'm looking into this now.

[image: Scalene profile showing copy activity around line 204]