fabric-testbed / InformationModel

FABRIC Information Model library
MIT License
7 stars 1 forks source link

Topology validate fails with error: dictionary changed size during iteration #131

Closed kthare10 closed 1 year ago

kthare10 commented 2 years ago

Topology validate fails with error: dictionary changed size during iteration

Snapshot of the stack trace:

2022-08-18 13:19:29,962 - orchestrator - {slices_controller.py:66} - [waitress-0] - ERROR - dictionary changed size during iteration
Traceback (most recent call last):
  File "/usr/src/app/fabric_cf/orchestrator/core/orchestrator_handler.py", line 248, in create_slice
    topology.validate()
  File "/usr/local/lib/python3.9/site-packages/fim/user/topology.py", line 661, in validate
    s.validate_constraints(node_interfaces)
  File "/usr/local/lib/python3.9/site-packages/fim/user/network_service.py", line 231, in validate_constraints
    self.__validate_nstype_constraints(nstype, interfaces)
  File "/usr/local/lib/python3.9/site-packages/fim/user/network_service.py", line 218, in __validate_nstype_constraints
    sites.add(owner.site)
  File "/usr/local/lib/python3.9/site-packages/fim/user/node.py", line 111, in site
    return self.get_property('site') if self.__dict__.get('topo', None) is not None else None
  File "/usr/local/lib/python3.9/site-packages/fim/user/node.py", line 227, in get_property
    _, node_properties = self.topo.graph_model.get_node_properties(node_id=self.node_id)
  File "/usr/local/lib/python3.9/site-packages/fim/graph/networkx_property_graph.py", line 92, in get_node_properties
    node_props = self.storage.get_graph(self.graph_id).nodes[self._find_node(node_id=node_id)].copy()
  File "/usr/local/lib/python3.9/site-packages/fim/graph/networkx_mixin.py", line 58, in _find_node
    query_match = list(nxq.search_nodes(self.storage.get_graph(use_graph_id),
RuntimeError: dictionary changed size during iteration
kthare10 commented 2 years ago

ospf_slice.graphml.txt

Attaching the graphml of the topology which caused the error.

ibaldin commented 2 years ago

Need to see if this is still a problem.

ibaldin commented 1 year ago

@kthare10 is this error popping up in Orchestrator at graph ingestion? Is it possible somehow the model is being modified by one thread while being validated by another? This code is not thread-safe and at least at first glance this is looking to me like a race condition. Simply validating this graph after loading it doesn't create any problems.

kthare10 commented 1 year ago

Yes, the error pops up as soon as Orchestrator is validating the incoming create slice request. No other thread is processing the slice at that point. This function is called directly from the Flask end point handler. I'll try to reproduce it and keep you posted.

    def create_slice(self, *, token: str, slice_name: str, slice_graph: str, ssh_key: str,
                     lease_end_time: str) -> List[dict]:
        if self.globals.is_maintenance_mode_on():
            raise OrchestratorException(Constants.MAINTENANCE_MODE_ERROR)

        slice_id = None
        controller = None
        new_slice_object = None
        asm_graph = None
        try:
            end_time = self.__validate_lease_end_time(lease_end_time=lease_end_time)

            controller = self.controller_state.get_management_actor()
            self.logger.debug(f"create_slice invoked for Controller: {controller}")

            # Validate the slice graph
            topology = ExperimentTopology(graph_string=slice_graph)
            topology.validate()
....
kthare10 commented 1 year ago

It looks the issue is because NetworkXGraphImporter class has an inbuilt NetworkXGraphStorage class which uses a singleton object __NetworkXGraphStorage which is not thread safe.

Also, in the current CF implementation, when the topology object goes out of scope it is not explicitly deleting the graph from the NetworkXGraphStorage. So this Singleton instance on production must be growing pretty big.

It has resulted in some of the operations taking as long as 250 seconds.

class NetworkXGraphImporter(ABCGraphImporter):
    """
    Importer for NetworkX graphs. Stores graphs in a single NetworkX
    object. Care is taken to disambiguate nodes when loading graphs.
    """
    READ_FORMATS = ["json_nodelink", "graphml"]

    def __init__(self, *, logger=None):
        """
        Initialize the importer setting up storage and logger
        :param logger:
        """
        self.storage = NetworkXGraphStorage(logger=logger)

class NetworkXGraphStorage:
    def __init__(self, logger=None):
        if not NetworkXGraphStorage.storage_instance:
            NetworkXGraphStorage.storage_instance = NetworkXGraphStorage.__NetworkXGraphStorage(logger=logger)

    class __NetworkXGraphStorage:
        """
        Singleton in-memory storage for graphs
        """

        def __init__(self, logger=None):
            self.graphs = nx.Graph()
            self.start_id = 1
            self.log = logger