DARMA-tasking / LB-analysis-framework

Analysis framework for exploring, testing, and comparing load balancing strategies

Same rank created multiple times by VT data reader #305

Closed ppebay closed 1 year ago

ppebay commented 1 year ago

Example with the memory.yaml configuration, after adding a print statement to the Rank constructor:

[lbsVTDataReader] Reading /Users/pppebay/Documents/Git/LB-analysis-framework/data/user-defined-memory-exemplar/fake_mem_usage.2.json VT object map
0 23068672.0 68157440.0
0 23068672.0 139460608.0
0 23068672.0 206569472.0
0 23068672.0 276824064.0
0 23068672.0 276824064.0
0 23068672.0 276824064.0
0 23068672.0 276824064.0
0 23068672.0 276824064.0
0 23068672.0 276824064.0
0 23068672.0 276824064.0
0 23068672.0 276824064.0
0 23068672.0 276824064.0
0 23068672.0 276824064.0
0 23068672.0 276824064.0
0 23068672.0 276824064.0
0 23068672.0 276824064.0
0 23068672.0 276824064.0
0 23068672.0 276824064.0
0 23068672.0 276824064.0
0 23068672.0 276824064.0
0 23068672.0 276824064.0

The same rank is created as many times as objects belonging to it are encountered by the VT reader. Is this normal?

marcinwrobel1986 commented 1 year ago

@ppebay I have no idea what the problem is. I put task_id at the beginning of your print statement and got:

5505027 0 0 23068672.0 68157440.0
5242883 0 0 23068672.0 139460608.0
4980739 0 0 23068672.0 206569472.0
4718595 0 0 23068672.0 276824064.0
4456451 0 0 23068672.0 276824064.0
1572867 0 0 23068672.0 276824064.0
1310723 0 0 23068672.0 276824064.0
1048579 0 0 23068672.0 276824064.0
786435 0 0 23068672.0 276824064.0
262147 0 0 23068672.0 276824064.0
524291 0 0 23068672.0 276824064.0
1835011 0 0 23068672.0 276824064.0
2097155 0 0 23068672.0 276824064.0
2359299 0 0 23068672.0 276824064.0
2621443 0 0 23068672.0 276824064.0
2883587 0 0 23068672.0 276824064.0
3145731 0 0 23068672.0 276824064.0
3407875 0 0 23068672.0 276824064.0
3670019 0 0 23068672.0 276824064.0
3932163 0 0 23068672.0 276824064.0
4194307 0 0 23068672.0 276824064.0
3670023 0 1 25165824.0 73400320.0
3407879 0 1 25165824.0 149946368.0
3145735 0 1 25165824.0 149946368.0
2883591 0 1 25165824.0 149946368.0
2621447 0 1 25165824.0 149946368.0
2359303 0 1 25165824.0 149946368.0
2097159 0 1 25165824.0 149946368.0
1835015 0 1 25165824.0 149946368.0
524295 0 1 25165824.0 226492416.0
262151 0 1 25165824.0 226492416.0
786439 0 1 25165824.0 303038464.0
1048583 0 1 25165824.0 303038464.0
1310727 0 1 25165824.0 303038464.0
1572871 0 1 25165824.0 303038464.0
5505035 0 2 27262976.0 76546048.0
5242891 0 2 27262976.0 76546048.0
4980747 0 2 27262976.0 152043520.0
4718603 0 2 27262976.0 152043520.0
4456459 0 2 27262976.0 152043520.0
1572875 0 2 27262976.0 222298112.0
1310731 0 2 27262976.0 222298112.0
1048587 0 2 27262976.0 222298112.0
786443 0 2 27262976.0 222298112.0
262155 0 2 27262976.0 297795584.0
524299 0 2 27262976.0 297795584.0
1835019 0 2 27262976.0 297795584.0
2097163 0 2 27262976.0 297795584.0
2359307 0 2 27262976.0 297795584.0
2621451 0 2 27262976.0 297795584.0
2883595 0 2 27262976.0 297795584.0
3145739 0 2 27262976.0 297795584.0
3407883 0 2 27262976.0 297795584.0
3670027 0 2 27262976.0 297795584.0
3932171 0 2 27262976.0 297795584.0
4194315 0 2 27262976.0 297795584.0
3932175 0 3 24117248.0 74448896.0
3670031 0 3 24117248.0 147849216.0
3407887 0 3 24117248.0 218103808.0
3145743 0 3 24117248.0 218103808.0
2883599 0 3 24117248.0 218103808.0
2621455 0 3 24117248.0 218103808.0
2359311 0 3 24117248.0 218103808.0
2097167 0 3 24117248.0 292552704.0
1835023 0 3 24117248.0 292552704.0
524303 0 3 24117248.0 292552704.0
262159 0 3 24117248.0 292552704.0
786447 0 3 24117248.0 292552704.0
1048591 0 3 24117248.0 292552704.0
1310735 0 3 24117248.0 292552704.0
1572879 0 3 24117248.0 292552704.0

As you can see above, these are all different tasks read from the data:

In fake_mem_usage.0.json, the value 23068672.0 appears 21 times.
In fake_mem_usage.1.json, the value 25165824.0 appears 14 times.
In fake_mem_usage.2.json, the value 27262976.0 appears 21 times.
In fake_mem_usage.3.json, the value 24117248.0 appears 15 times.

To me, it simply prints what is in the data. I cannot see any problem with that.

ppebay commented 1 year ago

Thanks. So that means that the same object (task) causes the Rank() constructor to be invoked multiple times for the same rank ID and phase, right? That should not happen: a given rank at a given phase ID exists only once in the distribution.

ppebay commented 1 year ago

The constructor is called several times for the same rank id and phase id.

This is harmless for instance variables that are read from file, but for those computed by summation across objects (like the shared block memory that I added last week), the summation counter is reset to 0.0 each time, so the aggregated value is wrong.

The main logic should be the following: the Rank() constructor for a given rank ID and a given phase ID, which uniquely identify ONE rank instance in the LBAF model, should be called only once.

If it is called more than once for the same rank ID and same phase ID then it is a bug.
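
To illustrate the failure mode described above, here is a minimal sketch. It assumes a hypothetical FakeRank class whose running total starts at 0.0 in the constructor; the class, its methods, and the two reader functions are invented for illustration and are not the LBAF code.

```python
class FakeRank:
    """Hypothetical stand-in for a rank; not the LBAF Rank class."""

    def __init__(self, rank_id: int):
        self.rank_id = rank_id
        # The summation counter starts at zero in every new instance
        self.shared_memory = 0.0

    def add_object_memory(self, size: float) -> None:
        """Accumulate the memory contributed by one object."""
        self.shared_memory += size


def read_with_repeated_construction(sizes):
    """Build a new rank for every object: the total restarts each time."""
    ranks = {}
    for size in sizes:
        ranks[0] = FakeRank(0)  # replaces the previous instance
        ranks[0].add_object_memory(size)
    return ranks[0].shared_memory


def read_with_single_construction(sizes):
    """Build the rank once, then accumulate across all objects."""
    ranks = {}
    for size in sizes:
        if 0 not in ranks:
            ranks[0] = FakeRank(0)
        ranks[0].add_object_memory(size)
    return ranks[0].shared_memory


sizes = [1.0, 2.0, 3.0]
buggy_total = read_with_repeated_construction(sizes)
correct_total = read_with_single_construction(sizes)
print(buggy_total, correct_total)
```

In the first reader only the last object contributes to the total, which is the kind of wrong aggregate described above.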

marcinwrobel1986 commented 1 year ago

Here is what I have checked so far. The flow is:

On line 231, a new Rank is stored only when the key phase_id is absent; if the key phase_id is already present, setdefault returns the value stored under that key:

                    returned_dict.setdefault(phase_id, Rank(node_id, logger=self.__logger))

Please see the description of the dict.setdefault method in the Python documentation.

Further clarification:

Sample simulation:

    >>> a = dict()
    >>> a.setdefault(0, 'initial value')
    'initial value'
    >>> a.setdefault(0, 'changed value')
    'initial value'
    >>> a
    {0: 'initial value'}

The initial value is not overwritten when setdefault is called on dictionary a with the existing key 0. When the key (here 0) already exists, setdefault always returns the stored value (here 'initial value').

@ppebay please let me know if that's clear enough and answers your problem.

marcinwrobel1986 commented 1 year ago

@ppebay I prepared PR #306. Please check whether it fixes the problem you have.

ppebay commented 1 year ago

Thanks but it does not fix the problem.

The issue is more fundamental and has to do with the fact that the Rank should be fully populated/parameterized only after all its constituent objects (tasks) have been processed.

marcinwrobel1986 commented 1 year ago

@ppebay the fix I proposed in #306 made sure that the Rank constructor is called only once per Rank. Consider the case where the following line stays in the code:

returned_dict.setdefault(phase_id, Rank(node_id, logger=self.__logger))

It instantiates a Rank each time this method is called, but it returns the Rank object that was instantiated the first time the method was called. The newly created Rank then goes out of scope with no reference left, and it is collected by the garbage collector.

My fix in #306 prevents the unnecessary creation of Rank objects.
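
The eager evaluation of the setdefault argument can be checked with a short sketch. CountingRank below is a hypothetical class that only counts constructor calls; the explicit membership check is one possible way to avoid the extra instances, not necessarily what #306 does.

```python
class CountingRank:
    """Hypothetical class that counts how often its constructor runs."""

    constructed = 0

    def __init__(self, rank_id: int):
        CountingRank.constructed += 1
        self.rank_id = rank_id


# Variant 1: the default argument is evaluated before setdefault runs,
# so the constructor is called on every iteration
ranks = {}
for _ in range(3):
    rank = ranks.setdefault(0, CountingRank(0))
calls_with_setdefault = CountingRank.constructed
first_instance_kept = rank is ranks[0]

# Variant 2: construct only when the key is missing
CountingRank.constructed = 0
ranks = {}
for _ in range(3):
    if 0 not in ranks:
        ranks[0] = CountingRank(0)
    rank = ranks[0]
calls_with_check = CountingRank.constructed

print(calls_with_setdefault, calls_with_check, first_instance_kept)
```

Both variants keep the first instance in the dictionary; they differ only in how many throwaway objects are constructed along the way.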

marcinwrobel1986 commented 1 year ago

> The issue is more fundamental and has to do with the fact that the Rank should be fully populated/parameterized only after all its constituent objects (tasks) have been processed

@ppebay I also looked at how Rank is instantiated. The constructor creates a set for each of migratable_objects and sentinel_objects, and the Objects that are passed in are simply added with the set method add(). The current constructor is below:

    def __init__(self, i: int, logger: Logger, mo: set = None, so: set = None):
        # Assign logger to instance variable
        self.__logger = logger

        # Member variables passed by constructor
        self.__index = i
        self.__migratable_objects = set()
        if mo is not None:
            for o in mo:
                self.__migratable_objects.add(o)
        self.__sentinel_objects = set()
        if so is not None:
            for o in so:
                self.__sentinel_objects.add(o)

        # No information about peers is known initially
        self.__known_loads = {}

        # No viewers exist initially
        self.__viewers = set()

        # No message was received initially
        self.round_last_received = 0

The same applies when a Rank is instantiated empty and Objects are added to it afterwards, using the methods below:

    def add_migratable_object(self, o: Object) -> None:
        """ Add object to migratable objects."""
        return self.__migratable_objects.add(o)

    def add_sentinel_object(self, o: Object) -> None:
        """ Add object to sentinel objects."""
        return self.__sentinel_objects.add(o)

The potential problem with instance variables could arise when one performs an operation on the Rank before all Objects have been added to it.
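
One way to sidestep that ordering problem, sketched here under assumptions, is to derive aggregated quantities from the object set at the time they are requested instead of caching them in the constructor. SketchObject, SketchRank, and get_load are hypothetical names for illustration only; this is not the LBAF API and not necessarily how the issue was eventually fixed.

```python
class SketchObject:
    """Hypothetical object (task) carrying a load value."""

    def __init__(self, object_id: int, load: float):
        self.object_id = object_id
        self.load = load


class SketchRank:
    """Hypothetical rank that aggregates on demand."""

    def __init__(self, rank_id: int):
        self.rank_id = rank_id
        self._objects = set()

    def add_object(self, o: SketchObject) -> None:
        """Add one object to this rank."""
        self._objects.add(o)

    def get_load(self) -> float:
        """Sum the loads of the objects currently held by this rank."""
        return sum(o.load for o in self._objects)


rank = SketchRank(0)
rank.add_object(SketchObject(1, 2.0))
partial_load = rank.get_load()  # reflects only the first object
rank.add_object(SketchObject(2, 3.5))
full_load = rank.get_load()  # reflects both objects
print(partial_load, full_load)
```

With this design a value queried too early is incomplete but never stale: querying again after all objects are added gives the full aggregate.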

ppebay commented 1 year ago

Fixed by PR #307