Closed ppebay closed 1 year ago
@ppebay I have no idea what the problem is.
I have put task_id at the beginning of your print statement and I got:
5505027 0 0 23068672.0 68157440.0
5242883 0 0 23068672.0 139460608.0
4980739 0 0 23068672.0 206569472.0
4718595 0 0 23068672.0 276824064.0
4456451 0 0 23068672.0 276824064.0
1572867 0 0 23068672.0 276824064.0
1310723 0 0 23068672.0 276824064.0
1048579 0 0 23068672.0 276824064.0
786435 0 0 23068672.0 276824064.0
262147 0 0 23068672.0 276824064.0
524291 0 0 23068672.0 276824064.0
1835011 0 0 23068672.0 276824064.0
2097155 0 0 23068672.0 276824064.0
2359299 0 0 23068672.0 276824064.0
2621443 0 0 23068672.0 276824064.0
2883587 0 0 23068672.0 276824064.0
3145731 0 0 23068672.0 276824064.0
3407875 0 0 23068672.0 276824064.0
3670019 0 0 23068672.0 276824064.0
3932163 0 0 23068672.0 276824064.0
4194307 0 0 23068672.0 276824064.0
3670023 0 1 25165824.0 73400320.0
3407879 0 1 25165824.0 149946368.0
3145735 0 1 25165824.0 149946368.0
2883591 0 1 25165824.0 149946368.0
2621447 0 1 25165824.0 149946368.0
2359303 0 1 25165824.0 149946368.0
2097159 0 1 25165824.0 149946368.0
1835015 0 1 25165824.0 149946368.0
524295 0 1 25165824.0 226492416.0
262151 0 1 25165824.0 226492416.0
786439 0 1 25165824.0 303038464.0
1048583 0 1 25165824.0 303038464.0
1310727 0 1 25165824.0 303038464.0
1572871 0 1 25165824.0 303038464.0
5505035 0 2 27262976.0 76546048.0
5242891 0 2 27262976.0 76546048.0
4980747 0 2 27262976.0 152043520.0
4718603 0 2 27262976.0 152043520.0
4456459 0 2 27262976.0 152043520.0
1572875 0 2 27262976.0 222298112.0
1310731 0 2 27262976.0 222298112.0
1048587 0 2 27262976.0 222298112.0
786443 0 2 27262976.0 222298112.0
262155 0 2 27262976.0 297795584.0
524299 0 2 27262976.0 297795584.0
1835019 0 2 27262976.0 297795584.0
2097163 0 2 27262976.0 297795584.0
2359307 0 2 27262976.0 297795584.0
2621451 0 2 27262976.0 297795584.0
2883595 0 2 27262976.0 297795584.0
3145739 0 2 27262976.0 297795584.0
3407883 0 2 27262976.0 297795584.0
3670027 0 2 27262976.0 297795584.0
3932171 0 2 27262976.0 297795584.0
4194315 0 2 27262976.0 297795584.0
3932175 0 3 24117248.0 74448896.0
3670031 0 3 24117248.0 147849216.0
3407887 0 3 24117248.0 218103808.0
3145743 0 3 24117248.0 218103808.0
2883599 0 3 24117248.0 218103808.0
2621455 0 3 24117248.0 218103808.0
2359311 0 3 24117248.0 218103808.0
2097167 0 3 24117248.0 292552704.0
1835023 0 3 24117248.0 292552704.0
524303 0 3 24117248.0 292552704.0
262159 0 3 24117248.0 292552704.0
786447 0 3 24117248.0 292552704.0
1048591 0 3 24117248.0 292552704.0
1310735 0 3 24117248.0 292552704.0
1572879 0 3 24117248.0 292552704.0
As you can see above, these are all different tasks read from the data.
In fake_mem_usage.0.json, the value 23068672.0 appears 21 times.
In fake_mem_usage.1.json, the value 25165824.0 appears 14 times.
In fake_mem_usage.2.json, the value 27262976.0 appears 21 times.
In fake_mem_usage.3.json, the value 24117248.0 appears 15 times.
For me, it prints exactly what is in the data. I cannot see any problem there.
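One way to double-check this kind of claim is to tally the printed lines per rank and confirm that no task_id repeats. The sketch below is only illustrative: it uses a handful of lines copied from the output above, and the column order (task_id, phase ID, rank ID, then two memory values) is an assumption inferred from the output, not something stated in the thread. The function name `summarize` is hypothetical.

```python
from collections import Counter

# A few lines copied from the printed output above.
# Assumed column order: task_id, phase_id, rank_id, then two memory values.
SAMPLE = """\
5505027 0 0 23068672.0 68157440.0
5242883 0 0 23068672.0 139460608.0
3670023 0 1 25165824.0 73400320.0
3407879 0 1 25165824.0 149946368.0
5505035 0 2 27262976.0 76546048.0
3932175 0 3 24117248.0 74448896.0
"""


def summarize(text):
    """Return (number of lines per rank, sorted list of duplicated task ids)."""
    per_rank = Counter()
    seen = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip anything that is not a data row
        task_id, _phase_id, rank_id = (int(f) for f in fields[:3])
        per_rank[rank_id] += 1
        seen[task_id] += 1
    duplicates = sorted(t for t, n in seen.items() if n > 1)
    return per_rank, duplicates


per_rank, duplicates = summarize(SAMPLE)
print(dict(per_rank))
print(duplicates)
```

Run against the full output, an empty duplicates list would confirm that every printed line corresponds to a distinct task.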
Thanks. So that means that the same object (task) causes the Rank() constructor to be invoked multiple times for the same rank ID and phase, right? That should not happen: a given rank at a given phase ID only exists once in the distribution.
The constructor is called several times for the same rank id and phase id.
As a result, when some instance variables are read from file, that is fine -- but for others that are based on summation across objects (like the shared block memory that I added last week) it results in the summation counter being reset to 0.0 each time -- and as a result, the aggregated value being wrong.
The main logic should be the following: the Rank() constructor for a given rank ID and a given phase ID, which uniquely identify ONE rank instance in the LBAF model, should be called only once.
If it is called more than once for the same rank ID and same phase ID then it is a bug.
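The invariant stated above could be enforced with a small guard that fails loudly on a second construction for the same pair. This is a minimal sketch, not LBAF code: `RankRegistry` and `DuplicateRankError` are hypothetical names introduced only for illustration.

```python
class DuplicateRankError(RuntimeError):
    """Raised when a (phase_id, rank_id) pair is registered twice."""


class RankRegistry:
    """Hypothetical helper tracking which (phase_id, rank_id) pairs were built."""

    def __init__(self):
        self._built = set()

    def register(self, phase_id, rank_id):
        key = (phase_id, rank_id)
        if key in self._built:
            raise DuplicateRankError(
                f"rank {rank_id} already constructed for phase {phase_id}")
        self._built.add(key)


registry = RankRegistry()
registry.register(0, 3)
registry.register(1, 3)  # same rank, different phase: allowed
try:
    registry.register(0, 3)  # same rank, same phase: flagged as a bug
    raised = False
except DuplicateRankError:
    raised = True
print(raised)
```

Calling `register` from a constructor during debugging would turn the silent repeated instantiation into an immediate error.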
I have checked so far and the flow is:
- LoadReader has an instance variable called vt_files
- we iterate over LoadReader.vt_files
- each file in LoadReader.vt_files can have many phases, and we iterate over them (phases) as well
- phases have tasks, and we iterate over these tasks as well
- for each of the tasks we:
  - create an Object based on information from the task
  - create a Rank if it hasn't existed yet (if it exists, the existing Rank is returned)
  - add the Object to the Rank (to migratable objects or sentinel objects)
On line 231 Rank is instantiated, but only when there is no key phase_id; if the key phase_id is present, the value stored under that key is returned:
returned_dict.setdefault(phase_id, Rank(node_id, logger=self.__logger))
Please find a description of the setdefault method in the Python documentation for dict.setdefault.
Further clarification:
- returned_dict is a dictionary which is passed to the json_reader() method (it is not created inside this method)
- Rank inside the returned_dict dictionary can be created/instantiated more than once, with phase_id as a key
Sample simulation:
>>> a = dict()
>>> a.setdefault(0, 'initial value')
'initial value'
>>> a.setdefault(0, 'changed value')
'initial value'
>>> a
{0: 'initial value'}
'initial value' will not be overwritten when using the setdefault method on dictionary a with the existing key 0. The setdefault method always returns the stored value (in this case 'initial value') when the key (in this case 0) exists.
@ppebay please let me know if that's clear enough and answers your problem.
@ppebay I prepared PR #306. Please check whether it fixes the problem you have.
Thanks, but it does not fix the problem.
The issue is more fundamental and has to do with the fact that the Rank should be fully populated/parameterized only after all its constituent objects (tasks) have been processed.
@ppebay the fix I proposed in #306 was making sure that the Rank constructor was called only once per Rank.
Consider the case where this line stays in the code:
returned_dict.setdefault(phase_id, Rank(node_id, logger=self.__logger))
It instantiates Rank each time this method is called, because the second argument is evaluated before setdefault runs, but it returns the Rank object that was instantiated on the first call. The newly created instance then goes out of scope with no reference to it and is collected by the garbage collector.
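The behavior described above can be reproduced in isolation. In the sketch below, `CountingRank` is a hypothetical stand-in for Rank that only counts constructor calls; it shows that setdefault keeps the first stored object while still paying for a new construction on every call, and that an explicit membership test avoids the extra construction.

```python
class CountingRank:
    """Hypothetical stand-in for Rank that counts constructor calls."""

    constructed = 0

    def __init__(self, rank_id):
        CountingRank.constructed += 1
        self.rank_id = rank_id


# Eager: the default argument is built on every call, even when it is unused.
eager = {}
first = eager.setdefault(0, CountingRank(7))
second = eager.setdefault(0, CountingRank(7))
eager_calls = CountingRank.constructed

# Lazy: construct only when the key is actually missing.
CountingRank.constructed = 0
lazy = {}
for _ in range(2):
    if 0 not in lazy:
        lazy[0] = CountingRank(7)
lazy_calls = CountingRank.constructed

print(first is second, eager_calls, lazy_calls)
```

The stored object is the same in both variants; only the number of constructor invocations differs, which matters as soon as the constructor has side effects.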
> The issue is more fundamental and has to do with the fact that the Rank should be fully populated/parameterized only after all its constituent objects (tasks) have been processed
@ppebay I also looked at how Rank is instantiated. A set is created in the constructor for both migratable_objects and sentinel_objects, and the Objects that are passed in are simply added with the set method add().
Below is the current constructor:
def __init__(self, i: int, logger: Logger, mo: set = None, so: set = None):
    # Assign logger to instance variable
    self.__logger = logger
    # Member variables passed by constructor
    self.__index = i
    self.__migratable_objects = set()
    if mo is not None:
        for o in mo:
            self.__migratable_objects.add(o)
    self.__sentinel_objects = set()
    if so is not None:
        for o in so:
            self.__sentinel_objects.add(o)
    # No information about peers is known initially
    self.__known_loads = {}
    # No viewers exist initially
    self.__viewers = set()
    # No message was received initially
    self.round_last_received = 0
The same applies when Rank is instantiated empty and Objects are then added to it.
Methods below:
def add_migratable_object(self, o: Object) -> None:
    """ Add object to migratable objects."""
    return self.__migratable_objects.add(o)

def add_sentinel_object(self, o: Object) -> None:
    """ Add object to sentinel objects."""
    return self.__sentinel_objects.add(o)
A potential problem with instance variables arises when one tries to perform an operation on the Rank before all of its Objects have been added to it.
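One way to sidestep that ordering problem is to group all tasks by (phase ID, rank ID) first and build each rank exactly once, so that any value summed across objects is computed from the complete set. This is a minimal sketch under stated assumptions, not the fix from the PRs mentioned in this thread: `SimpleRank`, `build_ranks`, and the tuple layout of `tasks` are hypothetical.

```python
from collections import defaultdict


class SimpleRank:
    """Hypothetical simplified rank: aggregates are computed once, at build time."""

    def __init__(self, rank_id, object_loads):
        self.rank_id = rank_id
        self.object_loads = dict(object_loads)
        # Summed only after every object of this rank is known.
        self.total_load = sum(self.object_loads.values())


def build_ranks(tasks):
    """tasks: iterable of (phase_id, rank_id, object_id, load) tuples."""
    grouped = defaultdict(dict)
    for phase_id, rank_id, object_id, load in tasks:
        grouped[(phase_id, rank_id)][object_id] = load
    # One constructor call per (phase_id, rank_id) pair.
    return {key: SimpleRank(key[1], loads) for key, loads in grouped.items()}


tasks = [(0, 0, 10, 1.5), (0, 0, 11, 2.5), (0, 1, 12, 4.0)]
ranks = build_ranks(tasks)
print(len(ranks), ranks[(0, 0)].total_load)
```

With this two-pass structure, no aggregate can be reset halfway through reading, because the rank does not exist until its object set is complete.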
Fixed by PR #307
Example, with the memory.yaml configuration, when putting a print statement in the rank constructor:
The same rank is created as many times as objects belonging to it are encountered by the VT reader. Is this normal?