CrayLabs / SmartSim

SmartSim Infrastructure Library.
BSD 2-Clause "Simplified" License
235 stars 37 forks source link

Ambiguous API for DB Object Placement Swallows `first_device` Parameter #534

Open MattToast opened 8 months ago

MattToast commented 8 months ago

Description

When a user specifies a first_device parameter, we do not use it to format the enumerated devices IFF the num_devices_per_node != 1. This can lead to some unexpected formatting of enumerated devices.

How to reproduce

Consider the Following Script

from smartsim.entity.dbobject import DBModel

def pprint_db_obj(obj):
    print(f"{obj.name}: {obj.devices}")

def main():
    o1 = DBModel(name="Normal Object", model=b"some-model", backend="TORCH",
                 device="GPU")
    o2 = DBModel(name="Object w/ Device as str", model=b"some-model", backend="TORCH",
                 device="GPU:1")
    o3 = DBModel(name="Object w/ First Device", model=b"some-model", backend="TORCH",
                 device="GPU", first_device=1)
    o4 = DBModel(name="Object w/ Single Enumerated First Device", model=b"some-model", backend="TORCH",
                 device="GPU", first_device=1, devices_per_node=1)
    o5 = DBModel(name="Object w/ Multi Enumerated First Device", model=b"some-model", backend="TORCH",
                 device="GPU", first_device=1, devices_per_node=3)
    o6 = DBModel(name="Object w/ Conflicting First Device", model=b"some-model", backend="TORCH",
                 device="GPU:123", first_device=456)

    pprint_db_obj(o1)
    pprint_db_obj(o2)
    pprint_db_obj(o3)
    pprint_db_obj(o4)
    pprint_db_obj(o5)
    pprint_db_obj(o6)
    print("=" * 20)

    o7 = DBModel(name="Object where/ First Device should be 123", model=b"some-model", backend="TORCH",
                 device="GPU", first_device=123)

    pprint_db_obj(o7)
    print("     Yikes!! We have no pin for this GPU!!! ^^^")

    return 0

if __name__ == "__main__":
    raise SystemExit(main())

This is the current output:

Normal Object: ['GPU']
Object w/ Device as str: ['GPU:1']
Object w/ First Device: ['GPU']
Object w/ Single Enumerated First Device: ['GPU']
Object w/ Multi Enumerated First Device: ['GPU:1', 'GPU:2', 'GPU:3']
Object w/ Conflicting First Device: ['GPU:123']
====================
Object where/ First Device should be 123: ['GPU']
     Yikes!! We have no pin for this GPU!!! ^^^

Expected behavior

This is what I would expect for output

Normal Object: ['GPU']
Object w/ Device as str: --> N/A: Constructing a DBObject where `device` != `CPU` or `GPU` (case insensitive) should raise an error <--
                         --> NOTE: THIS WOULD CONSTITUTE AN API BREAK <--
Object w/ First Device: ['GPU:1']
Object w/ Single Enumerated First Device: ['GPU:1']
Object w/ Multi Enumerated First Device: ['GPU:1', 'GPU:2', 'GPU:3']
Object w/ Conflicting First Device: --> N/A: Constructing a DBObject where `device` != `CPU` or `GPU` (case insensitive) should raise an error <--
                                    --> NOTE: THIS WOULD CONSTITUTE AN API BREAK <--
====================
Object where/ First Device should be 123: ['GPU:123']
         Hooray! This is what I expected!!  ^^^^^^^

But of course this can be open to discussion/debate. At the very least, I think we should error if a user specifies device="GPU:123" and first_device=456.

System

Any/All

mellis13 commented 8 months ago

Consensus was to raise an error if the device has a ":" specifier and the first device is specified.