furiosa-ai / device-api

APIs that offer information about NPU devices and allow controlling them
Apache License 2.0

Distinguish device busy from other errors #93

Closed senokay closed 1 year ago

senokay commented 1 year ago

Consider the following Python fragment (but the issue itself is unrelated to Python bindings):

import os, asyncio
from furiosa_device import *

async def main():
    fd = os.open("/dev/npu1pe0-1", os.O_RDWR)
    try:
        files = await find_device_files(DeviceConfig.from_str("npu:1:0-1"))
        print(files)
    finally:
        os.close(fd)

asyncio.run(main())

This fails, as expected, when there is only a single NPU with two PEs, because the device file is already held open. However, the following error message:

RuntimeError: Device npu:1:0-1 not found

...is misleading, as we do have npu:1:0-1; we just cannot open it. It should read something like this instead (the actual exception type is subject to change):

RuntimeError: Device npu:1:0-1 found but still in use

The original message should remain, though, when the device is outright non-existent:

RuntimeError: Device npu:1:0-2 not found

A corresponding enum variant should also be added to DeviceError.
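To make the distinction concrete, here is a minimal sketch of how the two cases could be told apart at the point where the device file is opened. It assumes the driver reports a busy device as `EBUSY` and a missing one as `ENOENT`. `DeviceErrorKind`, `classify_open_error` and `describe` are hypothetical names used only for illustration; they are not the actual `DeviceError` variants or any furiosa_device API.

```python
import enum
import errno


class DeviceErrorKind(enum.Enum):
    """Hypothetical error kinds; not the real DeviceError variants."""
    NOT_FOUND = "not found"
    BUSY = "found but still in use"
    OTHER = "could not be opened"


def classify_open_error(exc: OSError) -> DeviceErrorKind:
    """Map an OSError raised while opening a device file to an error kind.

    Assumption: the driver signals an in-use device with EBUSY.
    """
    if exc.errno == errno.ENOENT:
        return DeviceErrorKind.NOT_FOUND
    if exc.errno == errno.EBUSY:
        return DeviceErrorKind.BUSY
    return DeviceErrorKind.OTHER


def describe(device: str, exc: OSError) -> str:
    """Build a message in the style proposed in this issue."""
    return f"Device {device} {classify_open_error(exc).value}"
```

With this shape, a caller can branch on the kind instead of parsing the message text.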

Motivation

Device busy can happen in some furiosa-runtime test cases when the runtime fails to initialize a new session for whatever reason: the session needs to open the device file, and the error might be reported before the device file has actually been closed and made available for use again.

While this particular issue could also be "fixed" by delaying the error reporting until the device file is known to be closed, that feels wrong because the runtime may be unable to close the device file, and that case would be indistinguishable from the closed case. In other words, the caller has no actual guarantee anyway (unless the runtime doesn't return, which would be absurd). So session initialization should retry for a while instead, and it needs to distinguish device busy from other errors in order to avoid useless retries (e.g. on an environment variable error).
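The retry behavior described above could look roughly like the following sketch. `DeviceBusy` and `retry_while_busy` are hypothetical names, and the attempt count and delay are arbitrary; the only point is that a busy error is retried while any other error propagates immediately.

```python
import time


class DeviceBusy(Exception):
    """Hypothetical exception standing in for a 'device busy' error."""


def retry_while_busy(open_session, attempts=5, delay=0.1, sleep=time.sleep):
    """Call open_session(), retrying only while it raises DeviceBusy.

    Any other exception (e.g. a configuration error) is not caught, so it
    reaches the caller on the first attempt without useless retries.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return open_session()
        except DeviceBusy:
            if attempt == attempts - 1:
                raise  # still busy after the last attempt; give up
            sleep(delay)
```

The `sleep` parameter exists only so the delay can be replaced in tests.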

hyunsik commented 1 year ago

@libc-furiosa @yw-furiosa Could you move forward with this issue? It would be great if we could include this improvement in the 0.10.0 release. It will be the last issue for this release.

libc-furiosa commented 1 year ago

@yw-furiosa Please fix the find_device_files_in function.

yw-furiosa commented 1 year ago

Yes, I am looking into it now.