senokay closed 1 year ago
@libc-furiosa @yw-furiosa Could you move forward with this issue? It would be great if we could include this improvement in the 0.10.0 release. It will be the last issue for this release.
@yw-furiosa Please fix the find_device_files_in function.
Yes I am looking into it now.
Consider the following Python fragment (but the issue itself is unrelated to Python bindings):
This will fail, as expected, when there is only a single NPU with two PEs. However, the following error message:
...is misleading, as we do have npu:1:0-1; we just cannot open them. It should read something like this instead (the actual exception type is subject to change):
Though the original message should remain when the device is outright non-existent:
The corresponding enum variant should also be added to DeviceError.
Motivation
Device busy can happen in some furiosa-runtime test cases when the runtime fails to initialize a new session for whatever reason: the session needs to open the device file, and the error might be reported before the device file has actually been closed and made available for use again.
While this particular issue could also be "fixed" by delaying the error reporting until the device file is known to be closed, that feels wrong, because the runtime may be unable to close the device file and that case would be indistinguishable from the closed case. In other words, the caller has no actual guarantee anyway (unless the runtime doesn't return, which would be absurd). So the session initialization should retry for a while instead, and it needs to distinguish device busy from other errors in order to avoid useless retries (e.g. on an environment variable error).
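To make the retry idea concrete, here is a minimal sketch in Rust. It is not the actual device API: the `DeviceError` enum below is a stand-in, and the variant names `DeviceBusy` and `DeviceNotFound` and the helper `open_with_retry` are all hypothetical. The point is only that a dedicated busy variant lets the caller retry on that one error and fail fast on everything else.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical stand-in for the device API's error type.
/// The variant names are assumptions made for this sketch.
#[derive(Debug, PartialEq)]
enum DeviceError {
    /// The requested device does not exist at all.
    DeviceNotFound(String),
    /// The device exists but cannot be opened right now.
    DeviceBusy(String),
}

/// Calls `open` until it succeeds, fails with a non-busy error,
/// or has been attempted `max_attempts` times (always at least once).
/// Only `DeviceBusy` triggers a retry; any other error is returned
/// immediately, because retrying would not help.
fn open_with_retry<T>(
    mut open: impl FnMut() -> Result<T, DeviceError>,
    max_attempts: u32,
    delay: Duration,
) -> Result<T, DeviceError> {
    let mut attempt = 1;
    loop {
        match open() {
            // Busy and attempts remain: wait, then try again.
            Err(DeviceError::DeviceBusy(_)) if attempt < max_attempts => {
                attempt += 1;
                sleep(delay);
            }
            // Success, a non-busy error, or the last busy error.
            other => return other,
        }
    }
}
```

With this shape, a session initializer could wrap its device-open call in `open_with_retry` with a small attempt budget. A not-found error (or any other non-busy error) would surface after a single attempt.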