dora-rs / dora-drives

A step-by-step tutorial that allows beginners to write their own autonomous vehicle program from scratch using a simple starter kit. Dora-drives makes learning autonomous vehicle systems faster and easier.
https://www.dora-rs.ai/docs/guides/dora-drives/
Apache License 2.0

Opening the shared memory failed, os error 24 #54

Open meua opened 1 year ago

meua commented 1 year ago

Describe the bug

frame:  (1080, 1920, 4)
img:  (1080, 1920, 3)
output:  [[ 3.6737819  3.6716187  3.6644292 ... 12.203822  12.108404  12.077164 ]
 [ 3.6688473  3.6667528  3.6598594 ... 12.227848  12.146185  12.11989  ]
 [ 3.6578336  3.655894   3.6496665 ... 12.287019  12.240098  12.226307 ]
 ...
 [53.57742   53.672173  53.898468  ... 83.08049   83.24545   83.30725  ]
 [53.411537  53.528187  53.81199   ... 83.325745  83.47475   83.5309   ]
 [53.344387  53.46946   53.775078  ... 83.41604   83.56092   83.615654 ]]
Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: Dora Runtime raised an error.

Caused by:
   0: main task failed
   1: received error event: failed to map shared memory input

      Caused by:
          Opening the shared memory failed, os error 24

      Location:
          apis/rust/node/src/event.rs:64:14

Location:
    binaries/runtime/src/lib.rs:316:34
(dora3.7) jarvis@jia:~/coding/dora_home/dora-drives$ Traceback (most recent call last):
  File "<string>", line 1, in <module>
RuntimeError: Dora Runtime raised an error.

Caused by:
   0: main task failed
   1: failed to send node output
   2: failed to allocate shared memory
   3: Creating the shared memory failed, os error 24

Location:
    apis/rust/node/src/node.rs:169:22

To Reproduce

Steps to reproduce the behavior:

  1. Start the dora daemon: dora up
  2. Start a new dataflow: dora start graphs/tutorials/webcam_single_dpt_frame.yaml --attach --hot-reload --name webcam-midas


haixuanTao commented 1 year ago

Would be great if you could share the code as well. Thanks :)

Specifically: graphs/tutorials/webcam_single_dpt_frame.yaml

meua commented 1 year ago

> Would be great if you could share the code as well. Thanks :)
>
> Specifically: graphs/tutorials/webcam_single_dpt_frame.yaml

Ok, I submitted the related PR

phil-opp commented 1 year ago

Thanks for reporting!

> Ok, I submitted the related PR

You're talking about https://github.com/dora-rs/dora-drives/pull/55, right?

Regarding the error:

Did you see any warnings in the logs? There are some situations where we unmap shared memory regions after a timeout if the receiver did not react as expected. If this happened, you should see a warning in the log output. (@haixuanTao Do we have tracing to stdout enabled for Python by default?)

Given that the shared memory allocation failed too, it is more likely that the issue is the number of open files. There is typically a per-process limit on the number of open file handles, which you can query using ulimit -n. We're currently allocating each message as a separate shared memory region (which requires a file handle), so it's easy to exhaust this limit if you have many messages in transit. To work around this, you can temporarily raise the file limit, for example by running ulimit -n 2048 (double the common default of 1024); larger values are possible too.
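
For reference, the same check can be done from inside a Python process. This is a minimal sketch, assuming a Unix system (the `resource` module is not available on Windows); the helper name `raise_open_file_limit` is hypothetical and not part of dora. It only affects the current process and its children, much like `ulimit -n` only affects the current shell.

```python
import resource
from typing import Tuple


def raise_open_file_limit(target: int) -> Tuple[int, int]:
    """Raise the soft open-file limit to at most `target`.

    Returns (old_soft, new_soft). Hypothetical helper, not a dora API.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # An unlimited soft limit cannot be raised any further.
    if soft == resource.RLIM_INFINITY:
        return soft, soft

    # Without extra privileges the soft limit may not exceed the hard limit.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)

    if new_soft <= soft:
        # Already at or above the requested value; leave it unchanged.
        return soft, soft

    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return soft, new_soft


if __name__ == "__main__":
    old, new = raise_open_file_limit(2048)
    print("soft limit before:", old, "after:", new)
```

Because the limit is inherited, the process that exhausts its file handles has to be started from a shell or parent process where the limit was already raised.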

To fix this properly, we should reduce the number of allocated shared memory regions and reuse the same region for multiple messages. I opened https://github.com/dora-rs/dora/issues/268 for that.
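
To illustrate the general idea of reusing one region for several messages, here is a small Python sketch. It is not dora's implementation or the design proposed in the linked issue; the class `SlotRegion` and its fixed-size, round-robin slots are assumptions made purely for illustration, and it needs Python 3.8+ for `multiprocessing.shared_memory`.

```python
from multiprocessing import shared_memory
from typing import Tuple


class SlotRegion:
    """One shared memory region split into fixed-size slots.

    Illustrative only: a single region (one file handle) carries many
    messages instead of one region per message.
    """

    def __init__(self, slot_size: int, n_slots: int):
        self.slot_size = slot_size
        self.n_slots = n_slots
        self.shm = shared_memory.SharedMemory(
            create=True, size=slot_size * n_slots
        )
        self._next = 0

    def write(self, payload: bytes) -> Tuple[int, int]:
        """Copy `payload` into the next slot; return (slot, length)."""
        if len(payload) > self.slot_size:
            raise ValueError("payload does not fit into one slot")
        slot = self._next
        # Round-robin reuse. A real system would have to make sure the
        # receiver is done with a slot before overwriting it.
        self._next = (self._next + 1) % self.n_slots
        start = slot * self.slot_size
        self.shm.buf[start:start + len(payload)] = payload
        return slot, len(payload)

    def read(self, slot: int, length: int) -> bytes:
        """Copy `length` bytes back out of `slot`."""
        start = slot * self.slot_size
        return bytes(self.shm.buf[start:start + length])

    def close(self) -> None:
        """Unmap the region and remove it from the system."""
        self.shm.close()
        self.shm.unlink()
```

With this layout the number of open handles stays constant no matter how many messages are in transit; the cost is that slot lifetime has to be tracked explicitly.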

haixuanTao commented 1 year ago

@phil-opp, traces go to stdout when RUST_LOG is set, e.g. with export RUST_LOG=trace. The only case where they don't is if we also activate DORA_JAEGER_TRACING.

phil-opp commented 1 year ago

Ok good. And the default log level is warn, right? Then it sounds like the file handle number is the issue.
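
The error number in the traceback points the same way. A quick Python check, assuming the "os error 24" reported by the Rust side is the plain OS errno value (on Linux, errno 24 is EMFILE, "Too many open files"):

```python
import errno
import os

code = 24  # the "os error 24" from the traceback above

# Symbolic name of the error number on this platform.
print(errno.errorcode[code])

# Human-readable message; the exact wording is platform-dependent.
print(os.strerror(code))
```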

haixuanTao commented 1 year ago

> If the environment variable is empty or not set, or if it contains only invalid directives, a default directive enabling the ERROR level is added.

The default is the same as the Tokio tracing default, which is error. We can change it to warn.

meua commented 1 year ago

The original trigger for the problem in #54 was that the byte data (a numpy array) sent via send_output is relatively large. I have since replaced the sent content following haixuanTao's suggestion, so the current code no longer triggers the problem. To reproduce it, the code in dora-drives/operators/single_dpt_op.py needs to be modified as follows:

                prediction = torch.nn.functional.interpolate(
                    prediction.unsqueeze(1),
                    size=img.shape[:2],
                    mode="bicubic",
                    align_corners=False,
                ).squeeze()

                depth_output = prediction.cpu().numpy()
                print("depth_output: ", depth_output)
                send_output("depth_frame", depth_output.tobytes(), dora_input["metadata"])

depth_output is relatively large, which makes this problem more likely to occur.
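
To put a number on "relatively large": the log above shows an image of shape (1080, 1920, 3), and the prediction is interpolated to `img.shape[:2]`. Assuming the depth map is float32 (4 bytes per value, which is not stated in the thread), the size of one message can be estimated as follows. The function name `frame_nbytes` is hypothetical.

```python
def frame_nbytes(height: int, width: int, itemsize: int = 4) -> int:
    """Size in bytes of a dense height x width array.

    itemsize=4 assumes float32 values (an assumption, see above).
    """
    return height * width * itemsize


size = frame_nbytes(1080, 1920)
print(size)                      # 8294400 bytes
print(round(size / 2**20, 2))    # 7.91 MiB
```

So each depth_frame output is roughly 8 MB, and every such message currently gets its own shared memory region.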

phil-opp commented 1 year ago

> The default is the same as Tokio tracing default which is error. We can change it to warn.

This would be a good idea in my opinion. We're using warnings in dora to log abnormal events that are not critical yet, but should still be observed by users.

phil-opp commented 1 year ago

@meua Thanks a lot for the info!

phil-opp commented 1 year ago

What's the status of this? Can we still reproduce the "failed to map shared memory input" error with the latest version?

meua commented 1 year ago

> What's the status of this? Can we still reproduce the "failed to map shared memory input" error with the latest version?

I don't have time to test it now; I will verify it later when I have a chance.