a) Could you move to the upcoming pypylon 4.0 (available as 4.0.0rc2 on PyPI)? That way we can check against the latest fixes: we fixed part of the zero-copy handling, and it links against the latest pylon GigE Vision drivers.
b) Could you run your application using heaptrack?
```
apt-get install heaptrack heaptrack_gui
```
This will allow you to generate a fine-grained trace of the main allocators that leak in your example.
You set GevSCPD to 0; this is the delay time between GigE Vision packets.
As your camera is able to send packets with the minimum inter-packet gap of 96 ns, your receiver NIC and Linux kernel stack might be overrun.
The typical procedure to set this value is described here: https://docs.baslerweb.com/knowledge/how-to-troubleshoot-lost-packets-or-frames-while-using-gige-cameras
The idea is to increase the spacing between network packets (and thus decrease the pressure on your hardware and software stack) up to the point where the virtual frame rate of your line scan camera starts to decrease.
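For reference, a minimal pypylon sketch of what adjusting this parameter can look like (the value below is just a placeholder to tune per the linked guide; the exact parameter-access syntax can differ slightly between pypylon versions):

```python
from pypylon import pylon

# Sketch only: open the first camera found; adapt device selection to your setup.
camera = pylon.InstantCamera(pylon.TlFactory.GetInstance().CreateFirstDevice())
camera.Open()

# GevSCPD is given in ticks of the camera's internal clock, so the real-time
# value per tick depends on the camera model. Increase it step by step until
# the resulting frame rate just starts to drop, then back off slightly.
camera.GevSCPD.Value = 5000  # placeholder value, not a recommendation
# (on older pypylon versions: camera.GevSCPD.SetValue(5000))

print("Inter-packet delay (ticks):", camera.GevSCPD.Value)
camera.Close()
```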
I see that you set a very large number of buffers in your application. In a GigE Vision use case there is always a small packet queue to the hardware, and from there the data is copied into your application buffers. Adding more buffers in a continuous acquisition use case is only needed if your image processing is sporadically slower than the incoming image rate and you have to be sure that there are always buffers ready to be filled from the hardware packet queue.
Hi @thiesmoeller, thanks for the reply and explanation. The increased number of buffers is because we were experiencing corrupt lines from time to time in pypylon 1.9 (there were sections of the image that were completely overwritten by the data of another image).
The following error message in pypylon 1.9 made us increase all these buffers.
> 3774873620: The buffer was incompletely grabbed. This can be caused by performance problems of the network hardware used, i.e., the network adapter, switch, or Ethernet cable. Buffer underruns can also cause image loss. To fix this, use the pylonGigEConfigurator tool to optimize your setup and use more buffers for grabbing in your application to prevent buffer underruns.
I must admit that I do not fully understand how all the buffer handling, loading and offloading works behind the scenes. I once found a document with a high-level description of the internals, but I cannot find it again. If you know where to find it, please share :)
We upgraded to pypylon 3.0.1 and also installed the pylon distribution. The error message changed a little, pointing exactly to the GevSCPD parameter you already mentioned in your reply :muscle: I have had to change the parameter to 125000 ticks, which is the equivalent of 1 ms for our camera, to get things to work in the full application (which runs in a k3s pod). The value is rather high, but it seems to work..
> The buffer was incompletely grabbed. This can be caused by performance problems of the network hardware used, i.e. network adapter, switch, or ethernet cable. To fix this, try increasing the camera's Inter-Packet Delay in the Transport Layer category
Last but not least, I finally found the time to do a quick test. I will e-mail the heaptrack file (since I cannot upload it to GitHub).
For your target k3s environment: Which CNI do you use and in which configuration?
Buffer management is explained here https://docs.baslerweb.com/pylonapi/cpp/pylon_programmingguide#the-default-grab-strategy-one-by-one
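Roughly, the OneByOne flow described there looks like this sketch (illustration only, not code copied from the guide; buffer count and timeout are arbitrary values):

```python
from pypylon import pylon

camera = pylon.InstantCamera(pylon.TlFactory.GetInstance().CreateFirstDevice())
camera.Open()

# A modest buffer queue; extra buffers only help when processing is
# sporadically slower than the incoming image rate.
camera.MaxNumBuffer.Value = 16

# OneByOne is the default grab strategy; grab 100 images and stop.
camera.StartGrabbingMax(100)
try:
    while camera.IsGrabbing():
        # The grab engine fills buffers from the packet queue; RetrieveResult
        # hands the oldest filled buffer to the application.
        result = camera.RetrieveResult(5000, pylon.TimeoutHandling_ThrowException)
        if result.GrabSucceeded():
            frame = result.Array.copy()  # copy if the data is needed past Release
            # ... process frame ...
        result.Release()  # return the buffer to the grab engine for reuse
finally:
    camera.StopGrabbing()
    camera.Close()
```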
Thanks for the link! I hope you have successfully received the file? We have an open support ticket (ref 74135710; I took the liberty of already including you in the mail trace) with more details, but currently we have the following:
We have tried different k3s, flannel, multus and macvlan versions and different CNI spec versions, but without success.
Eventually, the combination below (which does not necessarily mean that we use the latest versions) + pypylon 1.9.0 + GevSCPD 125000 seems to be the most stable combo so far.
```
mic-733ao@ubuntu:~$ /var/lib/rancher/k3s/data/current/bin/multus --version
multus: version:v4.0.2(clean,released), commit:f03765681fe81ee1e0633ee1734bf48ab3bccf2b, date:2023-05-25T13:40:20+00:00
mic-733ao@ubuntu:~$ /var/lib/rancher/k3s/data/current/bin/flannel
CNI Plugin flannel version v0.18.1 (linux/arm64) commit 990ba0e88c90f8ed8b50e0ccd375937b841b176e built on 2022-07-19T01:08:03Z
mic-733ao@ubuntu:~$ /var/lib/rancher/k3s/data/current/bin/macvlan
CNI macvlan plugin v1.5.1
CNI protocol versions supported: 0.1.0, 0.2.0, 0.3.0, 0.3.1, 0.4.0, 1.0.0
mic-733ao@ubuntu:~$ k3s --version
k3s version v1.24.3+k3s1 (990ba0e8)
go version go1.18.1
```
```
root@ubuntu:/home/mic-733ao# kubectl -n kube-system get network-attachment-definition macvlan-local-link -o yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  annotations:
    objectset.rio.cattle.io/applied: H4sIAAAAAAAA/4yRTYvbQAyG/4rR2XbtfJjY0EOgt0KPPfkiy5pk6rHGeFSHEPLfy9TZZVlI2OO8D3r0irkBTvY3z8F6gQaGQ8hJbE5CJrf+21JCCoOVHhr4xXrx83BURTqPLPqDjRWrcTKFkRV7VITmBijiFSMI8em7P0waWPPZ+pxQ1XGU22iF9Cn3F+E5Oy1DLLYNH8hSpslPK/33Y9//3/5aITgyNDAiLQ4lE9YM36/40nSYkKJi+NtxFq5BeYR7Cg47di9vPGM4QwO1MUXRmZq7/WZfmaIiQ1WPSFsy1bas8dCXvKt3UfqprfOELnNWBljZky5hYopNyIuxJ2jglrRA8va5LTRJC0W+y4sW0qQFvU68ho9NazxiUJ5X4FA2jxTu938BAAD//7Okz+grAgAA
    objectset.rio.cattle.io/id: ""
    objectset.rio.cattle.io/owner-gvk: k3s.cattle.io/v1, Kind=Addon
    objectset.rio.cattle.io/owner-name: macvlan-net-attachment
    objectset.rio.cattle.io/owner-namespace: kube-system
  creationTimestamp: "2024-07-02T16:48:41Z"
  generation: 7
  labels:
    objectset.rio.cattle.io/hash: 9ff00bf9eb5256f06cfc6daac3cf6319a8d1e494
  name: macvlan-local-link
  namespace: kube-system
  resourceVersion: "61489"
  uid: f4d81b2b-cf3b-4f47-846a-71661c6824de
spec:
  config: '{ "cniVersion": "0.4.0", "type": "macvlan", "master": "lan2", "mode": "bridge",
    "mtu": 9000, "ipam": { "type": "host-local", "ranges": [ [ { "subnet": "169.254.0.0/16",
    "rangeStart": "169.254.184.200", "rangeEnd": "169.254.184.220" } ] ] } }'
```
Because we are nearing the project deadline, I have little room to keep changing and testing things. Nevertheless, I would still like to see how we could improve in the future, so I am all ears :)
edit: add more version info
There is no significant memory leak in your trace. The only leaks are in Python itself (maybe a false positive).
In your code you allocate 10000 buffers of 420x800 px Mono8 and end up with 3.4 GB of memory.
I assume that this is what you run into when your docker/k3s environment fails.
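For reference, the back-of-the-envelope arithmetic behind that figure (assuming one byte per pixel for Mono8 and ignoring per-buffer overhead):

```python
num_buffers = 10_000
bytes_per_buffer = 420 * 800              # Mono8: one byte per pixel -> 336,000 bytes
total_bytes = num_buffers * bytes_per_buffer
print(f"{total_bytes / 1e9:.2f} GB")      # ~3.36 GB just for the grab buffers
```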
Haha ok, that's pretty embarrassing :sweat_smile: I just did a test without the `set_buffer_sizes` call and can confirm.
We are still looking into the networking part, because an inter-packet delay of 1 ms still seems pretty high..
Using macvlan as the CNI in your case should give you the right performance. It is important to use a CNI that routes the packets into the pod as early as possible. Take care to increase the Rx memory size, as you did.
Explanation of why it was not working with packet delay 0 and chunks: the image data was "naturally" throttled by the line rate of your sensor. The chunks are sent from camera buffers at full speed at the end of the transfer and are too close back-to-back for your host.
All right. Thanks for the explanation.
Very much appreciated.
I think we can close this issue, since it was a false alarm.
Describe the issue:
I am currently facing several issues in one of my projects. A potential memory leak is one of them. In an attempt to isolate the problem, I have tried to extract an MWE from the application code.
The real application runs in a k3s pod on a Jetson device. However, for debugging purposes I am currently moving from bare metal (conda env) --> docker env --> k3s env. When running in a docker container, the container crashes with an OOM. After closer inspection, I have noticed a similar memory increase when running on bare metal too. But since that system has 32 GB of memory, the build-up went unnoticed at first.
The following docker run flags are used:

- `--network=host`: to make sure we can reach the camera
- `--memory=300m`: to limit the memory; the real application runs in k3s, where memory limits should be respected too.

I've tested this with a few docker base images, always with the same result:

- `python:3.8.16-slim` featuring Python 3.8
- `nvcr.io/nvidia/l4t-base:r36.2.0` featuring Python 3.10

When running inside the docker container, I can see a steady memory build-up happening using `docker stats`, until the docker container eventually crashes due to an OOM after a few minutes.

The camera (a `ral4096-24gm` Linescan camera) is currently producing 800x420 mono images at a rate of ~4 FPS (line rate ~3003.0 Hz).

Reproduce the code example:
Is your camera operational in Basler pylon viewer on your platform
Yes
Hardware setup & camera model(s) used
compute device:
camera:
Runtime information:
pfs file: