Processing FPS is much lower than the FPS received from the camera

divdaisymuffin commented 2 years ago

Hi @nnshah1 and @xwu2git ,

We observed that the live camera streams at 20 FPS, but the processed FPS drops to 10 to 12 FPS. We mostly see this when 4 pods are running on the machine. Is this a known behavior, and how can we improve the processed FPS? Note: each analytics pod handles one sensor. We see this with almost all pipelines; I am sharing one as a sample.

{ "name": "object_detection", "version": 2, "type": "GStreamer", "template":"rtspsrc udp-buffer-size=212992 name=source ! queue ! rtph264depay ! h264parse ! video/x-h264 ! tee name=t ! queue ! decodebin ! videoconvert name=\"videoconvert\" ! video/x-raw,format=BGRx ! queue leaky=upstream ! gvadetect ie-config=CPU_BIND_THREAD=NO model=\"{models[person-detection-retail-0013][1][network]}\" model-proc=\"{models[person-detection-retail-0013][1][proc]}\" name=\"detection\" threshold=0.40 ! gvametaconvert name=\"metaconvert\" ! queue ! gvapython name=\"StaffEngagement\" module=\"custom_transforms/staff_engagement2\" class=\"StaffEngagement\" ! gvametapublish name=\"destination\" ! tee name = tt ! queue ! gvawatermark ! videoconvert ! jpegenc ! gvapython name=\"capture\" module=\"custom_transforms/staff_engagement2\" class=\"Capture\" ! queue ! appsink name=appsink t. ! queue ! splitmuxsink max-size-time=900000000000 name=\"splitmuxsink\"", "description": "Object Detection Pipeline", "parameters": { "type" : "object", "properties" : { "inference-interval": { "element":"detection", "type": "integer", "minimum": 0, "maximum": 4294967295 }, "cpu-throughput-streams": { "element":"detection", "type": "string" }, "n-threads": { "element":"videoconvert", "type": "integer" }, "nireq": { "element":"detection", "type": "integer", "minimum": 1, "maximum": 64 }, "recording_prefix": { "type":"string", "default":"recording" } } } }

Thanks

nnshah1 commented 2 years ago

@divdaisymuffin can you share details on the hardware you are using?

Is it correct that with one pod you get 20 FPS, but as you scale to multiple pods it drops to 10 to 12 FPS?

divdaisymuffin commented 2 years ago

@nnshah1 Please find the hardware specs for the machines we are using image (3)

We are observing it with a single pod as well.

nnshah1 commented 2 years ago

@divdaisymuffin Would you be able to check the output of `gst-inspect-1.0 | grep vaapi` on your targets, using https://hub.docker.com/r/openvino/ubuntu20_data_runtime ?

That will help us determine if hw accelerated decode / encode can be an option or not.

divdaisymuffin commented 2 years ago

@nnshah1 This is what we are getting: intel1

divdaisymuffin commented 2 years ago

@nnshah1 I want to share one observation. You were right that the processed FPS is affected by CPU usage. When I run 4 pods on an i7 CPU with the yolov3 model, I get processed rates of around 1, 2, or 5 FPS, while each camera streams at 25 FPS. As soon as I removed 2 pods, the processed rate increased to 12 FPS. CPU usage with 4 pods was 89.83%. Similarly, when I run a heavy model in a single pod and CPU usage reaches 80%, I see 6 FPS processed against 25 FPS received. I will share a complete table soon.

divdaisymuffin commented 2 years ago

@nnshah1 As you suggested, I tried running on GPU. Please find the files below.

Dockerfile

```dockerfile
# smtc_analytics_common_xeon_gst

FROM centos:7 as build

ARG VA_SERVING_REPO=https://raw.githubusercontent.com/intel/video-analytics-serving
ARG VA_SERVING_TAG="v0.3.0-alpha"

RUN mkdir -p /home/vaserving/common/utils && \
    touch /home/vaserving/__init__.py /home/vaserving/common/__init__.py /home/vaserving/common/utils/__init__.py && \
    for x in common/utils/logging.py common/settings.py arguments.py ffmpeg_pipeline.py gstreamer_pipeline.py model_manager.py pipeline.py pipeline_manager.py schema.py vaserving.py; do curl -sSf -o /home/vaserving/$x -L ${VA_SERVING_REPO}/${VA_SERVING_TAG}/vaserving/$x; done
COPY *.py /home/

FROM openvisualcloud/xeone3-ubuntu1804-analytics-gst:20.10

RUN apt-get update -qq && apt-get install -qq python3-gst-1.0 python3-jsonschema python3-psutil && rm -rf /var/lib/apt/lists/*

COPY --from=build /home/ /home/
ENV FRAMEWORK=gstreamer
ENV PYTHONIOENCODING=UTF-8

ARG USER=docker
ARG GROUP=docker
ARG UID
ARG GID

# must use ; here to ignore user exist status code
RUN [ ${GID} -gt 0 ] && groupadd -f -g ${GID} ${GROUP}; \
    [ ${UID} -gt 0 ] && useradd -d /home -M -g ${GID} -K UID_MAX=${UID} -K UID_MIN=${UID} ${USER}; \
    chown -R ${UID}:${GID} /home
```

Pipeline.json { "name": "object_detection", "version": 2, "type": "GStreamer", "template":"rtspsrc udp-buffer-size=212992 name=source ! queue ! rtph264depay ! h264parse ! video/x-h264 ! tee name=t ! queue ! decodebin ! videoconvert name=\"videoconvert\" ! video/x-raw(memory:VASurface) ! vaapipostproc brightness=0.0001 ! queue leaky=upstream ! gvadetect device=GPU pre-process-backend=vaapi model=\"{models[face_detection_adas][1][network]}\" model-proc=\"{models[face_detection_adas][1][proc]}\" name=\"detection\" threshold=0.10 ! gvaclassify model=\"{models[age-gender-recognition-retail-0013][1][network]}\" model-proc=\"{models[age-gender-recognition-retail-0013][1][proc]}\" name=\"recognition\" model-instance-id=recognition ! gvametaconvert name=\"metaconvert\" ! queue ! gvapython name=\"QueueCounting\" module=\"custom_transforms/final_count.py\" class=\"QueueCounting\" ! gvametapublish name=\"destination\" ! appsink name=appsink t. ! splitmuxsink max-size-time=60000000000 name=\"splitmuxsink\"", "description": "Object Detection Pipeline", "parameters": { "type" : "object", "properties" : { "inference-interval": { "element":"detection", "type": "integer", "minimum": 0, "maximum": 4294967295 }, "cpu-throughput-streams": { "element":"detection", "type": "string" }, "n-threads": { "element":"videoconvert", "type": "integer" }, "nireq": { "element":"detection", "type": "integer", "minimum": 1, "maximum": 64 }, "device": { "element": "detection", "default": "GPU", "type": "string" }, "recording_prefix": { "type":"string", "default":"recording" } } } }

Yaml

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: traffic-office1-analytics-traffic
  labels:
    app: traffic-office1-analytics-traffic
spec:
  replicas: 1
  selector:
    matchLabels:
      app: traffic-office1-analytics-traffic
  template:
    metadata:
      labels:
        app: traffic-office1-analytics-traffic
    spec:
      enableServiceLinks: false
      hostNetwork: true
      dnsPolicy: ClusterFirstWithHostNet
      containers:
        - name: traffic-office1-analytics-traffic
          image: smtc_analytics_object_xeon_gst:latest
          imagePullPolicy: IfNotPresent
          env:
            - name: OFFICE
              value: "45.539626,-122.929569"
            - name: DBHOST
              value: "http://db-service:9200"
            - name: MQTTHOST
              value: "traffic-office1-mqtt-service"
            - name: STHOST
              value: "http://traffic-office1-storage-service:8080/api/upload"
            - name: MQTT_TOPIC
              value: "analytics"
            - name: EVERY_NTH_FRAME
              value: "1"
            - name: SCENARIO
              value: "traffic"
            - name: NETWORK_PREFERENCE
              value: "{\"GPU\":\"INT8,FP32\"}"
            - name: GST_DEBUG
              value: "3"
            - name: NO_PROXY
              value: "*"
            - name: no_proxy
              value: "*"
          volumeMounts:
            - mountPath: /etc/localtime
              name: timezone
              readOnly: true
            - mountPath: /tmp/rec
              name: recording
      initContainers:
        - image: busybox:latest
          imagePullPolicy: IfNotPresent
          name: init
          command: ["/bin/chown","0:0","/tmp/rec"]
          volumeMounts:
            - mountPath: /tmp/rec
              name: recording
      volumes:
        - name: timezone
          hostPath:
            path: /etc/localtime
            type: File
        - name: recording
          emptyDir:
            medium: Memory
            sizeLimit: 150Mi
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: "vcac-zone"
                    operator: NotIn
                    values:
                      - "yes"
```

PROBLEM: analytics.yaml.m4 takes NETWORK_PREFERENCE from build.sh, I think, so when I run "cmake" and then "make" again, the generated analytics.yaml goes back to "CPU".

I am also sharing the analytics logs, where you can see "NETWORK_PREFERENCE==CPU" and an error saying there is no element "vaapipostproc". issue2 issue1

Please suggest.
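One way to confirm which device preference a running pod actually received is to decode `NETWORK_PREFERENCE` in the form the Deployment above uses, a JSON object mapping a device to a comma-separated precision list. The sketch below is a hypothetical helper, not code from Smart-City-Sample; the function name and the CPU fallback for a missing variable are assumptions made for illustration.

```python
import json
import os


def parse_network_preference(raw):
    """Return {device: [precision, ...]} from a NETWORK_PREFERENCE string.

    Hypothetical helper: assumes the value is a JSON object such as
    {"GPU": "INT8,FP32"}, as in the Deployment above.
    """
    prefs = json.loads(raw)
    return {
        device: [p.strip() for p in precisions.split(",") if p.strip()]
        for device, precisions in prefs.items()
    }


if __name__ == "__main__":
    # Assumed fallback: treat a missing variable as CPU with FP32.
    raw = os.environ.get("NETWORK_PREFERENCE", '{"CPU": "FP32"}')
    print(parse_network_preference(raw))
```

Running this inside the analytics container after a rebuild would show whether the regenerated analytics.yaml still carries the CPU value.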

divdaisymuffin commented 2 years ago

@xwu2git @nnshah1 I have seen one more thing, which is related to this issue: https://github.com/OpenVisualCloud/Dockerfiles/issues/662. The xeone3-ubuntu1804-analytics-gst:20.10 image does not support the Comet Lake GPU.

grep "model name" /proc/cpuinfo | head -1
model name : Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz
docker run -it --device /dev/dri --entrypoint /bin/bash openvisualcloud/xeone3-ubuntu1804-analytics-gst:20.10 -c "clinfo -l"

It does not give any output, but when I ran it on another machine:

grep "model name" /proc/cpuinfo | head -1
model name : Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
docker run -it --device /dev/dri --entrypoint /bin/bash openvisualcloud/xeone3-ubuntu1804-analytics-gst:20.10 -c "clinfo -l"
Platform #0: Intel(R) OpenCL HD Graphics
-- Device #0: Intel(R) Gen9 HD Graphics NEO

It gives this output, which shows that it is not working on Comet Lake.
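The `clinfo -l` comparison above can be scripted so the same check runs on every target. This is a minimal sketch, assuming `clinfo` is installed in the image and prints `Device #N: <name>` entries as in the output quoted above; the function names are hypothetical.

```python
import re
import subprocess


def parse_clinfo_list(text):
    """Return the device names found in `clinfo -l` style output."""
    # Matches "Device #0: <name>" anywhere in the text, up to end of line.
    return [m.strip() for m in re.findall(r"Device #\d+:\s*([^\n]+)", text)]


def has_opencl_device():
    """Run `clinfo -l` and report whether at least one device is listed."""
    try:
        result = subprocess.run(
            ["clinfo", "-l"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        # clinfo is not installed, so nothing can be reported.
        return False
    return len(parse_clinfo_list(result.stdout)) > 0
```

An empty list corresponds to the Comet Lake case described above, where the command printed nothing.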

So I then switched to the Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz machine, but the "no element vaapipostproc" error is still there.

I have also set privileged=True in analytics.yaml.m4 as you suggested, but it did not work.

Please help to resolve.

nnshah1 commented 2 years ago

To support Comet Lake via OpenCL you will need an updated driver. I've created a branch here to show the necessary modifications to the Dockerfile:

https://github.com/nnshah1/Smart-City-Sample/commit/1df8ef5de3da69970048063a54fc96a8c441add4

For vaapipostproc: is this issue seen when running the container interactively as above, or only when running via Kubernetes?

nnshah1 commented 2 years ago

On a Comet Lake system I recommend the following to achieve 4 streams at 30 FPS:

Key things: use the FP16-INT8 model (for person-detection-retail-0013), and use CPU_THREADS_NUM and cpu-throughput-streams to limit the number of threads used per process. Use vaapi for decode and MULTI:CPU,GPU for the device.

gst-launch-1.0 filesrc location=/home/pipeline-zoo/workspace/od-h264-people/workloads/2m.264/disk/input/stream.h264 ! video/x-h264 ! h264parse ! video/x-h264 ! vaapih264dec name=decode0 ! vaapipostproc ! video/x-raw(memory:VASurface) ! gvadetect device=MULTI:CPU,GPU pre-process-backend=vaapi gpu-throughput-streams=1 cpu-throughput-streams=1 ie-config=CPU_THREADS_NUM=1,CPU_BIND_THREAD=NO nireq=4 name=detect0 model=/home/pipeline-zoo/workspace/od-h264-people/models/person-detection-retail-0013/FP16-INT8/person-detection-retail-0013.xml ! gvametaconvert add-empty-results=true ! gvametapublish method=file file-format=json-lines file-path=/tmp/216db4ac-52b1-11ec-a9a6-1c697aa336da/output ! gvafpscounter ! fakesink async=false name=sink0
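To keep these settings consistent across several pipeline templates, the `gvadetect` element string can be generated rather than typed by hand. The following is a sketch only: `gvadetect_element` is a hypothetical helper, the model path is a placeholder, and the property values are the ones recommended in the command above.

```python
def gvadetect_element(model, **props):
    """Render a gvadetect element description for a gst-launch style pipeline.

    Keyword names use underscores and are converted to the hyphenated
    property names that GStreamer expects.
    """
    parts = ["gvadetect", f"model={model}"]
    parts += [f"{name.replace('_', '-')}={value}" for name, value in props.items()]
    return " ".join(parts)


# Placeholder model path; substitute the FP16-INT8 model on your system.
element = gvadetect_element(
    "/path/to/FP16-INT8/person-detection-retail-0013.xml",
    device="MULTI:CPU,GPU",
    pre_process_backend="vaapi",
    gpu_throughput_streams=1,
    cpu_throughput_streams=1,
    ie_config="CPU_THREADS_NUM=1,CPU_BIND_THREAD=NO",
    nireq=4,
)
print(element)
```

Keeping the thread-related values in one place makes it easier to apply the same per-process limits to every pod.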

tthakkal commented 2 years ago

@divdaisymuffin Please check the changes here that are required to enable GPU. The steps below verify that it works.

Build and Run Smart-City-Sample from Neelay's fork.

Pull and build from Neelay's fork, https://github.com/nnshah1/Smart-City-Sample/tree/updated_opencl_driver, and remember to check out the branch updated_opencl_driver.

git clone https://github.com/nnshah1/Smart-City-Sample.git
cd Smart-City-Sample
git checkout updated_opencl_driver
mkdir build
cd build
cmake ..
make
make start_kubernetes

Exec into `analytics-traffic` container

Find and exec into the analytics-traffic container, replacing k8s_traffic-office1-analytics-traffic_traffic-office1-analytics-traffic-6f9497b9c4-g52lk_default_5e9009d2-486a-4114-8b54-745c31395c78_0 with your container name.

sudo docker exec -it k8s_traffic-office1-analytics-traffic_traffic-office1-analytics-traffic-6f9497b9c4-g52lk_default_5e9009d2-486a-4114-8b54-745c31395c78_0 /bin/bash

Inspect `vaapipostproc`

gst-inspect-1.0 vaapipostproc

Test `vaapipostproc` and `device=MULTI:CPU,GPU` by running the pipeline below inside the container

gst-launch-1.0 urisourcebin \
uri=https://github.com/intel-iot-devkit/sample-videos/blob/master/person-bicycle-car-detection.mp4?raw=true \
! decodebin ! vaapipostproc ! "video/x-raw(memory:VASurface)" \
! gvadetect device=MULTI:CPU,GPU pre-process-backend=vaapi gpu-throughput-streams=1 \
cpu-throughput-streams=1 ie-config=CPU_THREADS_NUM=1,CPU_BIND_THREAD=NO nireq=4 name=detect0 \
model=/home/models/person_detection_2020R2/1/FP16/person-detection-retail-0013.xml \
model-proc=/home/models/person_detection_2020R2/1/person-detection-retail-0013.json \
! gvametaconvert add-empty-results=true ! gvametapublish method=file file-format=json-lines file-path=/tmp/output \
! gvafpscounter ! fakesink async=false name=sink0

divdaisymuffin commented 2 years ago

@tthakkal @nnshah1 Thanks for the support. We are able to run the pipeline above inside a Docker or Kubernetes pod container, and it uses the GPU successfully. However, when we try to run the pipeline defined in Xeon/gst/pipeline/2/pipeline.json, it fails with syntax errors or unsupported-element errors.

I am sharing my pipeline with you. Please help me correct it and run it in Smart-City-Sample.

{ "name": "ppl-density-det", "version": 2, "type": "GStreamer", "template":"rtspsrc udp-buffer-size=212992 name=source ! queue ! rtph264depay ! h264parse ! video/x-h264 ! tee name=t ! queue ! decodebin ! videoconvert name=\"videoconvert\" ! vaapipostproc ! video/x-raw(memory:VASurface) ! queue leaky=upstream ! gvadetect device=MULTI:CPU,GPU pre-process-backend=vaapi gpu-throughput-streams=1 cpu-throughput-streams=1 ie-config=CPU_THREADS_NUM=1,CPU_BIND_THREAD=NO model=\"{models[head_yolov4_tiny_608to416_default_anchors_mask_012_heatmap_INT8][1][network]}\" model-proc=\"{models[head_yolov4_tiny_608to416_default_anchors_mask_012_heatmap_INT8][1][proc]}\" name=\"detection\" threshold=0.40 ! gvametaconvert name=\"metaconvert\" ! queue ! gvametapublish name=\"destination\" ! gvafpscounter ! appsink name=appsink t. ! splitmuxsink max-size-time=300000000000 name=\"splitmuxsink\"", "description": "ppl-density-det Pipeline", "parameters": { "type" : "object", "properties" : { "inference-interval": { "element":"detection", "type": "integer", "minimum": 0, "maximum": 4294967295 }, "cpu-throughput-streams": { "element":"detection", "type": "string" }, "n-threads": { "element":"videoconvert", "type": "integer" }, "nireq": { "element":"detection", "type": "integer", "minimum": 1, "maximum": 64 }, "recording_prefix": { "type":"string", "default":"recording" } } } }

Please find attached logs as well.

tthakkal commented 2 years ago

@divdaisymuffin Replace model=\"{models[head_yolov4_tiny_608to416_default_anchors_mask_012_heatmap_INT8][1][network]}\" with model=\"{models[head_yolov4_tiny_608to416_default_anchors_mask_012_heatmap_INT8][1][FP16][network]}\", assuming you have the model in an FP16 directory. I have tested with FP16 precision; if you want to try a different precision, you can change that to INT8 or FP32 and see whether it works and/or improves performance.
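The effect of the extra `[FP16]` key can be illustrated with Python's built-in `str.format`, which accepts the same `{models[name][1][network]}` indexing syntax. Whether the pipeline server resolves templates exactly this way is an assumption here, and the `models` dictionary, model name, and paths below are hypothetical.

```python
# Hypothetical model registry: name -> version -> precision -> files.
# Layout assumed for illustration: the network sits under a precision
# directory, so the precision key is needed to reach it.
models = {
    "my-model": {
        1: {
            "FP16": {"network": "/home/models/my-model/1/FP16/my-model.xml"},
            "proc": "/home/models/my-model/1/my-model.json",
        }
    }
}

old = "model={models[my-model][1][network]}"
new = "model={models[my-model][1][FP16][network]}"


def resolve(template):
    """Fill a template from the registry, reporting a missing key instead of raising."""
    try:
        return template.format(models=models)
    except KeyError as exc:
        return f"unresolved key: {exc}"


print(resolve(old))  # the lookup stops at the missing 'network' key
print(resolve(new))  # resolves to the FP16 network path
```

Under this assumed layout the original placeholder has no `network` entry directly below the version, which matches the need to name the precision explicitly.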

divdaisymuffin commented 2 years ago

@tthakkal @nnshah1 Yes, it is finally working and using the GPU, but to our surprise it is using the CPU as well.

neelay

The above image was captured using:

kubectl apply -f https://raw.githubusercontent.com/pythianarora/total-practice/master/sample-kubernetes-code/metrics-server.yaml
kubectl top po

Can't we restrict the use of CPU?

tthakkal commented 2 years ago

CPU_THREADS_NUM=1 and cpu-throughput-streams=1 should restrict CPU use by gvadetect; the CPU usage you are seeing might come from other elements or processes. You can verify this by removing those values and checking whether CPU usage goes up from the current number.

I see that decodebin and videoconvert aren't really needed in your pipeline. You can remove ! decodebin ! videoconvert name=\"videoconvert\", which should help reduce CPU usage.

By the way, do you see 4 streams at a better FPS now?

divdaisymuffin commented 2 years ago

@tthakkal @nnshah1 Yes, CPU_THREADS_NUM=1 and the removal of videoconvert helped reduce CPU utilization. I also tried without the GPU, and I am seeing an improvement there too. However, we are not able to remove decodebin, as it gives an error, and I would like to understand why you said that "decodebin and videoconvert aren't really needed in your pipeline".

And yes, we are now getting better FPS with these suggestions, even without using the GPU.

tthakkal commented 2 years ago

@divdaisymuffin I am sorry, that was my mistake: decodebin or avdec_h264 is required for decoding. videoconvert isn't required because vaapipostproc does the required conversion to video/x-raw(memory:VASurface).

divdaisymuffin commented 2 years ago

@tthakkal What if we remove videoconvert from the CPU pipeline as well? We have done this, and CPU utilization decreased by 50%, although we observed a small decrease in the model's detection accuracy, only 1 to 2%. I am sharing my CPU pipeline without videoconvert and with CPU_THREADS_NUM=1 added; let me know if this is not recommended.

"template":"rtspsrc udp-buffer-size=212992 name=source ! queue ! rtph264depay ! h264parse ! video/x-h264 ! tee name=t ! queue ! decodebin ! queue leaky=upstream ! gvadetect model=\"{models[head_yolov4_tiny_608to416_default_anchors_mask_012_heatmap_INT8][1][network]}\" model-proc=\"{models[head_yolov4_tiny_608to416_default_anchors_mask_012_heatmap_INT8][1][proc]}\" name=\"detection\" ie-config=CPU_THREADS_NUM=1 threshold=0.40 ! gvametaconvert name=\"metaconvert\" ! queue ! gvapython name=\"new_wait\" module=\"custom_transforms/new_wait\" class=\"WaitTime\" ! gvametapublish name=\"destination\" ! appsink name=appsink t. ! queue ! splitmuxsink max-size-time=300000000000 name=\"splitmuxsink\"",

tthakkal commented 2 years ago

@divdaisymuffin That should work without any issues. videoconvert is only required when decodebin doesn't provide a caps format that gvadetect supports.
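For anyone applying the same change to several templates, removing an element can be sketched as a small string transformation. This is a naive illustration, not part of the sample: `drop_element` is a hypothetical helper that splits on `!` and would not handle a `!` inside a quoted property value.

```python
def drop_element(template, element_name):
    """Remove every element whose factory name is element_name.

    Naive sketch: splits the pipeline description on "!" and compares the
    first token of each segment, so caps filters and other elements are kept.
    """
    kept = []
    for segment in template.split("!"):
        tokens = segment.split()
        if tokens and tokens[0] == element_name:
            continue  # skip the unwanted element and its properties
        kept.append(segment.strip())
    return " ! ".join(kept)


before = (
    'decodebin ! videoconvert name="videoconvert" ! vaapipostproc '
    "! video/x-raw(memory:VASurface) ! queue leaky=upstream"
)
print(drop_element(before, "videoconvert"))
```

Note that the pipeline JSON files in this thread declare an `n-threads` parameter bound to the `videoconvert` element; once that element is removed, the parameter would likely have nothing to bind to and may need to be dropped from the `parameters` section as well.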

divdaisymuffin commented 2 years ago

@tthakkal Thanks for the clarity.

nnshah1 commented 2 years ago

@divdaisymuffin Please confirm the pipelines are now getting the correct density and utilization on your target hardware. If so we'll close this issue and can open others as needed.

divdaisymuffin commented 2 years ago

@nnshah1 Yes, it's working well; we can close this.

OpenVisualCloud / Smart-City-Sample