eclipse / kuksa.val

DBC Feeder and HVAC Service hungry for CPU resources #508

Open mikehaller opened 1 year ago

mikehaller commented 1 year ago

Running the containerized versions of the example applications DBC Feeder and HVAC Service takes up a lot of CPU resources continuously.

For the DBC Feeder, this should definitely be investigated and optimized. Of course, running it in a loop uses some resources, but it seems to be running at 100% CPU all the time, which is way too much.

The HVAC Service is also taking up a lot of CPU resources. It's not clear why it uses CPU at all, since it's just waiting for requests or signals and doesn't do anything on its own.

top snippet:

 1137 root      20   0   94252  74292  15024 R  98.7   3.7   0:26.87 dbcfeed+
 1119 root      20   0  552504  27364  12224 S  61.8   1.4   0:18.92 hvacser+

The containers have been installed on Eclipse Kanto Container Management like this:

kanto-cm create --name databroker \
    --ports=30555:55555/tcp \
    ghcr.io/eclipse/kuksa.val/databroker:0.3.0
kanto-cm start --name databroker

kanto-cm create --name feedercan \
    --e=VEHICLEDATABROKER_DAPR_APP_ID=databroker \
    --e=VDB_ADDRESS=databroker-host:30555 \
    --e=USECASE=databroker \
    --e=LOG_LEVEL=info \
    --e=databroker=info \
    --e=broker_client=info \
    --e=dbcfeeder=info \
    --hosts="databroker-host:host_ip" \
    ghcr.io/eclipse/kuksa.val.feeders/dbc2val:v0.1.1
kanto-cm start --name feedercan

kanto-cm create --name hvac \
    --e=VEHICLEDATABROKER_DAPR_APP_ID=databroker \
    --e=VDB_ADDRESS=databroker-host:30555 \
    --hosts="databroker-host:host_ip" \
    ghcr.io/eclipse/kuksa.val.services/hvac_service:v0.1.0
kanto-cm start --name hvac

The runtime logs of the containers do not show any warnings or errors; everything seems to be running as usual:

/var # tail -f ./lib/container-management/containers/a8d09a9b-7b32-4edf-8149-8cc892b0b5c5/json.log
{"stream":"stderr","log":"INFO:hvac_service:[databroker-host:30555] Connectivity changed to: ChannelConnectivity.TRANSIENT_FAILURE\n","time":"2023-03-13T13:25:00.445768084Z"}
{"stream":"stderr","log":"INFO:hvac_service:Disconnected from data broker\n","time":"2023-03-13T13:25:00.446928774Z"}
{"stream":"stderr","log":"INFO:hvac_service:Connecting to Data Broker [databroker-host:30555]\n","time":"2023-03-13T13:30:18.023783927Z"}
{"stream":"stderr","log":"INFO:hvac_service:Starting HVAC Service on 0.0.0.0:50052\n","time":"2023-03-13T13:30:18.026796054Z"}
{"stream":"stderr","log":"INFO:hvac_service:Using gRPC metadata: (('dapr-app-id', 'databroker'),)\n","time":"2023-03-13T13:30:20.095109973Z"}
{"stream":"stderr","log":"INFO:hvac_service:[databroker-host:30555] Connectivity changed to: ChannelConnectivity.IDLE\n","time":"2023-03-13T13:30:20.108903903Z"}
{"stream":"stderr","log":"INFO:hvac_service:Connected to data broker\n","time":"2023-03-13T13:30:20.110818942Z"}
{"stream":"stderr","log":"INFO:hvac_service:Try register datapoints\n","time":"2023-03-13T13:30:20.113474482Z"}
{"stream":"stderr","log":"INFO:hvac_service:datapoints are registered.\n","time":"2023-03-13T13:30:20.231754158Z"}
{"stream":"stderr","log":"INFO:hvac_service:[databroker-host:30555] Connectivity changed to: ChannelConnectivity.READY\n","time":"2023-03-13T13:30:20.235362125Z"}

/var # tail -f ./lib/container-management/containers/0daeadbb-db81-46a3-bf2d-303dfde84599/json.log
{"stream":"stderr","log":"2023-03-13 13:31:16,727 INFO databroker: Vehicle.Cabin.Door.Row2.Left.IsOpen was already registered with id 122, type 1\n","time":"2023-03-13T13:31:16.729195089Z"}
{"stream":"stderr","log":"2023-03-13 13:31:16,739 INFO databroker: Vehicle.Cabin.Seat.Row1.Pos1.IsBelted was already registered with id 217, type 1\n","time":"2023-03-13T13:31:16.741050059Z"}
{"stream":"stderr","log":"2023-03-13 13:31:16,750 INFO databroker: Vehicle.Cabin.Seat.Row1.Pos2.IsOccupied was already registered with id 259, type 1\n","time":"2023-03-13T13:31:16.752966779Z"}
{"stream":"stderr","log":"2023-03-13 13:31:16,760 INFO databroker: Vehicle.Body.Lights.IsBrakeOn was already registered with id 42, type 1\n","time":"2023-03-13T13:31:16.762949464Z"}
{"stream":"stderr","log":"2023-03-13 13:31:16,771 INFO databroker: Vehicle.Body.Lights.IsRearFogOn was already registered with id 49, type 1\n","time":"2023-03-13T13:31:16.77429595Z"}
{"stream":"stderr","log":"2023-03-13 13:31:16,782 INFO databroker: Vehicle.Body.Lights.IsBackupOn was already registered with id 41, type 1\n","time":"2023-03-13T13:31:16.784689046Z"}
{"stream":"stderr","log":"2023-03-13 13:31:16,792 INFO databroker: Vehicle.Body.Lights.IsRightIndicatorOn was already registered with id 50, type 1\n","time":"2023-03-13T13:31:16.794473615Z"}
{"stream":"stderr","log":"2023-03-13 13:31:16,803 INFO databroker: Vehicle.Body.Lights.IsLeftIndicatorOn was already registered with id 46, type 1\n","time":"2023-03-13T13:31:16.805369911Z"}
{"stream":"stderr","log":"2023-03-13 13:31:16,816 INFO databroker: Vehicle.Trailer.IsConnected was already registered with id 801, type 1\n","time":"2023-03-13T13:31:16.818320087Z"}
{"stream":"stderr","log":"2023-03-13 13:31:16,829 INFO databroker: Vehicle.OBD.EngineLoad was already registered with id 575, type 10\n","time":"2023-03-13T13:31:16.831532314Z"}

In comparison, some CPU usage snapshots (top, rounded up)

SebastianSchildt commented 1 year ago

Is this with dbcfeeder idle, or is it pushing some data? If so, via a log file or via a vcan?

mikehaller commented 1 year ago

Just starting the container. I assume it's running the canreplay loop?

SebastianSchildt commented 1 year ago

dbcfeeder is a bit heavy, but I would not expect that much (I just tested "natively" without a container and got 20% load on a single core, but that was on a laptop).

Need to check container

Are you using a Pi, or if not, can you give some hints about what kind of compute power you have?

@erikbosch As you have recently worked on this, do you have any insights?

Maybe we need to profile this again to see where some gains are. Given the way the Python CAN lib and Python DBC tooling work, I'd always expect this to be heavy, but we should make sure we are not messing up somewhere (the queueing had been an issue in the past).

SebastianSchildt commented 1 year ago

And let's focus on the DBC feeder first; I'd guess HVAC might be ripe for a serious rewrite/update anyway...

erikbosch commented 1 year ago

Some thoughts. First, I did some refactoring recently for the changed DBC config handling, which could affect CPU usage somewhat, but as far as I know it should not give any significant changes. Secondly, when using a CAN dump file there is a (likely) inefficient phase in the beginning where all messages from the dump are loaded into memory, which occupies quite a lot of CPU for a few seconds. But after that, when running (not in a container) under cProfile with the default dump file, the feeder seemed to spend most of its time waiting. That may of course look different on a different platform.

With a quick analysis I could not find any obvious bottlenecks. If needed, one could spend more time profiling and compare how CPU time is affected by, e.g., changing the config for how often data shall be sent to the databroker.
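
For reference, a cumulative-time listing like the t.txt output below can be collected with Python's built-in cProfile, roughly as follows (a sketch only; the exact invocation used here is not shown in the thread, and it assumes dbcfeeder is importable as a module from the dbc2val directory with its default config):

import cProfile
import pstats
import runpy

# Profile a dbcfeeder run in the current thread; stop the feeder manually
# (e.g. Ctrl-C) once it has been running for a while.
profiler = cProfile.Profile()
try:
    profiler.enable()
    runpy.run_module("dbcfeeder", run_name="__main__")
except KeyboardInterrupt:
    pass
finally:
    profiler.disable()

# Print the entries with the highest cumulative time, similar to the listing below.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(40)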

erik@debian3:~/kuksa.val.feeders/dbc2val$ more t.txt
         8286341 function calls (8204530 primitive calls) in 235.560 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    429/1    0.004    0.000  235.715  235.715 {built-in method builtins.exec}
        1    0.000    0.000  235.715  235.715 <string>:1(<module>)
        1    0.000    0.000  235.715  235.715 runpy.py:199(run_module)
        1    0.000    0.000  235.714  235.714 runpy.py:63(_run_code)
        1    0.000    0.000  235.713  235.713 dbcfeeder.py:1(<module>)
        1    0.000    0.000  235.146  235.146 dbcfeeder.py:359(main)
        1    0.000    0.000  235.142  235.142 dbcfeeder.py:119(start)
        1    0.287    0.287  231.658  231.658 dbcfeeder.py:246(_run)
    10503    0.121    0.000  202.873    0.019 queue.py:154(get)
5559/5558    0.069    0.000  202.609    0.036 threading.py:288(wait)
32738/32736  202.515    0.006  202.521    0.006 {method 'acquire' of '_thread.lock' objects}
    10502    0.228    0.000   26.052    0.002 databroker.py:71(update_datapoint)
    10502    0.248    0.000   25.729    0.002 __init__.py:743(set)
    10534    0.061    0.000   24.015    0.002 _channel.py:937(__call__)
    10534    0.142    0.000   23.889    0.002 _channel.py:919(_blocking)
    10534   19.717    0.002   19.989    0.002 {method 'next_event' of 'grpc._cython.cygrpc.SegregatedCall' objects}
        1    0.301    0.301    2.695    2.695 canplayer.py:42(__init__)
    10534    2.485    0.000    2.589    0.000 {method 'segregated_call' of 'grpc._cython.cygrpc.Channel' objects}
   191832    1.770    0.000    2.340    0.000 canutils.py:50(__iter__)
       65    0.003    0.000    1.467    0.023 __init__.py:1(<module>)
    10502    0.202    0.000    1.231    0.000 __init__.py:508(_prepare_set_request)
        2    1.203    0.601    1.203    0.601 {built-in method time.sleep}
    10502    0.126    0.000    0.974    0.000 __init__.py:424(to_message)
    10502    0.118    0.000    0.935    0.000 dbc2vssmapper.py:116(transform_value)
    10502    0.177    0.000    0.804    0.000 __init__.py:384(to_message)
        1    0.000    0.000    0.771    0.771 dbcreader.py:33(__init__)

SebastianSchildt commented 1 year ago

I remember Python queues are a bit finicky and seem to busy-loop under certain conditions. Did you try whether removing the timeout here https://github.com/eclipse/kuksa.val.feeders/blob/2bcfabdc9a507772a59ce4ee9ff4c21582cdeab7/dbc2val/dbcfeeder.py#L276 changes anything?

If the queue get is blocking it should be fine; if you have a non-blocking queue, the trick was to add your own tiny sleep (1 ms) whenever the get comes back empty.
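
A minimal sketch of the two patterns (not taken from dbcfeeder itself), assuming a standard queue.Queue shared between the reader and the sender:

import queue
import time

q = queue.Queue()

# Blocking get with a timeout: the thread sleeps inside get() and only wakes
# up once per second to re-check the outer loop, so CPU usage stays low.
try:
    item = q.get(timeout=1)
except queue.Empty:
    pass

# Non-blocking get: without an explicit sleep this busy-loops at 100% CPU,
# so back off briefly whenever the queue is empty.
try:
    item = q.get_nowait()
except queue.Empty:
    time.sleep(0.001)  # ~1 ms back-off instead of spinning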

erikbosch commented 1 year ago

Changing the queue did not do anything; the call is blocking, so a timeout of 1 only means that you will return to the outer loop more often.

Anyhow, I did some more profiling with yappi to get some more "threading" information. It seems that the tx-worker (which sends around 250 CAN signals per second on virtual CAN when replaying a candump) is responsible for quite a lot of the load. That part should not be a concern for dbcfeeder when using a "real" CAN source. Some CPU could possibly be saved by refactoring the frequent debug prints (not using f-strings) or by smarter time throttling (filtering on the raw CAN id already before DBC conversion; today throttling is performed after DBC conversion, based on individual VSS signals). Another possible improvement could be to not use individual gRPC calls for each update, but rather send them in batches. Exactly how much that would save is however somewhat unclear.
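
For reference, a threading-aware CPU profile like the one below can be collected with yappi roughly as follows (a sketch only; the exact invocation used is not shown in the thread):

import yappi  # third-party profiler: pip install yappi

yappi.set_clock_type("cpu")  # matches the "Clock type: CPU" header below
yappi.start()
# ... let the feeder run for a while here ...
yappi.stop()

# Per-function statistics (name, ncall, tsub, ttot, tavg) and a per-thread breakdown.
yappi.get_func_stats().print_all()
yappi.get_thread_stats().print_all()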

Clock type: CPU
Ordered by: totaltime, desc

name                                  ncall  tsub      ttot      tavg      
..canplayer.py:42 CANplayer.__init__  1      0.387402  5.252725  5.252725
..s.py:50 CanutilsLogReader.__iter__  191..  2.960808  4.750074  0.000025
..ib/python3.10/threading.py:942 run  4/3    0.000042  4.398443  1.099611
..canplayer.py:75 CANplayer.txWorker  1      0.505677  3.931559  3.931559
..er.py:71 Provider.update_datapoint  2418   0.094510  2.664002  0.001102
..grpc/__init__.py:743 VSSClient.set  2418   0.099178  2.525617  0.001045
..ces/virtual.py:105 VirtualBus.send  10010  0.222000  2.517890  0.000252
..ages/can/bus.py:80 VirtualBus.recv  10014  0.210398  1.991252  0.000199
..ib/python3.10/copy.py:128 deepcopy  875..  0.704782  1.799281  0.000021
.. _UnaryUnaryMultiCallable.__call__  2450   0.024782  1.778202  0.000726
../python3.10/queue.py:154 Queue.get  12432  0.339168  1.768578  0.000142
.._UnaryUnaryMultiCallable._blocking  2450   0.061932  1.734419  0.000708
..dbcreader.py:33 DBCReader.__init__  1      0.000023  1.675066  1.675066
..database/__init__.py:104 load_file  1      0.000021  1.666321  1.666321
..ools/database/__init__.py:261 load  1      0.000007  1.665539  1.665539
..tabase/__init__.py:295 load_string  1      0.000006  1.664382  1.664382
../__init__.py:345 load_can_database  1      0.000019  1.664376  1.664376
..ase.py:210 Database.add_dbc_string  1      0.001186  1.664348  1.664348
..al.py:94 VirtualBus._recv_internal  10014  0.096321  1.615935  0.000161
..an/formats/dbc.py:1960 load_string  1      0.002640  1.378636  1.378636
..essage.py:215 Message.__deepcopy__  10010  0.150132  1.181531  0.000118
..ase.py:451 Database.decode_message  10010  0.140969  1.044516  0.000104
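
On the debug-print point above, a small illustration (not taken from dbcfeeder) of why %-style logging is cheaper than f-strings on hot paths:

import logging

log = logging.getLogger("dbcfeeder")
value = 42

# f-string: the message string is always built, even when DEBUG is disabled.
log.debug(f"sending datapoint with value {value}")

# %-style (lazy) formatting: the string is only built if DEBUG is actually
# enabled, which saves work when this line sits in a per-message hot path.
log.debug("sending datapoint with value %s", value)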

erikbosch commented 1 year ago

@SebastianSchildt - I think we need to decide how to continue on this one. Using a dump with about 250 CAN messages every second will certainly cause some CPU load if we are to feed and read 250 messages per second on a virtual CAN bus. There are some improvements we could possibly make, but the question is whether they would give significant savings, especially for the use case where we use a dump service, and how high a priority this is.

int0x27 commented 1 year ago

It could be a Leda-specific issue; it is not normal to have such CPU usage on the HVAC service. Is it cycling through the TRANSIENT_FAILURE state?
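
One way to check for such cycling (a sketch using the stock grpc Python API, independent of the HVAC service code): subscribe to connectivity changes on a channel pointed at the same databroker address and watch how often the state flips.

import time
import grpc

# Watch connectivity transitions of a channel to the databroker; rapid
# TRANSIENT_FAILURE <-> CONNECTING flips would show up as a stream of lines.
channel = grpc.insecure_channel("databroker-host:30555")

def on_state_change(state: grpc.ChannelConnectivity) -> None:
    print(time.strftime("%H:%M:%S"), "connectivity ->", state)

channel.subscribe(on_state_change, try_to_connect=True)
time.sleep(60)  # observe transitions for a minute
channel.unsubscribe(on_state_change)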

erikbosch commented 5 months ago

@mikehaller - do you want to keep this one open? We have not done anything here for almost a year. I suggest closing it and later creating new issues in the corresponding repos if needed.