edgexfoundry / edgex-go

EdgeX Golang Services Monorepo | Owner: Core/Support WG

Delete Device API causes a performance issue when many events are associated with the Device #1286

Closed: cloudxxx8 closed this 5 years ago

cloudxxx8 commented 5 years ago

When the Delete Device API is called, all the associated Events and Readings in core-data are also deleted, but they are deleted one by one. This causes a performance issue when many events exist.

Delete Events one by one: https://github.com/edgexfoundry/edgex-go/blob/master/internal/core/data/event.go#L263-L270

Delete Readings one by one, too: https://github.com/edgexfoundry/edgex-go/blob/master/internal/core/data/event.go#L168-L172

cloudxxx8 commented 5 years ago

Adding @andyf1967 and @cherrycl for awareness.

tsconn23 commented 5 years ago

Yes, this should be handled via goroutines so the response can be returned to the caller and the cleanup can be handled asynchronously.
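Roughly, a minimal sketch of the shape I mean (the route, port, and the deleteEventsByDevice helper are illustrative stand-ins, not the actual edgex-go code):

```go
package main

import (
	"log"
	"net/http"

	"github.com/gorilla/mux"
)

// deleteEventsByDevice is a hypothetical stand-in for the core-data
// logic that removes all events/readings for a device.
func deleteEventsByDevice(deviceName string) error {
	// ...expensive per-event deletion happens here...
	return nil
}

func deleteDeviceHandler(w http.ResponseWriter, r *http.Request) {
	deviceName := mux.Vars(r)["name"]

	// Run the expensive cleanup asynchronously so the HTTP caller is
	// not blocked while thousands of events are removed.
	go func() {
		if err := deleteEventsByDevice(deviceName); err != nil {
			log.Printf("background cleanup for device %s failed: %v", deviceName, err)
		}
	}()

	// Respond immediately; the cleanup continues in the background.
	w.WriteHeader(http.StatusAccepted)
}

func main() {
	r := mux.NewRouter()
	r.HandleFunc("/api/v1/device/name/{name}", deleteDeviceHandler).Methods(http.MethodDelete)
	log.Fatal(http.ListenAndServe(":48080", r))
}
```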

cloudxxx8 commented 5 years ago

Is it possible to create a method that deletes multiple records in a single transaction instead of a for loop? Something like: DELETE FROM Event WHERE Event.device = ${DeviceName}
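For the Mongo-backed core-data store, the one-shot equivalent might look something like this with mgo (the database and collection names here are guesses for illustration):

```go
package main

import (
	"log"

	mgo "gopkg.in/mgo.v2"
	"gopkg.in/mgo.v2/bson"
)

func main() {
	session, err := mgo.Dial("localhost")
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// One round trip removes every event document whose "device" field
	// matches, instead of looping over IDs client-side.
	info, err := session.DB("coredata").C("event").
		RemoveAll(bson.M{"device": "my-device"})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("removed %d events in a single call", info.Removed)
}
```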

cloudxxx8 commented 5 years ago

Since the CPU impact is very high, goroutines would only improve the response time, and could cause memory pressure if a user deletes multiple Devices with lots of events associated with them.

jduranf commented 5 years ago

This issue is related to #465. PR #712 (along with #705 and #466) was created to fix it, but unfortunately wasn't merged.

The solution used in that PR was a mix of goroutines to avoid blocking and database changes to reduce CPU usage.

I think the PR that fixes this issue should also fix #465 and the removal of events by age.

cloudxxx8 commented 5 years ago

Thanks @jduranf. @brandonforster, do you agree with this resolution?

AnthonyMBonafide commented 5 years ago

I have looked through the referenced PRs and I think the approach is reasonable. However, I am wondering if we can possibly make it more efficient by putting more of the responsibility on the database. I do not know if this is possible, but I am going to do a little exploring.

AnthonyMBonafide commented 5 years ago

I have taken a deeper look into the internals of the deletion process. As of now the logical flow is as follows:

  1. Get associated reports from DB
  2. For each device report:
     2a. Delete report from DB
     2b. Get device by name from DB
     2c. Get device service from DB
     2d. Notify device service of report deletion
  3. Delete the device from DB
  4. Post notification to notification-service
  5. Get device service by id
  6. Notify device service of device deletion

I think there are some good opportunities here for time and memory optimisations. I am thinking that to resolve this issue we restructure the logical flow in a manner that removes any redundant calls to the database and allows for more concurrency where possible. Basically, doing what was done in the PRs that were shared by @jduranf, with a few more modifications.

I think we can make this even more performant by adding APIs, to both the DB clients and services, that allow for batch processing. That would reduce the amount of processing done at the application level, which currently wastes a lot of resources on large numbers of distinct database calls and on marshaling. Performing these batch operations in transactions could also give this delete function some atomicity.

I am thinking that for this issue we do the restructuring of the logical flow, which reduces redundancies and increases concurrency, and then create follow-up issues to add the necessary APIs for batch processing, which can be introduced at a later time to avoid scope creep. Does this make sense?
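Something along these lines; the method names are illustrative, not a final design:

```go
// Sketch of possible batch additions to the core-data DBClient
// interface; names and shapes are illustrative only.
type DBClient interface {
	// ...existing single-record operations...

	// DeleteReadingsByDevice removes every reading associated with the
	// named device in one store-specific batch operation.
	DeleteReadingsByDevice(deviceName string) error

	// DeleteEventsByDevice does the same for events, ideally inside a
	// single transaction so the delete is atomic.
	DeleteEventsByDevice(deviceName string) error
}
```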

tsconn23 commented 5 years ago

@AnthonyMBonafide A flow diagram of your proposal would be helpful prior to writing any code. If you want to create one, I'm happy to review it with you.

SteveOss commented 5 years ago

IMHO, all sorts of "delete a group of events based on X" operations should be done by the database; iterating on the client is just a horribly inefficient solution.

AnthonyMBonafide commented 5 years ago

Here are a couple of communication diagrams depicting the logical flow for a delete device request via the API:

Here is how the application handles the delete request today at a high level: [diagram: DeleteDeviceEdgeXGo1286-Current]

This is a proposal to make the endpoint more responsive and better utilise resources: [diagram: DeleteDeviceEdgeXGo1286-Proposal]

Step #12 in the proposal diagram would require a new API on the database abstraction to allow for deletion of more than one report.

tsconn23 commented 5 years ago

@AnthonyMBonafide I'm a little confused. The issue starts off by discussing how deleting events for a given device takes a long time. Your flow diagrams are related to deletion of a device and the associated device reports. These seem like two different things. If I'm missing something, where is the link?

I also think we should treat these as two different issues -- deleting events by device versus deleting device reports by device. More importance should be given to the first b/c I'm not sure device reports are really used that much.

AnthonyMBonafide commented 5 years ago

You are correct. I started looking into device deletion and came up with those diagrams. I need to re-assess against the correct problem. My apologies.

AnthonyMBonafide commented 5 years ago

As a first stab at this, I have added an interface function to the DBClient which abstracts the process of deleting all the readings associated with a specified device. This lets implementers handle the deletion process in a way specific to their type of data-store. For Mongo this works great, and we are able to execute the delete in one shot. However, that is not the case for Redis: the data there spans many sorted sets built from fields we would first need to retrieve, like the reading name. So for the Redis implementation we still need to perform the looping logic as before; the only difference is that it is now abstracted from the caller (the business logic).

Looking into this issue, there is a way to get better performance out of Redis. As suggested by this article, we can rename the key associated with the readings and then perform a data cleanup process in the background. The renaming of the key has the same effect as a delete, and the change is visible to all other clients and DB connections. We can then delete the data in batches concurrently without having to wait. I was playing around with this idea, and the way we interact with the data-store seems like it would allow us to pull this off, mostly because a lot of the logic first looks up reading IDs using ZRANGE and then queries for the actual data using GET. With this structure, the Redis client can still get IDs of readings that have been "deleted", but will not be able to marshal the data from the JSON structure, since the key has been renamed and no longer provides a way to obtain the JSON containing the reading information, resulting in an empty result as intended.
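Roughly the shape I am imagining, as a sketch using redigo; the key layout here is illustrative, not the real core-data schema:

```go
package main

import (
	"log"

	"github.com/gomodule/redigo/redis"
)

// deleteReadingsForDevice makes the readings invisible with a cheap
// RENAME, then cleans up the renamed data in the background.
func deleteReadingsForDevice(pool *redis.Pool, device string) error {
	conn := pool.Get()
	defer conn.Close()

	src := "reading:device:" + device // illustrative key layout
	tombstone := src + ":deleted"

	// RENAME is O(1); after it returns, other clients see the readings
	// as gone, which is the user-visible "delete".
	if _, err := conn.Do("RENAME", src, tombstone); err != nil {
		return err
	}

	// The expensive cleanup happens off the request path.
	go cleanup(pool, tombstone)
	return nil
}

func cleanup(pool *redis.Pool, key string) {
	conn := pool.Get()
	defer conn.Close()

	// Walk the tombstoned sorted set and drop the per-reading blobs.
	ids, err := redis.Strings(conn.Do("ZRANGE", key, 0, -1))
	if err != nil {
		log.Printf("cleanup of %s failed: %v", key, err)
		return
	}
	for _, id := range ids {
		if _, err := conn.Do("DEL", id); err != nil {
			log.Printf("failed to delete %s: %v", id, err)
		}
	}
	conn.Do("DEL", key)
}

func main() {
	pool := &redis.Pool{
		Dial: func() (redis.Conn, error) {
			return redis.Dial("tcp", "localhost:6379")
		},
	}
	if err := deleteReadingsForDevice(pool, "my-device"); err != nil {
		log.Fatal(err)
	}
	// Note: a real service keeps running; this demo may exit before the
	// background cleanup finishes.
}
```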

I am wondering if this is something we want to tackle, and whether there are any problems I am not seeing with this approach. Any thoughts?

tsconn23 commented 5 years ago

@andresrinivasan Any guidance?

AnthonyMBonafide commented 5 years ago

Also, we would have to slightly tweak the way we are doing reading updates in Redis. Right now we are doing:

  1. Get old reading data from Redis
  2. Marshal to object
  3. Update the relevant fields
  4. Delete old reading
  5. Add new reading

As of now, we delete the old reading (step 4) via the common delete operation. This will not work asynchronously, as it would introduce a race condition. This means we would still have to perform the old looping logic in that instance, and maybe others. Just a note if we decide to go down that route.

andresrinivasan commented 5 years ago

I need to dig into the actual data details, but there are a few Redis-specific options, including: first getting all the keys and then doing a pipelined delete; renaming keys and setting an expiration; or moving the actual delete operation to Lua so everything runs server-side.

SteveOss commented 5 years ago

As @andresrinivasan suggests, I think the best solution is to implement the equivalent of a stored procedure (a Lua script for Redis) that atomically implements the deletion on the server.

AnthonyMBonafide commented 5 years ago

Would a stored procedure still block the client until it completes?

tsconn23 commented 5 years ago

Depends on how it's called. If you take that approach, see what the performance is. If it's longer than the expected latency set by the associated blackbox test, you could invoke the Delete path in a goroutine.

AnthonyMBonafide commented 5 years ago

Ok, I will take a stab at the stored procedure approach.

AnthonyMBonafide commented 5 years ago

I have roughly implemented both approaches; you can see the implementation details for the goroutine approach here and the Lua script approach here. Both implementations need some cleaning up, so be warned :)

Here are the results I captured from my machine:

Goroutine approach:

100 Readings:
Redis renaming time: 227.14 microseconds
Background process: 19.7 ms

1,000 Readings:
Redis renaming time: 1.24 ms
Background process: 96 ms

100,000 Readings:
Redis renaming time: 149.63 ms
Background process: 7.86 s

1,000,000 Readings:
Redis renaming time: 1.37 s
Background process: 1m 20s

Lua script approach:

100 Readings: 999.21 microseconds
1,000 Readings: 8.87 ms
100,000 Readings: 714.81 ms
1,000,000 Readings: 14.55 s

My setup used an empty database for each metric captured, with one event holding the number of readings listed alongside the metric. After setting up the system, I issued a delete event API request.

In my opinion, I am leaning towards the Go-based approach. It executes faster, and from a maintainability perspective I like keeping the logic in Go rather than adding Lua into the mix.

Also, for the Lua metrics I leveraged Redis' script caching feature by sending the EVAL command once before capturing the data; I noticed a performance improvement when executing in this fashion. We could make the Redis approach a bit more performant by leveraging the EVALSHA command, which eliminates the need to send the script body each time by referencing its SHA1 digest instead.
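For reference, redigo's Script helper wraps exactly this pattern: it tries EVALSHA first and only falls back to EVAL on a cache miss, so the script body crosses the wire at most once (the script below is a toy stand-in for the real deletion logic):

```go
package main

import (
	"log"

	"github.com/gomodule/redigo/redis"
)

// A toy script standing in for the real deletion logic.
const deleteScript = `return redis.call("DEL", KEYS[1])`

func main() {
	conn, err := redis.Dial("tcp", "localhost:6379")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// Script.Do optimistically issues EVALSHA and only sends the full
	// script via EVAL if Redis has not cached it yet.
	script := redis.NewScript(1, deleteScript)
	if _, err := script.Do(conn, "reading:device:my-device"); err != nil {
		log.Fatal(err)
	}
}
```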

Any objections to going with the goroutine approach?

andresrinivasan commented 5 years ago

Redis is single-threaded, so all those requests are just going to pile up at Redis or get blocked at the socket until Redis issues a read. My guess is the goroutine approach is keeping the socket filled and we haven't found the knee in performance where the requests are blocked on Redis.

What if you were to use pipelining to first do all the reads for the keys, and then again use pipelining for the delete? I would also compare deleting vs. EXPIRE (or perhaps RENAME + EXPIRE to avoid a race condition).
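A sketch of what that pipelining looks like with redigo: queue commands with Send, push the whole batch to Redis with a single Flush, then drain one reply per command with Receive (key names illustrative):

```go
package main

import (
	"log"

	"github.com/gomodule/redigo/redis"
)

// pipelinedDelete queues one DEL per reading ID, then performs a
// single network write for the whole batch.
func pipelinedDelete(conn redis.Conn, ids []string) error {
	for _, id := range ids {
		if err := conn.Send("DEL", id); err != nil {
			return err
		}
	}
	if err := conn.Flush(); err != nil { // one write for the batch
		return err
	}
	for range ids { // one reply per queued command
		if _, err := conn.Receive(); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	conn, err := redis.Dial("tcp", "localhost:6379")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	if err := pipelinedDelete(conn, []string{"reading:1", "reading:2"}); err != nil {
		log.Fatal(err)
	}
}
```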

AnthonyMBonafide commented 5 years ago

@andresrinivasan, great suggestion, and I apologise for missing it the first time you mentioned pipelining. I introduced pipelining into the background process so that it pipelines the retrieval of the data and then, separately, the deletion. This made a significant positive impact on performance. Here is an update for the Go-based approach:

100 Readings:
Redis renaming time: 176 microseconds
Background process: 6 ms

1,000 Readings:
Redis renaming time: 1.32 ms
Background process: 57 ms

100,000 Readings:
Redis renaming time: 144.22 ms
Background process: 2.58 s

1,000,000 Readings:
Redis renaming time: 1.29 s
Background process: 27.24 s

The time to rename was virtually the same, since that code was not touched and was already pipelined. I also introduced some batching into the deletion portion of the background process; the metrics above used batches of 1,000.

As for the difference between EXPIRE, RENAME + EXPIRE, and DEL, the Redis documentation states they operate in O(1), so I think using RENAME is a good option. However, I am not seeing how RENAME + EXPIRE would allow us to avoid a race condition. Could you explain the flow you are envisioning?

andresrinivasan commented 5 years ago

What I meant by RENAME + EXPIRE was: rename the key, then set the key's TTL to 1 (i.e. 1 second). From an EdgeX perspective, the rename is the equivalent of the delete, since the key is no longer there. The actual deletion that frees the memory happens a moment or so later, though exactly when can't be predicted.

The challenge is that we need to send both a RENAME and an EXPIRE, versus a single delete command. On the other hand, a Lua script that did both would mean only a single command is sent. If the delete has a real-world cost slightly higher than a RENAME + EXPIRE, that would be the win.
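A sketch of that single-command idea, with the RENAME and the EXPIRE wrapped in one Lua script (key names illustrative):

```go
package main

import (
	"log"

	"github.com/gomodule/redigo/redis"
)

// renameExpireScript performs the RENAME and the EXPIRE atomically on
// the server, so only one command crosses the wire.
const renameExpireScript = `
redis.call("RENAME", KEYS[1], KEYS[2])
redis.call("EXPIRE", KEYS[2], 1)
return 1
`

func main() {
	conn, err := redis.Dial("tcp", "localhost:6379")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	script := redis.NewScript(2, renameExpireScript)
	// KEYS[2] expires (and frees its memory) about a second after the
	// rename makes the data invisible to other clients.
	if _, err := script.Do(conn, "reading:device:my-device", "reading:device:my-device:deleted"); err != nil {
		log.Fatal(err)
	}
}
```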

AnthonyMBonafide commented 5 years ago

I see your point. As of now I am using UNLINK to remove the key in a goroutine. Per the documentation, the UNLINK command acts like DEL except that it performs the memory reclamation in the background and does not block. As for the renaming portion, that happens when the client invokes the delete API, before kicking off the background deletion process. The reason for the initial rename, rather than a delete or rename + expire, is that the value associated with the key is the reading information in JSON format, which contains fields like reading name and creation time that we need in order to clean up all the other data stored in sorted sets. I hope that makes sense and that I am understanding everything correctly.