IMO, this complicates the logic at the go-ovn layer. The transaction is fairly orthogonal to the ovsdb updates. The go-ovn layer would now have to keep track of the individual transactions and look for those specific commits to have percolated back via the notifier.
In the case of an immediate read following an object creation, the client of the library should rethink whether such a read is indeed necessary and should have logic to handle notifications asynchronously via the notifier interface. Alternatively, since the client has context on the object it's waiting to read from the cache, it can implement the wait logic by polling the cache for the right state.
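For illustration, a minimal sketch of that cache-polling approach, assuming go-ovn's Client interface with LSGet(name) returning ([]*LogicalSwitch, error); the helper name, timeout, and poll interval are made up:

```go
package example

import (
	"fmt"
	"time"

	goovn "github.com/ebay/go-ovn"
)

// waitForLogicalSwitch polls the go-ovn cache until the named switch shows up
// or the deadline expires. LSGet reads the client's local cache, so the loop
// does not issue a new OVSDB request on each iteration.
func waitForLogicalSwitch(client goovn.Client, name string, timeout time.Duration) (*goovn.LogicalSwitch, error) {
	deadline := time.Now().Add(timeout)
	for {
		lsList, err := client.LSGet(name)
		if err == nil && len(lsList) > 0 {
			return lsList[0], nil
		}
		if time.Now().After(deadline) {
			return nil, fmt.Errorf("timed out waiting for logical switch %q in cache", name)
		}
		time.Sleep(50 * time.Millisecond)
	}
}
```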
@abhat @trozet I'd agree that if we had a reconcile loop in ovnkube it might be OK to just let the transaction happen and have asynchronous notification of failure, but we don't. Plus, I have to believe that we want an indication of success/failure at the point of the transaction execution so that we can do something intelligent there. It feels really weird to me to have a pattern of kicking off a transaction, ignoring the result, and then doing something later in some other codepath when it fails (e.g. reconcile).
Shouldn't we, regardless of having reconcile loops, still have individual transactions block until they receive success/failure? We don't have to have separate transactions block on each other (though they might, due to ovsdb-server ordering), but that's just how things are, no? Perhaps I'm wrong and that's not how the C IDL and/or the Python one does it, but that's how I think the CMS should deal with transactions.
@trozet did you or @girishmg confirm this is a real issue with some test? In my opinion there shouldn't be a problem. The transaction execution blocks and guarantees the cache is already populated when execute() returns. Take a look at these commits:
https://github.com/eBay/libovsdb/commit/47501432a00f96d1934da6f81e943d25f83d2bbf
https://github.com/cenkalti/rpc2/pull/14/commits/09ab3e86ec385d21488d0315a98c51dd3b71a27f
Please let me know if you still encounter such a problem with these commits in use.
Thanks @hzhou8! We did not encounter this problem in testing; it was just through analysis of the code. Glad it is already taken care of.
@hzhou8 this should work fine so long as the ovsdb-server IDL is guaranteeing the ordering of transaction responses vs. cache updates.
Another thing that comes to mind: currently the Execute() API of the Client doesn't return the result of the transaction, i.e. []libovsdb.OperationResult, to the callers. If it did, the case we have in ovn-kubernetes of getting the row UUID of the LSP from the create transaction would be simplified, in that we wouldn't have to rely on the updated cache at all. Of course, the caller would need to correlate the individual operations with the result slice returned. Thoughts?
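To make that concrete, a purely hypothetical sketch of the proposed API: ExecuteR is a made-up name (go-ovn's Execute today returns only an error), and it is assumed to hand back libovsdb's []OperationResult so the caller can take the inserted row's UUID straight from the transaction reply:

```go
package example

import (
	"fmt"

	goovn "github.com/ebay/go-ovn"
	"github.com/ebay/libovsdb"
)

// clientWithResults is a hypothetical extension of go-ovn's Client; ExecuteR
// does not exist in the library today and only illustrates the proposed API.
type clientWithResults interface {
	goovn.Client
	ExecuteR(cmds ...*goovn.OvnCommand) ([]libovsdb.OperationResult, error)
}

// lspUUIDFromResults creates a logical switch port and returns its row UUID
// taken from the transaction result, with no dependency on the cache.
func lspUUIDFromResults(c clientWithResults, ls, lsp string) (string, error) {
	cmd, err := c.LSPAdd(ls, lsp)
	if err != nil {
		return "", err
	}
	results, err := c.ExecuteR(cmd)
	if err != nil {
		return "", err
	}
	// The caller still has to correlate operations with results; here the
	// command carries a single insert, so results[0] is the one we want.
	if len(results) == 0 || results[0].Error != "" {
		return "", fmt.Errorf("lsp-add %s failed: %+v", lsp, results)
	}
	return results[0].UUID.GoUUID, nil
}
```

The correlation is positional per RFC 7047: result i corresponds to operation i in the transact request, which is why a single-insert command can just look at results[0].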
@abhat That may be something to improve in libovsdb. I guess it is straightforward but needs some API change? Maybe also consider specifying the row UUID directly (so that you can directly use the same UUID as the k8s object), but that requires changes as well. (OVSDB now supports a client-specified UUID.)
Actually, this is something in the go-ovn Client library. Look at _, err := odbi.transact(odbi.db, ops...) in ovnimp.go.
@hzhou8 @trozet @dcbw @abhat the issue was not about whether the transaction is synchronous or not. The issue was that the new object that gets created through ovsdb transact may not be immediately available in the go-ovn cache.
lsCmd, _ := ovnDBClient.LSAdd("dummy")
lsCmd.Execute()
|--> goovn.transact()
     |--> libovsdb.transact()
          |--> rpcClient.Call("transact", args, &reply)   ** this blocks **
We get the result from the call above and so on and so forth.
The question was: if we do this
lsCmd, _ := ovnDBClient.LSAdd("dummy")
getLsCmd, _ := ovnDBClient.LSGet("dummy")
lsCmd.Execute()
lsInfo, err := getLSCmd.Execute()
now the lsInfo will not have the information about the dummy Logical Switch. The go-ovn client will not have the information yet because the update from the ovsdb-server might not have made its way through the library and populated the library's cache.
@girishmg I thought what @hzhou8 mentioned was that, with the ordering of the messages, we call goovn.transact(), which won't return until we get a response from the transaction, and we are guaranteed that the update arrives before the transaction response. Meaning that by the time transact returns to us, the cache has already been updated.
@girishmg First of all, the example you gave is wrong. LSGet() doesn't return a command. Since it reads the cache, it directly returns the objects. So the right example would be:
lsCmd, _ := ovnDBClient.LSAdd("dummy")
lsCmd.Execute()
lsList, _ := ovnDBClient.LSGet("dummy")
The RPC message ordering ensures that in the above code lsList will always contain "dummy" (unless the DB transaction failed).
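Spelling out the error handling in the same flow (a sketch; Client.Execute(cmds ...*OvnCommand) error is used instead of the command-level Execute() shorthand above), since the ordering guarantee only helps once the transaction has actually committed:

```go
package example

import (
	"fmt"

	goovn "github.com/ebay/go-ovn"
)

// addAndReadLS adds a logical switch and immediately reads it back from the
// go-ovn cache. Execute blocks until the transaction reply arrives, and the
// matching update notification is delivered before that reply, so on success
// the cache read below is expected to find the row.
func addAndReadLS(ovnDBClient goovn.Client, name string) (*goovn.LogicalSwitch, error) {
	lsCmd, err := ovnDBClient.LSAdd(name)
	if err != nil {
		return nil, err
	}
	if err := ovnDBClient.Execute(lsCmd); err != nil {
		// The DB transaction failed, so the switch was never inserted and
		// the cache is not expected to contain it.
		return nil, fmt.Errorf("LSAdd(%q) transaction failed: %v", name, err)
	}
	lsList, err := ovnDBClient.LSGet(name)
	if err != nil {
		return nil, err
	}
	if len(lsList) == 0 {
		return nil, fmt.Errorf("logical switch %q not found in cache", name)
	}
	return lsList[0], nil
}
```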
@trozet 's understanding is correct.
@hzhou8 sorry, I wrote that example in a hurry, and I was focusing more on the flow and expectations.
So, I enabled jsonrpc and ovsdb_jsonrpc_server logging on the ovsdb-server and ran LRAdd() and LRGet(). This is what I saw in the log files:
- 2020-08-27T15:18:52.568Z|00040|jsonrpc|DBG|tcp:192.168.56.1:58180: received request, method="transact", params=["OVN_Northbound",{"uuid-name":"rowf45c0de2_6242_48fc_b208_c4c5a088693c","row":{"name":"gmg0","external_ids":["map",[["test","1"]]]},"op":"insert","table":"Logical_Router"}], id=5
- 2020-08-27T15:18:52.568Z|00041|jsonrpc|DBG|tcp:192.168.56.1:58180: send notification, method="update", params=["",{"Logical_Router":{"d195a4c7-7abf-44b5-8e78-67cb437f36f6":{"new":{"name":"gmg0","_version":["uuid","9b58c9e1-9bca-4fe0-ad0d-936e85a3a131"],"policies":["set",[]],"external_ids":["map",[["test","1"]]],"options":["map",[]],"enabled":["set",[]],"nat":["set",[]],"load_balancer":["set",[]],"ports":["set",[]],"static_routes":["set",[]]}}}}]
- 2020-08-27T15:18:52.568Z|00042|jsonrpc|DBG|tcp:192.168.56.1:58180: send reply, result=[{"uuid":["uuid","d195a4c7-7abf-44b5-8e78-67cb437f36f6"]}], id=5
So, the update notification is sent even before the reply for the transaction is sent to the client. I didn't know about this. I don't think it is captured in RFC 7047 itself, or maybe I missed it in the RFC, or it is an ovsdb-server-specific implementation detail.
So, this should be a non-issue. Thanks everyone.
This is indeed not specified in the RFC, but an implementation detail that Ben had called out in https://mail.openvswitch.org/pipermail/ovs-discuss/2017-March/043858.html
Don't mean to keep reopening this, but now I think we are seeing this in the logs:
E0828 10:37:10.072481 1 ovn.go:575] failed to get the logical switch port: e2e-statefulset-2897_ss2-1 from the ovn client, error: object not found
I0828 10:37:10.072745 1 kube.go:63] Setting annotations map[k8s.ovn.org/pod-networks:{"default":{"ip_addresses":["10.128.2.112/23"],"mac_address":"0a:58:0a:80:02:70","gateway_ips":["10.128.2.1"],"routes":[{"dest":"10.132.0.0/14","nextHop":"10.128.2.3"}],"ip_address":"10.128.2.112/23","gateway_ip":"10.128.2.1"}}] on pod e2e-statefulset-2897/ss2-1
I0828 10:37:10.073268 1 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"e2e-statefulset-2897", Name:"ss2-1", UID:"9db859ed-cb4c-4fd7-b758-bf137388d53f", APIVersion:"v1", ResourceVersion:"75968", FieldPath:""}): type: 'Warning' reason: 'ErrorAddingLogicalPort' failed to get the logical switch port: e2e-statefulset-2897_ss2-1 from the ovn client, error: object not found
In ovn-kubernetes we create the port and then immediately read: https://github.com/ovn-org/ovn-kubernetes/blob/master/go-controller/pkg/ovn/pods.go#L446
Interestingly all the failures are pods with the same kind of naming scheme...
Interesting. After adding the set_blocking() feature in the rpc2 package I never saw this happen. If the right version is in use, I wonder if it is something else, such as a transaction that failed but appeared to be successful. Is this reproducible? If so, could you add some logs in libovsdb's update function and Transaction function respectively, to see whether the update notification and transaction response were received in order?
@hzhou8 I managed to make it reproducible in the upstream ovn-kubernetes project. I think I see what is happening. We still use the ovn-nbctl cmd to delete the port. So in quick succession of pod delete then add, I think the port is gone in OVN NB, but an LSP get right after (for port create) hits the cache in go-ovn before it has been updated to reflect that the port is deleted. I modified our code slightly to illustrate it:
I0902 04:08:57.833276 61 pods.go:78] Deleting pod: statefulset-8920/ss-0
I0902 04:08:57.839402 61 ovs.go:162] exec(148): /usr/bin/ovn-nbctl --timeout=15 --if-exists lsp-del statefulset-8920_ss-0
I0902 04:08:57.843758 61 ovs.go:165] exec(148): stdout: ""
I0902 04:08:57.843787 61 ovs.go:166] exec(148): stderr: ""
I0902 04:08:57.843863 61 pods.go:272] Creating logical port for statefulset-8920_ss-0 on switch ovn-control-plane
I0902 04:08:57.843886 61 pods.go:299] TROZET NOT CREATING NEW LSP ALREADY EXISTS FOR: statefulset-8920_ss-0
###^ this means we will not do an lsp-add, but the port is really gone at this point
I0902 04:08:57.847241 61 pods.go:441] TROZET: executing CMDS for port: statefulset-8920_ss-0
I0902 04:08:57.847627 61 ovs.go:162] exec(149): /usr/bin/ovn-nbctl --timeout=15 get logical_switch_port statefulset-8920_ss-0 _uuid
I0902 04:08:57.850821 61 ovs.go:165] exec(149): stdout: ""
I0902 04:08:57.850845 61 ovs.go:166] exec(149): stderr: "ovn-nbctl: no row \"statefulset-8920_ss-0\" in table Logical_Switch_Port\n"
I0902 04:08:57.850853 61 ovs.go:168] exec(149): err: exit status 1
Can confirm the above commit fixes the problem. Sorry for the noise @hzhou8
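For completeness, a sketch of the pattern that avoids that stale-cache window, assuming the delete is also issued through go-ovn (LSPDel/LSPAdd/Execute as in its Client interface) so the delete transaction blocks until the cache reflects it; this is only illustrative and not necessarily what the actual fix looks like:

```go
package example

import (
	"fmt"

	goovn "github.com/ebay/go-ovn"
)

// recreateLSP deletes and re-adds a logical switch port entirely through
// go-ovn. Because the delete is a blocking go-ovn transaction, its update has
// been applied to the cache by the time Execute returns, so a later cache
// lookup cannot see the stale row the way the ovn-nbctl path above did.
func recreateLSP(c goovn.Client, ls, lsp string) error {
	if delCmd, err := c.LSPDel(lsp); err == nil {
		// Port exists in the cache: delete it via a go-ovn transaction.
		if err := c.Execute(delCmd); err != nil {
			return fmt.Errorf("lsp-del %s failed: %v", lsp, err)
		}
	}
	// An "already exists" check here would now operate on a cache that no
	// longer contains the old row, so we go ahead and add the port.
	addCmd, err := c.LSPAdd(ls, lsp)
	if err != nil {
		return err
	}
	return c.Execute(addCmd)
}
```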
While integrating go-ovn into ovn-kubernetes, @girishmg noticed that we could hit a problem performing an immediate read after a write. For example, if we add a logical switch and then immediately read it back:
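A minimal sketch of that pattern (go-ovn method names as used in the discussion above; error handling elided):

```go
package example

import goovn "github.com/ebay/go-ovn"

// exampleWriteThenRead: create a logical switch and immediately read it back.
func exampleWriteThenRead(ovnDBClient goovn.Client) {
	lsCmd, _ := ovnDBClient.LSAdd("dummy") // queue an insert for switch "dummy"
	_ = ovnDBClient.Execute(lsCmd)         // transact against ovsdb-server

	// Immediate read from the go-ovn cache. If the cache has not yet seen the
	// update notification for the insert, lsList is empty even though the
	// transaction succeeded.
	lsList, _ := ovnDBClient.LSGet("dummy")
	_ = lsList
}
```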
The read can fail because a write transaction never guarantees that the cache is updated when it returns. Therefore we need to wait for an update to happen from ovsdb. One way to fix this is to block returning in the go-ovn transaction until the next update comes back and the cache is synced.
@hzhou8 thoughts?