Hrm. Can you see what IDs are being used in the database?
Well, I can only see the IDs if I stop the networking daemon and let the queue build, but that's not the problem scenario.
When the networking daemon is running, I can't see the IDs because the database is cleared too quickly. Perhaps I can put a print statement in the daemon to print the ID it's handling?
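Something along these lines, as a rough sketch; the loop shape and the names (pending_actions, handle) are stand-ins, not HIL's actual code:

```python
# Hypothetical sketch of where the debug print would go in the
# networking daemon's main loop; names are illustrative only.
import time

while True:
    for action in pending_actions():      # read the queued actions
        print("============= FROM NETWORKING DAEMON =============")
        print(action.id)                  # the ID being handled
        handle(action)                    # apply it to the switch
    time.sleep(2)                         # the daemon's poll interval
```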
Turns out the version of HIL on which I ran into this issue predates the fix for https://github.com/CCI-MOC/hil/issues/314 in #856.
Though this problem was not discussed on that issue (or even discovered at that point).
I set everything up again with the latest version of HIL and this time it ran without any issues. I used the same PowerConnect switch.
> Can you see what IDs are being used in the database?
On the older version of HIL (the problem setup), I put in a statement to print action.id, and it was in fact skipping some IDs (it handled 399, 401, and 402 but skipped 400). Looking at the access logs confirms that the port_revert request for all 3 ports went through correctly.
10.0.0.16 - - [09/Nov/2017:20:04:41 +0000] "GET /node/slurm-compute1 HTTP/1.1" 200 185 "-" "python-requests/2.18.4"
10.0.0.16 - - [09/Nov/2017:20:04:41 +0000] "POST /switch/powerconnect/port/gi1/0/10/revert HTTP/1.1" 200 - "-" "python-requests/2.18.4"
10.0.0.16 - - [09/Nov/2017:20:04:42 +0000] "GET /node/slurm-compute2 HTTP/1.1" 200 185 "-" "python-requests/2.18.4"
10.0.0.16 - - [09/Nov/2017:20:04:43 +0000] "POST /switch/powerconnect/port/gi1/0/12/revert HTTP/1.1" 200 - "-" "python-requests/2.18.4"
10.0.0.16 - - [09/Nov/2017:20:04:43 +0000] "GET /node/slurm-compute3 HTTP/1.1" 200 185 "-" "python-requests/2.18.4"
10.0.0.16 - - [09/Nov/2017:20:04:44 +0000] "POST /switch/powerconnect/port/gi1/0/23/revert HTTP/1.1" 200 - "-" "python-requests/2.18.4"
10.0.0.16 - - [09/Nov/2017:20:04:44 +0000] "GET /node/slurm-compute1 HTTP/1.1" 200 185 "-" "python-requests/2.18.4"
10.0.0.16 - - [09/Nov/2017:20:04:45 +0000] "POST /switch/powerconnect/port/gi1/0/10/revert HTTP/1.1" 200 - "-" "python-requests/2.18.4"
10.0.0.16 - - [09/Nov/2017:20:04:45 +0000] "GET /node/slurm-compute2 HTTP/1.1" 200 185 "-" "python-requests/2.18.4"
10.0.0.16 - - [09/Nov/2017:20:04:46 +0000] "POST /switch/powerconnect/port/gi1/0/12/revert HTTP/1.1" 200 - "-" "python-requests/2.18.4"
10.0.0.16 - - [09/Nov/2017:20:04:46 +0000] "GET /node/slurm-compute3 HTTP/1.1" 200 185 "-" "python-requests/2.18.4"
10.0.0.16 - - [09/Nov/2017:20:04:47 +0000] "POST /switch/powerconnect/port/gi1/0/23/revert HTTP/1.1" 409 69 "-" "python-requests/2.18.4"
DEBUG:passlib.registry:registered 'sha512_crypt' handler: <class 'passlib.handlers.sha2_crypt.sha512_crypt'>
============= FROM NETWORKING DAEMON =============
399
DEBUG:hil.ext.switches.dell:Logged in to switch <hil.ext.switches.dell.PowerConnect55xx object at 0x3078d90>
DEBUG:hil.ext.switches.dell:Sending to switch <hil.ext.switches.dell.PowerConnect55xx object at 0x3078d90>: 'config'
DEBUG:hil.ext.switches.dell:Sending to switch <hil.ext.switches.dell.PowerConnect55xx object at 0x3078d90>: u'int gi1/0/10'
DEBUG:hil.ext.switches.dell:Sending to switch <hil.ext.switches.dell.PowerConnect55xx object at 0x3078d90>: 'sw trunk allowed vlan none'
DEBUG:hil.ext.switches.dell:Sending to switch <hil.ext.switches.dell.PowerConnect55xx object at 0x3078d90>: 'sw trunk native vlan none'
DEBUG:hil.ext.switches.dell:Sending to switch <hil.ext.switches.dell.PowerConnect55xx object at 0x3078d90>: 'exit'
DEBUG:hil.ext.switches.dell:Sending to switch <hil.ext.switches.dell.PowerConnect55xx object at 0x3078d90>: 'exit'
DEBUG:hil.ext.switches.dell:Sending to switch <hil.ext.switches.dell.PowerConnect55xx object at 0x3078d90>: 'exit'
DEBUG:hil.ext.switches._dell_base:Logged out of switch <hil.ext.switches.dell.PowerConnect55xx object at 0x3078d90>
============= FROM NETWORKING DAEMON =============
401
DEBUG:hil.ext.switches.dell:Logged in to switch <hil.ext.switches.dell.PowerConnect55xx object at 0x306a590>
DEBUG:hil.ext.switches.dell:Sending to switch <hil.ext.switches.dell.PowerConnect55xx object at 0x306a590>: 'config'
DEBUG:hil.ext.switches.dell:Sending to switch <hil.ext.switches.dell.PowerConnect55xx object at 0x306a590>: u'int gi1/0/23'
DEBUG:hil.ext.switches.dell:Sending to switch <hil.ext.switches.dell.PowerConnect55xx object at 0x306a590>: 'sw trunk allowed vlan none'
DEBUG:hil.ext.switches.dell:Sending to switch <hil.ext.switches.dell.PowerConnect55xx object at 0x306a590>: 'sw trunk native vlan none'
DEBUG:hil.ext.switches.dell:Sending to switch <hil.ext.switches.dell.PowerConnect55xx object at 0x306a590>: 'exit'
DEBUG:hil.ext.switches.dell:Sending to switch <hil.ext.switches.dell.PowerConnect55xx object at 0x306a590>: 'exit'
============= FROM NETWORKING DAEMON =============
402
DEBUG:hil.ext.switches.dell:Sending to switch <hil.ext.switches.dell.PowerConnect55xx object at 0x306a590>: 'config'
DEBUG:hil.ext.switches.dell:Sending to switch <hil.ext.switches.dell.PowerConnect55xx object at 0x306a590>: u'int gi1/0/10'
DEBUG:hil.ext.switches.dell:Sending to switch <hil.ext.switches.dell.PowerConnect55xx object at 0x306a590>: 'sw trunk allowed vlan none'
DEBUG:hil.ext.switches.dell:Sending to switch <hil.ext.switches.dell.PowerConnect55xx object at 0x306a590>: 'sw trunk native vlan none'
DEBUG:hil.ext.switches.dell:Sending to switch <hil.ext.switches.dell.PowerConnect55xx object at 0x306a590>: 'exit'
DEBUG:hil.ext.switches.dell:Sending to switch <hil.ext.switches.dell.PowerConnect55xx object at 0x306a590>: 'exit'
DEBUG:hil.ext.switches.dell:Sending to switch <hil.ext.switches.dell.PowerConnect55xx object at 0x306a590>: 'exit'
DEBUG:hil.ext.switches._dell_base:Logged out of switch <hil.ext.switches.dell.PowerConnect55xx object at 0x306a590>
Now I am not sure whether the refactoring in #856 took care of the issue I am describing here, but I couldn't reproduce it with the newer version of HIL. I tested around 40 port_revert operations (cycling over 4 switchports) and it ran without a hitch. We still don't know the root cause on the old HIL setup, so I am unsure whether I should close this issue or keep it open.
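For reference, the stress test was just a loop along these lines; the server address, the use of raw requests instead of the HIL client, and the port list are assumptions pieced together from the access logs earlier in this thread:

```python
# Hypothetical reconstruction of the stress-test loop; BASE is a
# placeholder address, and only 3 of the 4 switchports used in the
# test are named in this thread.
import requests

BASE = "http://hil-server"
PORTS = ["gi1/0/10", "gi1/0/12", "gi1/0/23"]

for i in range(40):
    port = PORTS[i % len(PORTS)]
    r = requests.post("%s/switch/powerconnect/port/%s/revert" % (BASE, port))
    print(port, r.status_code)    # 200 on success, 409 on conflict
```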
Note to self: always ask if bugs can be reproduced on master.
Hypothesis: previously, this statement was deleting the actions that were being skipped:
https://github.com/CCI-MOC/hil/pull/856/files#diff-d550c3bbac185ae67f8c46d3b6b44586L105
The current implementation makes that impossible because it specifically deletes the actions it saw, rather than just saying "delete all of them."
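To make the difference concrete, here's a rough sketch of the two patterns, assuming a SQLAlchemy model named NetworkingAction and a helper apply_action; this is illustrative, not the literal HIL code:

```python
# Old pattern (hypothesized): the bulk delete is re-evaluated at
# delete time, so it can remove actions committed after the read.
for action in db.session.query(NetworkingAction).all():
    apply_action(action)
db.session.query(NetworkingAction).delete()   # "delete all of them"
db.session.commit()

# New pattern: delete exactly the rows we iterated over, one by one.
for action in db.session.query(NetworkingAction).all():
    apply_action(action)
    db.session.delete(action)
db.session.commit()
```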
This would suggest that the transaction isolation we're getting isn't as strong as we expect; we probably should make sure we're clear on what the database actually provides. I think my hypothesis is consistent with at least cursor stability, which is one of the weakest guarantees that gets called a "transaction." Need to brush up on some of the theory to remind myself how things work.
I'm going to look into what guarantees postgres & sqlite provide. If those are consistent with my hypothesis I'll close this, since it means the bug was in fact fixed by #856. But we should also document what developers are allowed to assume.
> The current implementation makes that impossible because it specifically deletes the actions it saw, rather than just saying "delete all of them."
I agree.
> This would suggest that the transaction isolation we're getting isn't as strong as we expect;
From the looks of it, I am pretty sure we have 2 independent transactions here: the first reads the actions, and only returns actions committed at that time; the second does the delete, and removes everything committed by the time the delete runs. That is why actions added after the read were skipped: they were deleted without ever being processed.
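Here's a minimal, self-contained sketch of that two-transaction race, using sqlite3 in autocommit mode so that each statement is its own transaction; the table name is made up:

```python
import sqlite3

# Two autocommit connections standing in for the daemon and the API server.
daemon = sqlite3.connect("race_demo.db", isolation_level=None)
api = sqlite3.connect("race_demo.db", isolation_level=None)

daemon.execute("CREATE TABLE IF NOT EXISTS networking_action (id INTEGER)")
api.execute("INSERT INTO networking_action VALUES (399)")

# Transaction 1: the daemon reads the queue.
seen = daemon.execute("SELECT id FROM networking_action").fetchall()
print("processed:", seen)                 # processed: [(399,)]

# A new action is committed in between the read and the delete.
api.execute("INSERT INTO networking_action VALUES (400)")

# Transaction 2: "delete all of them" also removes 400, which was
# never processed.
daemon.execute("DELETE FROM networking_action")
print("left:", daemon.execute("SELECT id FROM networking_action").fetchall())
```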
Transaction isolation would mean locking a row (one row at a time, with cursor stability) within a single transaction, is that correct? Please correct me if I am wrong.
Here is the relevant taxonomy:
https://en.wikipedia.org/wiki/Isolation_%28database_systems%29#Isolation_levels
IIRC, cursor stability is what Wikipedia is calling "Read committed." In this case, the DELETE in the old implementation would be doing (essentially) a second read, so it would not necessarily be consistent with the first read, and thus may delete a different set of objects from the one we iterated over. So that would be consistent with the behavior we saw.
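This is straightforward to observe against Postgres, whose default level is read committed. A sketch, assuming a local database named test (psycopg2 opens a transaction implicitly at the first statement; the table name is made up):

```python
import psycopg2

a = psycopg2.connect("dbname=test")   # plays the daemon
b = psycopg2.connect("dbname=test")   # plays the API server
ca, cb = a.cursor(), b.cursor()

ca.execute("CREATE TABLE IF NOT EXISTS networking_action (id int)")
ca.execute("INSERT INTO networking_action VALUES (399)")
a.commit()

# One transaction on connection a: SELECT, then DELETE.
ca.execute("SELECT id FROM networking_action")
print("processed:", ca.fetchall())    # [(399,)]

# Another transaction commits a new row in between the two statements.
cb.execute("INSERT INTO networking_action VALUES (400)")
b.commit()

# Under READ COMMITTED the DELETE takes a fresh snapshot (the "second
# read"), so it also deletes 400, which we never iterated over.
ca.execute("DELETE FROM networking_action")
a.commit()
```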
I feel like I investigated this before and found that Flask-SQLAlchemy implicitly makes everything between calls to db.commit/close/rollback a transaction on db.session, but perhaps worth double-checking.
After discussing this with Naved and reviewing the code: the newer version explicitly deletes each action as it processes it. Yes, this uses more SQL statements and more commits. However, it doesn't depend on how Flask-SQLAlchemy implements transactions on a db.session, which seems to be the source of confusion.
If this were processing large datasets, then spending ~30 milliseconds per action could be a source of inefficiency. But in this case, given the relatively small size of the dataset and the infrequency of the operations, the benefit of a robust solution outweighs the optimizations attempted in the old implementation.
Furthermore, the optimization I would recommend in the future would be more along the lines of fetching n actions (ideally with a row-level lock, but a table lock will also do), processing the n actions and deleting as you go, then committing after processing the block (releasing either the row-level lock or the table-level lock). This way you don't lock out other processes; see the sketch below.
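A hedged sketch of what I mean, using SQLAlchemy's with_for_update() for the row-level lock; NetworkingAction and apply_action are stand-in names:

```python
def process_batch(session, n=10):
    # Lock up to n pending actions so concurrent workers can't grab
    # them; skip_locked is Postgres-specific, drop it for a plain
    # FOR UPDATE.
    actions = (session.query(NetworkingAction)
               .order_by(NetworkingAction.id)
               .with_for_update(skip_locked=True)
               .limit(n)
               .all())
    for action in actions:
        apply_action(action)      # hit the switch
        session.delete(action)    # delete as you go
    session.commit()              # releases the row locks
```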
Although it is important to investigate Flask-SQLAlchemy, I feel this investigation is beyond the scope of this issue.
Also worth noting the cost of hitting the switch is likely dramatically higher than anything we're doing with the database. I think any kind of optimization work here is premature.
I think we should close this one; the actual problem is fixed, and I agree with @robbaronbu that working out the semantics of Flask-SQLAlchemy is out of scope.
Thanks @robbaronbu for looking into this! I am gonna close this now.
This has me worried.
I am using a dell power-connect switch. The networking daemon is set to sleep for 2 seconds before it checks the journal again.
I have 3 ports on that switch:
1st port = gi1/0/10
2nd port = gi1/0/12
3rd port = gi1/0/23
I have a script that asks HIL to perform port_revert on those 3 ports. For some reason, the networking server sometimes skips the 2nd port, and sometimes both the 2nd and 3rd are skipped. Note: it logs out of the switch after working on the 1st port, and then logs back in again for the 3rd port.
At the end of this, the networking action table was cleared. The show node command showed that networks were still attached to the node whose port was skipped.
If I stop the networking daemon, let the networking action table fill up with the 3 actions, and then start the daemon again, it correctly performs port_revert on all 3 ports (and there is no disconnect from the switch; it uses the same cached session object, I guess). Alternatively, if I set the sleep time for the networking daemon to 10 seconds, that also seems to work. Logs here.
Surely this cannot be the expected behaviour.