Closed issue 1468ca0b-2a64-4fb4-8e52-ea5806644b4c (closed 6 years ago)
Created by: andriytk
The performance regression could be caused by the fact that messages are not taken out of the CEP buffer when the guard does not pass, which means that the same messages will be checked again and again, many times.
extractMatchMsg :: Serializable a
                => (a -> g -> l -> Process (Maybe b))
                -> l
                -> Buffer
                -> State.StateT g Process (Maybe (Extraction b))
extractMatchMsg p l buf = go (-1)
  where
    go lastIdx =
        case bufferGetWithIndex lastIdx buf of
          Just (newIdx, a, newBuf) -> do
            res <- State.get >>= \g -> lift (Catch.mask_ $ trySome $ p a g l)
            case res of
              Left _ -> return Nothing
              Right Nothing -> go newIdx  -- HERE: the non-matching message stays in the buffer and is re-checked on every pass
              Right (Just b) ->
                let ext = Extraction
                      { _extractBuf = newBuf
                      , _extractMsg = b
                      , _extractIndex = newIdx
                      }
                in return $ Just ext
          _ -> return Nothing

    trySome :: Process a -> Process (Either Catch.SomeException a)
    trySome = Catch.try
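To make the cost concrete, here is a small self-contained model of the behaviour described above. It is not the actual CEP buffer code; it only counts guard evaluations when failed messages are left in a (list-modelled) buffer and re-scanned after each new delivery, giving quadratic work overall.

```haskell
import Data.IORef

-- Toy model: if messages whose guard fails stay in the buffer, every
-- later scan re-checks them, so n deliveries cost O(n^2) guard
-- evaluations in total.
countGuardCalls :: Int -> IO Int
countGuardCalls n = do
  calls <- newIORef (0 :: Int)
  let failingGuard _ = modifyIORef' calls (+ 1) >> return False
      scan buf = mapM_ failingGuard buf
  -- After the k-th delivery the buffer holds k messages, all re-scanned.
  mapM_ (\k -> scan [1 .. k :: Int]) [1 .. n]
  readIORef calls

main :: IO ()
main = countGuardCalls 100 >>= print  -- 5050 evaluations for 100 messages
```

With 100 deliveries the failed guard runs 1 + 2 + … + 100 = 5050 times instead of 100, which is the kind of repeated work the comment above points at.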
Created by: andriytk
I struggled with this, oh yes...
Created by: andriytk
I'm not sure it is because of this...
Created by: andriytk
I know, I tried different ones and chose the one which looked better here for me.
Created by: andriytk
Yes. To me, it seems clear from the context.
Created by: vvv
`any` is lazy, `foldl'` is strict. This explains the increased number of allocations and the performance degradation.
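A minimal illustration of the difference (a generic example, not the patch code): `any` short-circuits at the first match, while `foldl'` must consume the entire list even after the answer is known.

```haskell
import Data.List (foldl')

-- `any` short-circuits: it stops at the first element satisfying the
-- predicate, so the rest of the list (even `undefined`) is never forced.
lazyCheck :: Bool
lazyCheck = any even [1, 2, undefined]

-- `foldl'` is strict in its accumulator and always walks the whole
-- list, allocating work on every step even after the answer is known.
strictCheck :: Bool
strictCheck = foldl' (\acc x -> acc || even x) False [1 .. 1000000 :: Int]

main :: IO ()
main = print (lazyCheck, strictCheck)  -- (True,True)
```

Both produce `True`, but `strictCheck` traverses a million elements to get there, which matches the extra allocations observed.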
Created by: vvv
No braces are needed around `FieldRec l`.
[nit] The commonly used indentation style is to put `=>` and `->` under `::`, unless the type signature fits in 80 characters:
nGuard :: HasCallStack
       => HAEvent InternalObjectStateChangeMsg
       -> g -> FieldRec l -> Process (Maybe UUID)
or
nGuard :: HasCallStack
       => HAEvent InternalObjectStateChangeMsg
       -> g
       -> FieldRec l
       -> Process (Maybe UUID)
Created by: vvv
What does "Ns" stand for? Notifications?
s/nGuard/notificationGuard/ -- or simply `guard`
s/pendingNs/pendingNotifications/
Created by: vvv
s/iosc/changes/
Created by: vvv
You can output an error message without `trace`: an `IO` computation can be lifted to `Process` with `liftIO`.
I guess the proper way is to use the `log<Something>` API.
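A sketch of the `liftIO` suggestion. `Process` has a `MonadIO` instance, so any `IO` action can be lifted; the sketch below is written against the `MonadIO` class so it compiles without distributed-process, and `reportError` is an illustrative name, not an existing Halon function.

```haskell
import Control.Monad.IO.Class (MonadIO, liftIO)
import System.IO (hPutStrLn, stderr)

-- Works in any MonadIO monad, including distributed-process's Process.
-- `reportError` is a hypothetical helper for illustration only.
reportError :: MonadIO m => String -> m ()
reportError msg = liftIO $ hPutStrLn stderr ("error: " ++ msg)

main :: IO ()
main = reportError "something went wrong"
```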
Created by: vvv
We should avoid leaving `trace`s in production code.
Created by: andriytk
Actually, as it turns out, the cluster startup with 60 clovis-apps shows worse results with this patch for some reason: it allocates more memory and takes more time to start. Compare:
Without the patch:
Jul 04 10:57:03 sage73 time[8140]: 70,816,572,032 bytes allocated in the heap
Jul 04 10:57:03 sage73 time[8140]: 1,158,958,224 bytes copied during GC
Jul 04 10:57:03 sage73 time[8140]: 50,838,744 bytes maximum residency (26 sample(s))
Jul 04 10:57:03 sage73 time[8140]: 16,286,888 bytes maximum slop
Jul 04 10:57:03 sage73 time[8140]: 216 MB total memory in use (0 MB lost due to fragmentation)
Jul 04 10:57:03 sage73 time[8140]: Tot time (elapsed) Avg pause Max pause
Jul 04 10:57:03 sage73 time[8140]: Gen 0 2011 colls, 0 par 2.187s 2.483s 0.0012s 0.0088s
Jul 04 10:57:03 sage73 time[8140]: Gen 1 26 colls, 0 par 0.613s 0.736s 0.0283s 0.0511s
Jul 04 10:57:03 sage73 time[8140]: TASKS: 19 (13 bound, 6 peak workers (6 total), using -N1)
Jul 04 10:57:03 sage73 time[8140]: SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
Jul 04 10:57:03 sage73 time[8140]: INIT time 0.002s ( 0.004s elapsed)
Jul 04 10:57:03 sage73 time[8140]: MUT time 90.489s (448.165s elapsed)
Jul 04 10:57:03 sage73 time[8140]: GC time 2.800s ( 3.219s elapsed)
Jul 04 10:57:03 sage73 time[8140]: RP time 0.000s ( 0.000s elapsed)
Jul 04 10:57:03 sage73 time[8140]: PROF time 0.000s ( 0.000s elapsed)
Jul 04 10:57:03 sage73 time[8140]: EXIT time 0.000s ( 0.040s elapsed)
Jul 04 10:57:03 sage73 time[8140]: Total time 93.298s (451.428s elapsed)
Jul 04 10:57:03 sage73 time[8140]: Alloc rate 782,602,823 bytes per MUT second
Jul 04 10:57:03 sage73 time[8140]: Productivity 97.0% of total user, 99.3% of total elapsed
Jul 04 10:57:03 sage73 time[8140]: gc_alloc_block_sync: 0
Jul 04 10:57:03 sage73 time[8140]: whitehole_spin: 0
Jul 04 10:57:03 sage73 time[8140]: gen[0].sync: 0
Jul 04 10:57:03 sage73 time[8140]: gen[1].sync: 0
Jul 04 10:57:04 sage73 time[8140]: 93.42user 26.85system 7:32.39elapsed 26%CPU (0avgtext+0avgdata 567364maxresident)k
Jul 04 10:57:04 sage73 time[8140]: 0inputs+480560outputs (97major+201945minor)pagefaults 0swaps
$ cat /var/log/halon.decision.log | awk '/^2018/ {t=$2} /notifyMeroAsynch.epoch/ {printf("%s: %s\n", t, $0)} /markNotificationDelivered/ {m=$0} /tryCompleteStateDiff.remaining.*\[\]/ {printf("%s: %s\n", t, m)}'
09:49:40.880283: §DEBUG notifyMeroAsynch.epoch => 0 notifyMeroAsynch.nodes => [] ()
09:49:40.880283:
09:49:44.931221: §DEBUG notifyMeroAsynch.epoch => 1
...
09:51:59.64498: §Invoked: markNotificationDelivered epoch => 125
09:52:00.301219: §Invoked: markNotificationDelivered epoch => 126
09:52:00.451257: §Invoked: markNotificationDelivered epoch => 127
With the patch:
Jul 04 10:45:43 sage73 time[6343]: 130,703,798,736 bytes allocated in the heap
Jul 04 10:45:43 sage73 time[6343]: 1,044,565,280 bytes copied during GC
Jul 04 10:45:43 sage73 time[6343]: 51,887,736 bytes maximum residency (26 sample(s))
Jul 04 10:45:43 sage73 time[6343]: 16,049,576 bytes maximum slop
Jul 04 10:45:43 sage73 time[6343]: 198 MB total memory in use (0 MB lost due to fragmentation)
Jul 04 10:45:43 sage73 time[6343]: Tot time (elapsed) Avg pause Max pause
Jul 04 10:45:43 sage73 time[6343]: Gen 0 3845 colls, 0 par 2.446s 2.583s 0.0007s 0.0068s
Jul 04 10:45:43 sage73 time[6343]: Gen 1 26 colls, 0 par 0.582s 0.620s 0.0239s 0.0392s
Jul 04 10:45:43 sage73 time[6343]: TASKS: 19 (13 bound, 6 peak workers (6 total), using -N1)
Jul 04 10:45:43 sage73 time[6343]: SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
Jul 04 10:45:43 sage73 time[6343]: INIT time 0.002s ( 0.004s elapsed)
Jul 04 10:45:43 sage73 time[6343]: MUT time 115.395s (207.389s elapsed)
Jul 04 10:45:43 sage73 time[6343]: GC time 3.028s ( 3.204s elapsed)
Jul 04 10:45:43 sage73 time[6343]: RP time 0.000s ( 0.000s elapsed)
Jul 04 10:45:43 sage73 time[6343]: PROF time 0.000s ( 0.000s elapsed)
Jul 04 10:45:43 sage73 time[6343]: EXIT time 0.020s ( 0.023s elapsed)
Jul 04 10:45:43 sage73 time[6343]: Total time 118.455s (210.619s elapsed)
Jul 04 10:45:43 sage73 time[6343]: Alloc rate 1,132,666,171 bytes per MUT second
Jul 04 10:45:43 sage73 time[6343]: Productivity 97.4% of total user, 98.5% of total elapsed
Jul 04 10:45:43 sage73 time[6343]: gc_alloc_block_sync: 0
Jul 04 10:45:43 sage73 time[6343]: whitehole_spin: 0
Jul 04 10:45:43 sage73 time[6343]: gen[0].sync: 0
Jul 04 10:45:43 sage73 time[6343]: gen[1].sync: 0
Jul 04 10:45:44 sage73 time[6343]: 119.43user 10.05system 3:31.62elapsed 61%CPU (0avgtext+0avgdata 542504maxresident)k
Jul 04 10:45:44 sage73 time[6343]: 32inputs+453664outputs (92major+193443minor)pagefaults 0swaps
$ cat /var/log/halon.decision.log | awk '/^2018/ {t=$2} /notifyMeroAsynch.epoch/ {printf("%s: %s\n", t, $0)} /markNotificationDelivered/ {m=$0} /tryCompleteStateDiff.remaining.*\[\]/ {printf("%s: %s\n", t, m)}'
09:42:21.818555: §DEBUG notifyMeroAsynch.epoch => 0 notifyMeroAsynch.nodes => [] ()
09:42:21.818555:
09:42:25.861871: §DEBUG notifyMeroAsynch.epoch => 1
...
09:45:01.378134: §Invoked: markNotificationDelivered epoch => 125
09:45:01.599069: §Invoked: markNotificationDelivered epoch => 126
09:45:01.737431: §Invoked: markNotificationDelivered epoch => 127
So I would not recommend landing this patch for now.
Created by: andriytk
Implemented a phase-switching guard for the notifier which checks whether a graph-change notification confirmation is the one it is waiting for. This avoids odd phase switches in the notifier, especially on large cluster configurations, where there may be a large number of notifier instances waiting for their notification confirmations.
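In outline, the guard described above could look like the following. This is a hypothetical sketch: `Confirmation`, `confirmedEpoch`, and `phaseGuard` are illustrative names, not the actual Halon types; the point is only that a notifier instance proceeds solely on the confirmation it is waiting for.

```haskell
-- Hypothetical types for illustration; not the actual Halon code.
newtype Confirmation = Confirmation { confirmedEpoch :: Int }

-- The notifier switches phase only when the confirmed epoch matches
-- the one this instance awaits; other confirmations are ignored.
phaseGuard :: Int -> Confirmation -> Maybe Confirmation
phaseGuard awaited c
  | confirmedEpoch c == awaited = Just c   -- our confirmation: proceed
  | otherwise                   = Nothing  -- someone else's: keep waiting

main :: IO ()
main = do
  print (confirmedEpoch <$> phaseGuard 7 (Confirmation 7))  -- Just 7
  print (confirmedEpoch <$> phaseGuard 7 (Confirmation 6))  -- Nothing
```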