MicroTrendsLtd / NinjaTrader8

NinjaTrader8 Components Strategies and Trading tools
MIT License
72 stars 18 forks source link

Commit released on 2020-12-02 consistently Crashes NT8 #12

Closed jmscraig closed 3 years ago

jmscraig commented 3 years ago

Hi Tom,

I download the Zip file of today's release from Here https://github.com/MicroTrendsLtd/NinjaTrader8

Crashed NT8 to a forced exit on both my VPS and local development machine each three times.

I made no changes to the code.

The only change I made to the Property settings in the UI was, after a few crashes I turned DEBUG trace so I could gather crash state information you.

I looked at todays updates in the code but did not see anything as an obvious root cause. Felt like either new workflow or Queuing .. or anyway was a lot less bloody on order exceptions and Cancel Pending rejections than the issues I have been working on. The Stack Overflow Messages are also new.

Given the Stackoverflow messages maybe an unusually long running loop in the workflow?

Inconsistent final Workflow state but consistently JUST AFTER PositionClose().

Let me know if you want the detailed logs.

NT8 Crash-2020-12-02

NT8 Crash-2020-12-02-Bottom-Half

image

NT8 Log file that goes with that image above. image

And another image

From one of the crashes on the VPS, the snip was all I could grab from this crash before NT8 said "See Ya!" image

And the trace log to go with it.. image

MicroTrendsTom commented 3 years ago

hi yes i have had this reported also . working on this now -the good thing is this is replicable on multiple machines

MicroTrendsTom commented 3 years ago

"The Stack Overflow Messages are also new. Given the Stackoverflow messages maybe an unusually long running loop in the workflow?"
This was the Print method erroring and then catching the exception and then printing it again ... instead of using base. print...

i am yet to reproduce this and cannot as yes see any stackoveflow etc.,. are you seeing new?

MicroTrendsTom commented 3 years ago

Inconsistent final Workflow state but consistently JUST AFTER PositionClose(). ok will check this...

Let me know if you want the detailed logs.

Yes please that would be great -the text files etc

MicroTrendsTom commented 3 years ago

time for a history code compare - im also using TFS on my derived project that tests this so i should be able to make some ideas there even rollbackl and add each piece but -currently not replicated the error. maybe i need more vigorous testing etc

MicroTrendsTom commented 3 years ago

so after doing a code compare - Diff in team explorer - on the current version and prior the main change is locking, concurrent queue, OnOrderUpdate large grain lock, fine grain locks,

Partial Rollback 1.rollback the concurrent q to a fine grain lock -
2.refactor the order of events to match the prior to make file diff easy 3.OnExec-- else if (execution.Order == orderEntryPrior) - Rollback

  1. OnPositionUpdate Rollback +others TBD

New

  1. IsUseSignalQFallbackForSignals add a parameter to not use the Q at all even if a trade signal is missed due to this.
  2. IsSimplePrintMode - add a simple print mode -
  3. refactor ProcessTradeEventQueue with finegrain locks
  4. public override void OnCalculateMinMax() - catch erroneous intermittent excpetion from base class 5.OnBarUpdate Try catch

Leave as is

  1. New IsFlattenOnTransition and TradeWorkflowState leave in -disable/bypass
  2. if (ATSAlgoSystemState == AlgoSystemState.HisTradeRT) and state setting onExec
  3. leave the finegrain locks in
  4. OneExec removal of unused local entryOrderInFlightCollision = true;
  5. lock (onOrderUpdateLockObject)
  6. Print stackoverflow error avoidance on catch

then redeploy

MicroTrendsTom commented 3 years ago

Seems to have rolled back the issue as we can see in this test: https://www.screencast.com/t/BxfpbZQ0Dn8h So now i will leave that to soak test.. i might even film it and leave it all running so i can see if it crashes etc...and view it hmmmm

jmscraig commented 3 years ago

ATS.NT8.20201202.Trace.txt

Morning.

Attached is the most recent DEBUG from the Dev Machine.

Github is a little grumpy with the request to load the file from the VPS, Lol

image

The detail of the end of the trace file from the VPS is already pasted in above. .. I trimmed the DEBUG Log from the VPS

ATS.NT8.20201202-trimmed.Trace.txt

jmscraig commented 3 years ago

"Leave as is New IsFlattenOnTransition and TradeWorkflowState leave in -disable/bypass if (ATSAlgoSystemState == AlgoSystemState.HisTradeRT) and state setting onExec leave the finegrain locks in OneExec removal of unused local entryOrderInFlightCollision = true; lock (onOrderUpdateLockObject) Print stackoverflow error avoidance on catch"

I have not tracked which of the commits came from yesterday.. For sure in my experience the issue surfaced from commits yesterday and not the day before, very crisp difference like turning on a light switch in a dark room.

"Seems to have rolled back the issue as we can see in this test:"

Good. I will download and test it

"i might even film it and leave it all running so i can see if it crashes etc...and view it hmmmm "

Not sure I would invest the time just to film.. What I saw repeated was one of two:

  1. Everything is running grand and then "Poof", he screen is clear and NT8 is just gone. No shutdown time, no short wailing groan of death, just gone.
  2. A stackoverflow exception pops up and you see exactly what I pasted in the images in the first post.
jmscraig commented 3 years ago

Speeding the test cycle... When the market was slow it took 3 - 20 minutes to surface the error for me when running the ATSSMACrossOver example on a ES 6 tick chart, SMA Fast set to 6 and SMA Slow set to 12.

MicroTrendsTom commented 3 years ago

Not sure I would invest the time just to film.. What I saw repeated was one of two:

  1. Everything is running grand and then "Poof", he screen is clear and NT8 is just gone. No shutdown time, no short wailing groan of death, just gone.
  2. A stackoverflow exception pops up and you see exactly what I pasted in the images in the first post.

This is todays test findings?

Ok i cannot seem to get this so i will run this with visual studio open in debug mode attached to NT and then we should see something more helpful. .the question is how are you testing and what settings etc?

ATSSMACrossOver example on a ES 6 tick chart, SMA Fast set to 6 and SMA Slow set to 12.

ok will try this Ideas on what it is:

  1. Property get/set accessor typo
  2. self referencing looped trade workflow action with no max iteration etc

OnBarUpdate -to remove line 738 //belt and braces check

jmscraig commented 3 years ago

"self referencing looped trade workflow action with no max iteration etc"

or similar .. infinite "retrys" ?

MicroTrendsTom commented 3 years ago

im testing with the same system and it does behave differently to the one i was using derived from it... some progress by elimination atleast

MicroTrendsTom commented 3 years ago

A job for ANTS profiler perhaps - i still cant get a SO as yet...

MicroTrendsTom commented 3 years ago

is it possible if you are going to run to also run ANTS profiler -the idea being that it can tell which method or call was repetitive etc so i have read :-)

MicroTrendsTom commented 3 years ago

i also thought if i ran VStudio in debug attached to NT8 in fact it would popup a SO notification/ halt and potentially provide some insight -etc so i got that running etc and ANTS

tomorrow will also use file diff compare on historical versions to eek out what changed...etc that might have done it etc

i pushed what im using some small edits

jmscraig commented 3 years ago

"im testing with the same system and it does behave differently to the one i was using derived from it... some progress by elimination atleast"

What you released today in response to this issue has eleminated the new SO crash problem for me.. Still auto-Disabling due to over-eager CancelOrder - Unable to verify Flatten but (so back to yesterday's state) but the crash problem is gone.

jmscraig commented 3 years ago

"Still auto-Disabling due to over-eager CancelOrder - Unable to verify Flatten ... " I have three conceptually simple fixes for this but am struggling to get the most important working. Will post on the CancelPending Issue for your review and input.

MicroTrendsTom commented 3 years ago

ok progress - ty awesome! I saw this when the system NT8 had disconnected the code was stuck here and the chart and strategy frozen etc. image

MicroTrendsTom commented 3 years ago

ANS told me nothing towards the SO - but is incredible what it can see image

jmscraig commented 3 years ago

Lots of good detail in ANTS.. we could get lost in there for weeks just rooting around through all the good info Lol..

jmscraig commented 3 years ago

"I saw this when the system NT8 had disconnected the code was stuck here (Account.Cancel.OrdersActive) and the chart and strategy frozen etc.

One of mine froze a few minutes ago but I don't know where. After so many hours without a freeze I was running without logging to see if it could better stay up with incoming ticks.

jmscraig commented 3 years ago

Any conclusions from the freeze at Account.Cancel(OrdersActive)?

Might be a deadlock against the lock on OrdersActive type object ?

Might be the call to Account.Cancel() ? I have seen NT recommendations to lock (Account.All) for read access but I sure we would not need that to execute this function call.

jmscraig commented 3 years ago

A ConcurrentDictionary https://referencesource.microsoft.com/#mscorlib/system/Collections/Concurrent/ConcurrentDictionary.cs,2e5ef5704344a309

... would definitely allow reads and writes from multiple threads with less chance of deadlocks.
ConcurrentBag is a bit slower (not slower than a deadlock). By default internally ConcurrentBag creates two similar Lists after every edit so that two threads can read at the same time without locking.

jmscraig commented 3 years ago

CurentBag is a little tricky...


"ConcurrentBag has been designed for when most of the time an item is dequeued by the thread that enqueued it" (often, but not always that is us) "and suffers from important performance issues if used otherwise.

ConcurrentBag is optimized for scenarios where same thread act as producer and consumer. That means same thread is adding and retrieving the data.

http://dotnetpattern.com/csharp-concurrentbag

jmscraig commented 3 years ago

Checking around to get current.. ConcurrentDictionary is still a popular answer for this type of need (if you have decided you need to replace the Lock + List approach)

The surprise was to see how often the faster than a Dictionary ConcurrentQueue was pitched as developers frequent go to by default answer

MicroTrendsTom commented 3 years ago

yes i see concurrentDict has good press so the idea to replace the lists etc ?

Concurrent Q i removed... last commit rolledback. simply by enQ in backtest to the Queue<> instance caused a no return from the call from the method - so that is worrying indeed... we only use it in realtime - but what i am reading is one thread to enQ and 1 thread to dQ - we cant have that level of control in NT8 it will hang trying.. .so a finegrained lock Queue is voted as favourite by MS docs for that usage so perhaps that is ok as is.

For concurrent Dict for sure we can add that and see if NT8 work with it... in backtest and realtime.... that is the test. What is the proposal? we would need to have a very isolated change to allow rollback to prior in case backtest cant handle it etc... or we use a different model for realtime etc

jmscraig commented 3 years ago

Sounds good. " What is the proposal? we would need to have a very isolated change " starting with one list...unfortunately the right one to try is also the list with the most required code changes

MicroTrendsTom commented 3 years ago

Any conclusions from the freeze at Account.Cancel(OrdersActive)? Might be a deadlock against the lock on OrdersActive type object ?
Might be the call to Account.Cancel() ? I have seen NT recommendations to lock (Account.All) for read access but I sure we would not need that to execute this function call.

i am refactoring the CAncelAllOrders to allow some flexibility and experimentation of modes as you are right its cancel overkill!!!!

MicroTrendsTom commented 3 years ago

i have also just refactored the naming of all OrderRelated methods and props - i think i got them all etc so i need a while to check that all in

jmscraig commented 3 years ago

So.... all the concurrent class method actually create copies of lists and use locks ... We could mimic the same.

After every Add or Remove session .Clear() the second list and use the for loop against the List[index] to copy item by item into the second list and segment who would read the original and who would read the second cutting the locking interdependencies in half, etc.

Had to mention this because Lists are better than Dictionaries at "contains" calls.

MicroTrendsTom commented 3 years ago

So.... all the concurrent class method actually create copies of lists and use locks ... We could mimic the same. yes i think it will be much more performant also

jmscraig commented 3 years ago

"Had to mention this because Lists are better than Dictionaries at "contains" calls" ... and this would be much faster and less dramatic to implement for testing and forever.

MicroTrendsTom commented 3 years ago

100% agreed - less impact etc etc etc i will get ready to check this latest commit in - i think we might just abandon the usage of Account.Cancel(OrdersActive)? in preferency for Account.CancelAllOrders(this.Instrument); in realtime as default

jmscraig commented 3 years ago

and this bit "[keep reusing the same seconds list by using] a for loop against the List[index] to copy item by item into the second list"

will keep us way happier with Garbage Collection pressure than just using a = New to keep creating new lists

jmscraig commented 3 years ago

abondon using Account.Cancel(OrdersActive)? in preferency for Account.CancelAllOrders(this.Instrument)

Honesty.. yes.. that puts a lot of pressure on OrdersActive. So Yes

MicroTrendsTom commented 3 years ago

and this bit "[keep reusing the same seconds list by using] a for loop against the List[index] to copy item by item into the second list"
Yes totally agreed - low latency patterns avoid create and destroy create etc... perhaps there can there be an AddRange? also... etc

MicroTrendsTom commented 3 years ago

ok so we go this,... basically if (IsHistoricalTradeOrPlayBack) if(IsOrderActiveCanCancel(orderStop1)) CancelOrder(orderStop1);

plus this if (!oCOLetCancel) {//these dont need to cancel if order stop cancel called if (IsOrderActiveCanCancel(orderTarget1)) CancelOrder(orderTarget1);

if (isRealtime) Account.CancelAllOrders(this.Instrument);

MicroTrendsTom commented 3 years ago

im not aware of a historical order collection must be there somewhere in private fields

jmscraig commented 3 years ago

ok so we go this,... basically if (IsHistoricalTradeOrPlayBack) if(IsOrderActiveCanCancel(orderStop1)) CancelOrder(orderStop1);

if (isRealtime) Account.CancelAllOrders(this.Instrument);

Is a little trickier .. you have another (second) bool in the first line now.

if (IsHistoricalTradeOrPlayBack || !IsOrderCancelInspectEachOrDoBatchCancel)
MicroTrendsTom commented 3 years ago
          if (IsHistoricalTradeOrPlayBack)
            {
                if (IsOrderCancelInspectEachOrDoBatchCancel)
                {

                    if(IsOrderActiveCanCancel(orderStop1)) CancelOrder(orderStop1);
                    if (IsOrderActiveCanCancel(orderStop2)) CancelOrder(orderStop2);
                    if (IsOrderActiveCanCancel(orderStop3)) CancelOrder(orderStop3);
                    if (IsOrderActiveCanCancel(orderStop4)) CancelOrder(orderStop4);

                    if (!oCOLetCancel)
                    {//these dont need to cancel if order stop cancel called
                        if (IsOrderActiveCanCancel(orderTarget1)) CancelOrder(orderTarget1);
                        if (IsOrderActiveCanCancel(orderTarget2)) CancelOrder(orderTarget2);
                        if (IsOrderActiveCanCancel(orderTarget3)) CancelOrder(orderTarget3);
                        if (IsOrderActiveCanCancel(orderTarget4)) CancelOrder(orderTarget4);
                    }

                    if (IsOrderActiveCanCancel(orderEntry)) CancelOrder(orderEntry);
                    if (IsOrderActiveCanCancel(orderEntryOCOLong)) CancelOrder(orderEntryOCOLong);
                    if (IsOrderActiveCanCancel(orderEntryOCOShort)) CancelOrder(orderEntryOCOShort);
                    if (IsOrderActiveCanCancel(orderTarget4)) CancelOrder(orderTarget4);

                }
                else  //batch
                {
                    lock (ordersActiveLockObject)
                    {
                        if (IsOrdersActiveExist(OrdersActive))
                        {
                            Account.Cancel(OrdersActive);
                            OrdersActive.Clear();
                        }
                    }
                }
            }
            else //isRealtime
            {
                Account.CancelAllOrders(this.Instrument);
             }
MicroTrendsTom commented 3 years ago

we can remove the batch altogether and the pore cancel checks if need be

jmscraig commented 3 years ago

another Lock > if(IsOrderActiveCanCancel(orderStop1)) CancelOrder(orderStop1); another Lock > if (IsOrderActiveCanCancel(orderStop2)) CancelOrder(orderStop2); another Lock > if (IsOrderActiveCanCancel(orderStop3)) CancelOrder(orderStop3); etc..

would be fewer calls, less time locked faster to lock once and iterate through OrdersActive[idx] with a for loop and sending CancelOrder(OrdersActive[ i ]) for each existing member of the list.

MicroTrendsTom commented 3 years ago

if(IsOrderActiveCanCancel(orderStop1)) CancelOrder(orderStop1);

would be fewer calls, less time locked faster to lock once and iterate through OrdersActive[idx] with a for loop and sending CancelOrder(OrdersActive[ i ]) for each existing member of the list.

yes true -assuming the list does in fact contain all we need etc. i assume thats what Account.Cancel(OrdersActive); does also etc

MicroTrendsTom commented 3 years ago

OrdersActive is also locked in the get accessor does mean a new lock each iteration or just one time? IsOrderActiveCanCancel has no lock?

let me check in the latest greatest so you can see it in fully glory ha

jmscraig commented 3 years ago

do we null orders as the are filled, cancelled or rejected?

If so how about something as fast and simple as if(orderStop1 != null) CancelOrder.. etc.

Followed by the sweeper Account.CancelAllOrders(this.Instrument);

The methods just above are generally what i have done most often.. And it even started being pretty reliable after I added the sweeper call. Ha ha

jmscraig commented 3 years ago

Wow .. there is that random bold again..

MicroTrendsTom commented 3 years ago

if(orderStop1 != null) CancelOrder.. - the cancelorder would perform the same checks internally we are pre-checking my thinking was precheck saves stacking on more calls from the flow etc the sweeper is for realimte doesnt work in backtest etc

MicroTrendsTom commented 3 years ago

ok commit in and its not tested... im doing that now

MicroTrendsTom commented 3 years ago

backtest was ok

jmscraig commented 3 years ago

Cool Good work for the night. I am beat after testing till 3am the other night..

I gotta log off and get some sleep.

Sleep well!