Samraksh / eMote

eMote OS -- Multiple Ports (using .NET MF v4.3)
0 stars 0 forks source link

VT spam #284

Closed Nathan-Stohs closed 9 years ago

Nathan-Stohs commented 9 years ago

Running Chris' app in #282 (with 800ms timers instead of 100ms), I logged all SetCompare() requests at the timer low-level to memory from start of TinyCLR (no TinyBooter) until the program gets deadlocked in the execution loop. Then I dumped the array with GDB. This was in debug mode with -O2.

I don't know if this is related to #282 , I think it is, but either way this needs attention. Data Formatting is difficult in this medium, I have a spreadsheet if desired.

First column is the actual compare time requested. Second column is difference from last time requested. Last column is the same delta but in milliseconds.

Unfortunately I did not log the current time which could have been illuminating.

Time starts at 0, so all of these requests for times of 4 billion+ are pure garbage. Actually, only 5 of the 317 calls are directly attributable to the program (C# user timers). The rest is junk, minus perhaps a few OS requests.

I was running the logic analyser at the same time, and I counted 5 transitions on the IO. There are exactly 5 lines similar to:

36853460    6225947 778.24

Which I assume are the 800ms user timers, and the only timers that should be running.

It is extremely difficult to debug other problems until this mess is taken care of, and this is almost certainly where our performance bleed is coming from.

After the last timer, no further timer requests are made. It is unknown what happens at the end, if we lose a callback, lose a time, or we never set it up for other reasons. When we deadlock, it is in the C# interpreter, not in sleep or otherwise apparently involving the timer. Difficult to say what its waiting for.

I have not looked at other optimization levels. Likely they look different.

4294967438  4294967438  
4295033065  65627   8.20
4295098692  65627   8.20
4295164319  65627   8.20
4295229946  65627   8.20
4295295573  65627   8.20
4295361200  65627   8.20
4295426827  65627   8.20
4295492456  65629   8.20
4295558083  65627   8.20
4295623710  65627   8.20
4295689337  65627   8.20
4295754964  65627   8.20
4295820593  65629   8.20
4295886220  65627   8.20
4295951847  65627   8.20
4296017474  65627   8.20
4296083101  65627   8.20
4296148730  65629   8.20
4296214357  65627   8.20
4296279984  65627   8.20
4296345611  65627   8.20
4296411238  65627   8.20
4296476867  65629   8.20
4296542494  65627   8.20
4296608121  65627   8.20
4296673748  65627   8.20
4296739375  65627   8.20
4296805004  65629   8.20
4296870631  65627   8.20
4296936258  65627   8.20
4297001885  65627   8.20
4297067512  65627   8.20
4297133142  65630   8.20
4297198769  65627   8.20
4297264396  65627   8.20
4297330031  65635   8.20
4297395658  65627   8.20
4297461285  65627   8.20
4297526912  65627   8.20
4297592539  65627   8.20
4297658168  65629   8.20
4297723795  65627   8.20
4297789422  65627   8.20
4297855049  65627   8.20
4297920676  65627   8.20
4297986305  65629   8.20
4298051932  65627   8.20
4298117559  65627   8.20
4298183186  65627   8.20
4298248813  65627   8.20
4298314442  65629   8.20
4298380069  65627   8.20
4298445696  65627   8.20
4298511323  65627   8.20
4298576950  65627   8.20
4298642579  65629   8.20
4298708206  65627   8.20
4298773833  65627   8.20
4298839460  65627   8.20
4298905087  65627   8.20
4298970716  65629   8.20
4299036343  65627   8.20
4299101970  65627   8.20
4299167597  65627   8.20
4299233224  65627   8.20
4299298853  65629   8.20
4299364480  65627   8.20
4299430107  65627   8.20
4299495734  65627   8.20
4299561361  65627   8.20
4299626991  65630   8.20
4299692618  65627   8.20
4299758245  65627   8.20
4299823880  65635   8.20
4299889507  65627   8.20
4299955134  65627   8.20
4300020761  65627   8.20
4300086388  65627   8.20
4300152017  65629   8.20
4300217644  65627   8.20
4300283271  65627   8.20
4300348898  65627   8.20
4300414525  65627   8.20
4300480154  65629   8.20
4300545781  65627   8.20
4300611408  65627   8.20
4300677035  65627   8.20
4300742662  65627   8.20
4300808291  65629   8.20
4300873918  65627   8.20
4300939545  65627   8.20
4301005172  65627   8.20
4301070799  65627   8.20
4301136428  65629   8.20
4301202055  65627   8.20
4301267682  65627   8.20
4301333309  65627   8.20
4301398936  65627   8.20
4301464565  65629   8.20
4301530192  65627   8.20
4301595819  65627   8.20
4301661446  65627   8.20
4301727073  65627   8.20
4301792704  65631   8.20
4301858331  65627   8.20
4301923958  65627   8.20
4301989593  65635   8.20
4302055220  65627   8.20
4302120847  65627   8.20
4302186474  65627   8.20
4302252101  65627   8.20
4302317730  65629   8.20
4302383357  65627   8.20
4302448984  65627   8.20
4302514611  65627   8.20
4302580238  65627   8.20
4302645867  65629   8.20
4302711494  65627   8.20
4302777121  65627   8.20
4302842748  65627   8.20
4302908375  65627   8.20
4302974004  65629   8.20
4303039631  65627   8.20
4303105258  65627   8.20
4303170885  65627   8.20
4303236512  65627   8.20
4303302141  65629   8.20
4303367768  65627   8.20
4303433395  65627   8.20
4303499022  65627   8.20
4303564649  65627   8.20
4303630278  65629   8.20
4303695905  65627   8.20
4303761532  65627   8.20
4303827159  65627   8.20
4303892786  65627   8.20
4303958416  65630   8.20
4304024043  65627   8.20
4304089670  65627   8.20
4304155297  65627   8.20
4304220924  65627   8.20
4304286553  65629   8.20
4304352180  65627   8.20
4304417807  65627   8.20
4304483434  65627   8.20
4304549061  65627   8.20
4304614690  65629   8.20
4304680317  65627   8.20
4304745944  65627   8.20
4304811571  65627   8.20
4304877198  65627   8.20
4304942827  65629   8.20
4305008454  65627   8.20
4305074081  65627   8.20
4305139708  65627   8.20
4305205335  65627   8.20
4305270964  65629   8.20
4305336591  65627   8.20
4305402218  65627   8.20
4305467845  65627   8.20
4305533472  65627   8.20
4305599101  65629   8.20
4305664728  65627   8.20
4305730355  65627   8.20
4305795982  65627   8.20
4305861609  65627   8.20
4305927238  65629   8.20
4305992865  65627   8.20
4306058492  65627   8.20
4306124119  65627   8.20
4306189746  65627   8.20
4306255376  65630   8.20
4306321003  65627   8.20
4306386630  65627   8.20
4306452265  65635   8.20
4306517892  65627   8.20
4306583519  65627   8.20
4306649146  65627   8.20
4306714773  65627   8.20
4306780402  65629   8.20
4306846029  65627   8.20
4306911656  65627   8.20
4306977283  65627   8.20
4307042910  65627   8.20
4307108539  65629   8.20
4307174166  65627   8.20
4307239793  65627   8.20
4307305420  65627   8.20
4307371047  65627   8.20
4307436676  65629   8.20
4307502303  65627   8.20
4307567930  65627   8.20
4307633557  65627   8.20
4307699184  65627   8.20
4307764813  65629   8.20
4307830440  65627   8.20
4307896067  65627   8.20
4307961694  65627   8.20
4308027321  65627   8.20
4308092950  65629   8.20
4308158577  65627   8.20
4308224204  65627   8.20
4308289831  65627   8.20
4308355458  65627   8.20
4308421087  65629   8.20
4308486714  65627   8.20
4308552341  65627   8.20
4308617968  65627   8.20
4308683595  65627   8.20
4308749225  65630   8.20
4308814852  65627   8.20
4308880479  65627   8.20
4308946114  65635   8.20
4309011741  65627   8.20
4309077368  65627   8.20
4309142995  65627   8.20
4309208622  65627   8.20
4309274251  65629   8.20
4309339878  65627   8.20
4309405505  65627   8.20
4309471132  65627   8.20
4309536759  65627   8.20
4309602388  65629   8.20
4309668015  65627   8.20
4309733642  65627   8.20
4309799269  65627   8.20
4309864896  65627   8.20
4309930525  65629   8.20
4309996152  65627   8.20
4310061779  65627   8.20
4310127406  65627   8.20
4310193033  65627   8.20
4310258662  65629   8.20
4310324289  65627   8.20
4310389916  65627   8.20
4310455543  65627   8.20
4310521170  65627   8.20
4310586799  65629   8.20
4310652426  65627   8.20
4310718053  65627   8.20
4310783680  65627   8.20
4310849307  65627   8.20
4310914938  65631   8.20
4310980644  65706   8.21
4311046271  65627   8.20
4311111898  65627   8.20
4311177550  65652   8.21
4311243177  65627   8.20
4311308804  65627   8.20
4311374431  65627   8.20
4311440058  65627   8.20
4311505685  65627   8.20
4311571312  65627   8.20
4311636939  65627   8.20
4311702612  65673   8.21
4311768239  65627   8.20
4311833866  65627   8.20
4311899518  65652   8.21
4311965145  65627   8.20
4312030772  65627   8.20
4312096399  65627   8.20
4312162026  65627   8.20
4312227653  65627   8.20
4312293280  65627   8.20
4312358907  65627   8.20
4312424534  65627   8.20
4312490161  65627   8.20
4312555788  65627   8.20
17752108    -4294803680 -536850.46
4312555788  4294803680  536850.46
17815169    -4294740619 -536842.58
24053492    6238323 779.79
4312555788  4288502296  536062.79
24053488    -4288502300 -536062.79
24053492    4   0.00
4312555788  4288502296  536062.79
24062433    -4288493355 -536061.67
24062429    -4  0.00
4312555788  4288493359  536061.67
24223541    -4288332247 -536041.53
30457468    6233927 779.24
4312555788  4282098320  535262.29
30457468    -4282098320 -535262.29
30457468    0   0.00
4312555788  4282098320  535262.29
30466409    -4282089379 -535261.17
30466405    -4  0.00
4312555788  4282089383  535261.17
30627513    -4281928275 -535241.03
36853460    6225947 778.24
4312555788  4275702328  534462.79
36853456    -4275702332 -534462.79
36853460    4   0.00
4312555788  4275702328  534462.79
36862401    -4275693387 -534461.67
36862397    -4  0.00
4312555788  4275693391  534461.67
37023509    -4275532279 -534441.53
43257436    6233927 779.24
4312555788  4269298352  533662.29
43257436    -4269298352 -533662.29
43257436    0   0.00
4312555788  4269298352  533662.29
43266377    -4269289411 -533661.18
43266373    -4  0.00
4312555788  4269289415  533661.18
43427481    -4269128307 -533641.04
49653428    6225947 778.24
4312555788  4262902360  532862.80
49653424    -4262902364 -532862.80
49653428    4   0.00
4312555788  4262902360  532862.80
49662369    -4262893419 -532861.68
49662365    -4  0.00
4312555788  4262893423  532861.68

EDIT: This was in master on 4eb63cb926feb01946afa07515e6d8438a45af7b

MichaelAtSamraksh commented 9 years ago

I think developers who has examined the Timer subsystem are frustrated. I know that I have been frustrated ever since I was forced to debug it while looking into something else.

Please do me a big favor and use full function names and at least short file names, otherwise I don't know what the parent post references. For instance, do you mean:

Thank you for referencing the commit used for the tests.

Nathan-Stohs commented 9 years ago

Please do me a big favor and use full function names and at least short file names, otherwise I don't know what the parent post references. For instance, do you mean:

Point taken. I meant STM32F10x_AdvancedTimer::SetCompare(compareValue) (the third bullet)

AnanthAtSamraksh commented 9 years ago

The commit that started the spam seems to be this -- 3d27579. I am looking into the cause and a possible fix. Right now I am not sure what the cause is, but just wanted to update everyone.

MichaelAtSamraksh commented 9 years ago

I may have already fixed it in the bugUninit branch. Do you still see it in the bugUninit branch? In order to solve #243 'Programs with COM2 and with CSMA stop', I had to work through a lot of virtual timer issues last night.

AnanthAtSamraksh commented 9 years ago

The bug is in STM32F10x_AdvancedTimer::SetCompare in file netmf_advancedtimer.cpp. After TIM2 is set with the upper 16 bits of target, there is a check to verify if TIM2 was missed. This check fails for the very first time when VIRT_TIMER_TIME is set with a compareValue of 2^32. So TIM1 is set with the lower 16 bits of target and this causes the 8 ms timer spam that lasts for about 2.25 seconds. The fix for this would be to set TIM1 only if TIM2->CNT is not zero. However, I am not sure if this check is robust enough. Btw, the bug exists in the bugUninit branch also.

Nathan-Stohs commented 9 years ago
Nathan-Stohs commented 9 years ago

SPECULATION:

Actually I have a guess. Is the VT passing down compare values that are actually >= 2^32, instead of passing down at most MAX_TICKS or whatever (0xFFFFFFFF)?

Furthermore I'm guessing that the MAX_TICKS (the large default idle value) that the VT passes to SetCompare is actually somehow MAX_TICKS + HAL_CurrentTicks() and so the time % 32-bit wraps around to the current time, and make the driver think that "large time in the future" is really "now".

This goes back to how VT is schizophrenic on if the adtim is 32-bit or 64-bit and tries to do both, and who actually keeps track of the "real" 64-bit time and 32-bit overflows. Specifically, here the VT Is wrapping time at 32-bit but expecting a 64-bit compare to work, so we have mismatch.

To fix, do one of:

ChrisAtSamraksh commented 9 years ago

Yes, the VT does pass a 64-bit time instead of MAX_TICKS. At the beginning of STM32F10x_AdvancedTimer::SetCompare we check whether the 64-bit time we pass is less than "now" which is a 32-bit time. There will be problems at that point once the passed time is larger than a 32-bit max value. That might be the problem I am tracking down right now different than the VT spam issue.

The code isn't supposed to ever expect a 64-bit time to work. The code (currently) expects the upper 32-bits to be ignored after checking whether or not we need the "TIME_CUSHION".

ChrisAtSamraksh commented 9 years ago

I also bet "TIME_CUSHION" is the time of one of the VT spam.

I'm testing a fix now for the VT spam and my other timer issue that occurs after the 32-bit timer wrap around.

Nathan-Stohs commented 9 years ago

The "epoch" code in the adtim that I have up in the pull request handles all 64-bit in the adtim, but we'd need to get the VT on board.

AnanthAtSamraksh commented 9 years ago

Why does the check miss?

When STM32F10x_AdvancedTimer::SetCompare is called with 2^32, the result of this operation is 0.

tar_upper = (compareValue >> 16) & 0xFFFF

TIM2 is then set with 0. The if statement in the check is as follows:

if (TIM2->CNT != tar_upper || TIM_GetITStatus(TIM2,TIM_IT_CC1) == SET )

Both TIM2->CNT and tar_upper are 0 and the status of TIM2 is RESET (not sure why, but found this while debugging. It always is RESET) and the check fails and TIM1 is set.

Why does this happen more than once?

As soon as the first check fails, TIM1 is set, which invokes VTCallBack (from ISR_TIM1 callback), which in turn invokes SetCompare. TIM2->CNT is equal to tar_upper and so the check fails, TIM1 is set again and the cycle repeats.

Nathan-Stohs commented 9 years ago

When STM32F10x_AdvancedTimer::SetCompare is called with 2^32, the result of this operation is 0.

Right, and the problem is that the VT, which assumes that the adtim is 32-bit, is passing in 64-bit values (or at least, something greater than 0xFFFFFFFF). So these assumptions are conflicting. See my data dump above, the VT is passing down, for example, 4295033065, which is > 2^32, which doesn't grok with MAX_TICKS == 0xFFFFFFFF

But its actually more sinister than that, its not just passing in 2^32 (or 2^32-1), its passing in 2^32+current time, so the timer doesn't just wrap sometimes, it always wraps.

The adtim, because it is 32-bit, throws away the upper bits, and subtracts 2^32, so we end up with 2^32 + currentTime - 2^32 which of course equals the currentTime, and the adtim driver thinks that means "fire now" and we move on to setting TIM1.

I actually think the adtim driver should be 64-bit, but I don't think I'm going to get my way there, at least not right now. But the important thing is that the VT and the adtim need to agree on what it is. The current adtim driver is 32-bit, but the VT tries to treat it as both 32-bit and 64-bit, which is causing the failure here.

ChrisAtSamraksh commented 9 years ago

If we want to wait for max time that the 32-bit counter can do (say 0xffff ffff) and current time is 0x0100 0100 then we would set the compare for (0x0100 00ff) right? In that case our current code would think we "missed" and we set the lower for 0x00ff which goes off in 8.19 ms.

Nathan-Stohs commented 9 years ago

If we want to wait for max time that the 32-bit counter can do (say 0xffff ffff) and current time is 0x01000100 then we would set the compare for (0x100 00ff) right? In that case our current code would think we "missed" and we set the lower for 0x00ff which goes off in 8.19 ms.

No, you wouldn't set it to that. You always have to break at 0xFFFFFFFF. Remember, you are only looking at the bottom 32-bits. If the adtim is given a compare request for now-1, it has no way to know if that request is for a timer that missed or a wrap. So again, its a 32/64 bit issue.

ChrisAtSamraksh commented 9 years ago

So an idle value time (a very big number) is passed for us to trigger on. Our time is 0x0100 0100 (looking at only the lower 32-bits of the system time). What do we set the upper 16-bit timer to?

Nathan-Stohs commented 9 years ago

0xFFFF

ChrisAtSamraksh commented 9 years ago

Check ba516cf2d3fe50d12cdd93448c3ace1562b705f3. I believe it fixes the spam problem and a few other bugs.

Nathan-Stohs commented 9 years ago

Looks fine, haven't tried it.

FYI what I believe you have implemented is effectively a 33-bit timer, with the MSB in software.

ChrisAtSamraksh commented 9 years ago

Fixed by aa6ba2f2bf3c8f8cee0da9e46d42803dec6940d0