KvdV49 / propforth

Automatically exported from code.google.com/p/propforth

LogicAnalyzer - sampling to hub every 40 cycles can't possibly work #54

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
It's simple numbers. The loop contains a hub operation, which makes the loop 
time a multiple of 16 (16n). 40 isn't, so the two interfere with each other. 
I'd suggest adding a note that the sample time must be a multiple of 16, or 
adding code which makes sure that this is the case. FWIW, the minimum time for 
this sample loop can be 32 cycles.

                mov     _treg2 , cnt
                add     _treg2 , __Dv_LScnt

__1             wrlong  _treg1 , __Fv_LbufP
                waitcnt _treg2 , __Dv_LScnt

Minimum for __Dv_LScnt is 9 + worst case hub op (1st only), i.e. 9 + 23 = 32 
which luckily is also 16n.

Original issue reported on code.google.com by marko.lu...@kyi.biglobe.ne.jp on 14 Sep 2011 at 4:56

GoogleCodeExporter commented 8 years ago
Note that a 32 cycle setup requires specific hub timing, which isn't 
guaranteed when a waitpxx trigger is used. The no-trigger version, however, 
could benefit from that.

Original comment by marko.lu...@kyi.biglobe.ne.jp on 14 Sep 2011 at 5:03

GoogleCodeExporter commented 8 years ago
This limitation is mainly due to the fact that there are only 6 cycles left in 
a 32 cycle loop, i.e. it works 6 out of 16. Using 48 as a minimum avoids that 
trap.

Original comment by marko.lu...@kyi.biglobe.ne.jp on 14 Sep 2011 at 5:20

GoogleCodeExporter commented 8 years ago
Thanks for this analysis, Marko.  
We'll work on this when Sal gets back from his travels.
Should be some time this month. 

Original comment by prof.bra...@gmail.com on 14 Sep 2011 at 8:22

GoogleCodeExporter commented 8 years ago
I am not sure I understand what you are saying, so I will
describe the code and the requirements in hopes that we can
clarify.

The primary requirements for this routine:

1. Sample every n cycles exactly
2. Provide a trigger mechanism for sampling
3. Sample as quickly as possible after the trigger
4. Have the dynamic range of n be as large as possible

So I have annotated the routine with timings, and by my
estimation the fastest I can meet these requirements is
39 cycles, hence the lower limit of 40.

If I understand the spec, the longest wrlong can ever take
is 22 cycles.

So as long as the worst case scenario is accounted for, the code
loop should meet the above requirements.

Now if the requirement is only how fast we can get the loop to run:

The loop time would be 32 clocks. But unfortunately this does not meet
the above requirements, but it may be useful elsewhere.

TIMINGS of original code:

:asm

\ the mask of input bits we care about
v_LTm
0
\ the value of bits before we trigger
v_LTb
0
\ the value of bits we trigger on
v_LTa
0

\ a _sample ( -- )
\

a_sample
\ wait for trigger values
    waitpeq v_LTb , v_LTm
    waitpeq v_LTa , v_LTm

\ get the sample and set up the count for the next sample
\                                    Minimum Time   Maximum Time    Minimum Time   Maximum Time
\
    mov _treg1 , ina
\                          SAMPLE->  4       4      4       4
    mov _treg2 , cnt
\                                    4       8      4       8
    add _treg2 , __Dv_LScnt
\                                    4       12     4       12
__1
\ write out the sample
    wrlong _treg1 , __Fv_LbufP
\                                    7       19     22      34
\                                                                   7       19     22      34
\ wait for the next sample time
    waitcnt _treg2 , __Dv_LScnt
\                                    5       24     5       39 +
\                                                                   5       24     22      39 +
    mov _treg1 , ina
\                                                      SAMPLE ->    4       4      4       4
    add __Fv_LbufP , # 4
\                                                                   4       8      4       8
    djnz __Ev_LSz , # __1
\                                                                   4       12     4       12

\ we are done
    jnext
\ a pointer to the sample buffer
v_LbufP
__F
0
\ the sample buffer size
__E
v_LSz
0

\ the number of clock between samples (decimal 40 min)
__D
v_LScnt
0

;asm

Original comment by salsa...@gmail.com on 2 Oct 2011 at 10:18

GoogleCodeExporter commented 8 years ago
OK, first of all, apologies! Unfortunately I don't have the original test code 
anymore but my best bet is that the initial waitcnt setup is crucial and it 
probably wasn't in this context. I know for a fact that I had lockups for 
everything not 16n.

Regardless, 40 cycles isn't the actual safe minimum (it locks up 2 out of 16 
times).

I reviewed the timing and all we have to do is raise the minimum to 41 which is 
the 39 from your calculations +1 (hubops are 8..23) and +1 (waitcnt is 6+).

Original comment by marko.lu...@kyi.biglobe.ne.jp on 3 Oct 2011 at 2:08

GoogleCodeExporter commented 8 years ago
I have checked the specs and the data sheet, wrlong is spec'd at 7-22 and 
waitcnt is spec'd at 5+, in version 1.1, when I wrote the code.

I have never had release versions of LogicAnalyzer lock up, and it is a tool I 
use a lot.

But the chips/boards I am using for test are 3+ years old. I note more recent 
docs Manual 1.2 and spec 1.4 have changed timings. Was the change an errata or 
an update to reflect more recent steppings?

I agree the minimum timings have to change. This is easy enough to change. But I 
want to do some reasonable analysis and testing to ensure it is reliable.

Will update in 5.0.

Do you agree with the updated timing analysis?

BTW thank you for pointing this out.

So the updated timing analysis should be:

:asm

\ the mask of input bits we care about
v_LTm
0
\ the value of bits before we trigger
v_LTb
0
\ the value of bits we trigger on
v_LTa
0

\ a _sample ( -- )
\

a_sample
\ wait for trigger values
    waitpeq v_LTb , v_LTm
    waitpeq v_LTa , v_LTm

\ get the sample and set up the count for the next sample
\                                    Minimum Time   Maximum Time    Minimum Time   Maximum Time
\
    mov _treg1 , ina
\                          SAMPLE->  4       4      4       4
    mov _treg2 , cnt
\                                    4       8      4       8
    add _treg2 , __Dv_LScnt
\                                    4       12     4       12
__1
\ write out the sample
    wrlong _treg1 , __Fv_LbufP
\                                    8       20     23      35
\                                                                   8       20     23      35
\ wait for the next sample time
    waitcnt _treg2 , __Dv_LScnt
\                                    6       26     6       41 +
\                                                                   6       26     23      41 +
    mov _treg1 , ina
\                                                      SAMPLE ->    4       4      4       4
    add __Fv_LbufP , # 4
\                                                                   4       8      4       8
    djnz __Ev_LSz , # __1
\                                                                   4       12     4       12

\ we are done
    jnext
\ a pointer to the sample buffer
v_LbufP
__F
0
\ the sample buffer size
__E
v_LSz
0

\ the number of clock between samples (decimal 40 min)
__D
v_LScnt
0

;asm

Original comment by salsa...@gmail.com on 3 Oct 2011 at 3:55

GoogleCodeExporter commented 8 years ago
> I have checked the specs and the data sheet, wrlong is spec'd at 7-22 and 
> waitcnt is spec'd at 5+, in version 1.1, when I wrote the code.

Those timings were never correct. After more than 2 years of chewing Parallax's 
ears I managed to convince them to update the documentation.

> I have never had release versions of LogicAnalyzer lock up, and it is a tool 
> I use a lot.

It's tricky but easy enough to reproduce.

        mov     _treg1 , ina
        mov     _treg2 , cnt            '  -8                               (%%)
        add     _treg2 , __Dv_LScnt     '  -4                               (%%)

        wrlong  _treg1 , __Fv_LbufP     '  +0 = perfect alignment           (%%)
        waitcnt _treg2 , __Dv_LScnt     '  +8   %% consumes __Dv_LScnt + 5  (%%)
        mov     _treg1 , ina            ' +37
        add     __Fv_LbufP , # 4        ' +41
        djnz    __Ev_LSz , # __1        ' +45

        wrlong  _treg1 , __Fv_LbufP     ' +64 = 15 idle cycles (waiting for window)
        waitcnt _treg2 , __Dv_LScnt     ' +72

The first waitcnt sees the match at 34 (37-3); the second could see one at 75 
(72+3) at the earliest. Unfortunately that's one cycle late, since its target 
is 34 + 40 = 74. As I said, it's 2 in 16, so it may well work for you without 
you noticing that there is a glitch.
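
The trace above can be replayed for every possible hub alignment with a small model. This is only a rough sketch, not the test program mentioned further down. It assumes the timings quoted in this thread: a hub op costs 8 cycles plus 0..15 cycles of waiting for the cog's window, and waitcnt needs 3 cycles before it can see a match and 3 more after it. The function name and constants are made up for illustration.

```python
HUB_PERIOD = 16      # a cog's hub window comes around every 16 clocks
HUB_MIN = 8          # wrlong cost when it hits the window perfectly
LOOP_BODY = 12       # mov ina + add + djnz, 4 clocks each
WAITCNT_LATENCY = 3  # clocks before waitcnt can see a match, and after the match


def failing_phases(delay, iterations=64):
    """Return the hub phases (match time mod 16) for which a waitcnt misses its target."""
    bad = []
    for phase in range(HUB_PERIOD):
        match = phase  # time at which the previous waitcnt matched
        for _ in range(iterations):
            wrlong_start = match + WAITCNT_LATENCY + LOOP_BODY
            idle = (-wrlong_start) % HUB_PERIOD   # clocks spent waiting for the hub window
            waitcnt_start = wrlong_start + HUB_MIN + idle
            if waitcnt_start + WAITCNT_LATENCY > match + delay:
                bad.append(phase)                 # the target count has already passed
                break
            match += delay
    return bad


for delay in (39, 40, 41, 42):
    print(delay, len(failing_phases(delay)), "of 16 phases fail")
```

Under these assumptions a delay of 40 fails for 2 of the 16 alignments (phases 2 and 10; the trace above is the phase 2 case, 34 mod 16 = 2), a delay of 39 fails for all of them, and 41 or more never fails.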

> Do you agree with the updated timing analysis?

Looks OK (formatting is a bit odd). I would have done it differently, the loop 
- not counting exit - is basically

  waitcnt
  12 cycles
  hubop

which amounts to worst case 6 + 12 + 23 = 41 so that the next waitcnt doesn't 
stall. The entry condition is

  sample cnt
  add delay
  hubop
  waitcnt

which requires 8 + 23 + 6 = 37 to be safe. Which leaves us with 41. I have a 
test program here which'll show you the above error case.

> BTW thank you for pointing this out.

No problem. I still spent this Monday being embarrassed :)

Original comment by marko.lu...@kyi.biglobe.ne.jp on 3 Oct 2011 at 4:31

GoogleCodeExporter commented 8 years ago
FWIW, fastest single cog ina-to-hub transfer these days is 16 cycles (as much 
as there is space in hub). Makes trigger based sampling a bit tricky though 
since you're tied to the hub window. HTH

Original comment by marko.lu...@kyi.biglobe.ne.jp on 3 Oct 2011 at 4:44

GoogleCodeExporter commented 8 years ago
> ... (formatting is a bit odd).

I think I get it now. Left is the entry condition, right side is the loop 
behaviour. In that case the 41+ left is a bit misleading since the initial 4 
cycles for sampling ina are not relevant re: timing constraints. That said, I 
don't know what you're trying to express here. If it's just the time line then I 
agree. OTOH, if you're after the constraints then it probably should be made 
more clear, i.e. the 4 instructions starting from mov _treg2, cnt up to and 
including waitcnt consume __Dv_LScnt + 5 cycles. 14 (min) of which are consumed 
by setup and waitcnt (4+4+6) which leaves the remainder for the hubop, e.g. 
delay = 41, total runtime 46 cycles, hubop has a huge 32 cycle window (more 
than enough).

Original comment by marko.lu...@kyi.biglobe.ne.jp on 3 Oct 2011 at 5:06

GoogleCodeExporter commented 8 years ago
I am currently working on porting everything to 5.0, and LogicAnalyzer is on my 
list in the next few weeks. I am going to add a regression test. 

Generate an nn cycle square wave on pin xx
Run the different sample routines
Do an in-memory check of the square wave to make sure there are no phase 
errors (or any other errors).

By running this test for each sample routine, we should be able to verify that the 
sample routines are working properly at all sample rates. By sampling at the 
signal frequency (which is really undersampling) or any integer multiple of the 
frequency, a phase error should show up if there are any problems.
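
The in-memory check described above could look roughly like this. It is a sketch under assumptions, not the planned 5.0 regression test: the function name, the pin number and the synthetic buffers are invented for illustration, and a timing slip is imitated simply by dropping one sample.

```python
def phase_errors(samples, pin, samples_per_period):
    """Return indices whose pin level differs from the sample one signal period earlier."""
    mask = 1 << pin
    levels = [(s & mask) != 0 for s in samples]
    return [i for i in range(samples_per_period, len(levels))
            if levels[i] != levels[i - samples_per_period]]


# Synthetic buffers: a square wave on pin 3, sampled twice per signal period.
clean = [(i % 2) << 3 for i in range(20)]
slipped = clean[:10] + clean[11:]   # one sample dropped to imitate a timing slip

print(phase_errors(clean, 3, 2))     # no mismatches expected
print(phase_errors(slipped, 3, 2))   # mismatches start where the slip happened
```

When sampling exactly at the signal frequency, samples_per_period is 1 and the check reduces to "every sample shows the same level".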

I did this manually in the early versions, but it never made it into the test 
cycle, and it did not make the 4.6 regressions.

Will work with Doug to get an updated LogicAnalyzer out in the shorter term.

Original comment by salsa...@gmail.com on 3 Oct 2011 at 1:41

GoogleCodeExporter commented 8 years ago
Some empirical results:

Set up a test to sample continuously into main memory with the above routine, 
3000 samples at a time.

At a sample interval of 39 cycles it fails immediately.

At a sample interval of 40 cycles it eventually fails ( I heated the propeller 
with a hair dryer and it would start to fail usually within 5 minutes - did not 
have an accurate way to measure temp, but it was hot).

At a sample interval of 42 cycles I have not yet seen a failure.

Original comment by salsa...@gmail.com on 5 Oct 2011 at 2:19

GoogleCodeExporter commented 8 years ago
How did you trigger the sample start?

Original comment by marko.lu...@kyi.biglobe.ne.jp on 5 Oct 2011 at 2:45

GoogleCodeExporter commented 8 years ago
I wrote specialized code in 5.0, generated a signal, and sampled it.

Will make sure it makes it into the 5.0 regression.

la in 5.0 now has an added routine, which can sample from 18 clocks up. If you 
would do a deep dissection of these, I would be very happy to get your feedback.

\ this variable which is used to synch the 4 cogs which are doing interleaved
\ sampling
variable _la_s1time

wvariable _la_s1addr

\ this variable value is the first of 4 sequential cogs used to sample every
\ clock cycle, default to this cog 2

wvariable _la_s1cog 2 _la_s1cog W!
wvariable _la_s1size

\ _la_is1 ( -- ) this is the interleaved routine used by 4 cogs to sample every
\ clock cycle
: _la_is1
    _la_s1addr W@ cogid _la_s1cog W@ - 2* 2* +
    _la_s1time L@ cogid _la_s1cog W@ - +
    _la_asample1
    2* 2* _la_s1size W!
;

\ _la_sample1 ( baseaddr numsamples samplecycle triggerbefore triggerafter triggermask -- numsamples)
: _la_sample1
    3drop 2drop
    _la_s1addr W!
    _la_s1cog W@ dup cogreset
    1+ dup cogreset
    1+ dup cogreset
    1+ cogreset
    zeroFreeDict
    x10 delms
    clkfreq cnt COG@ + _la_s1time L!
    c" _la_is1"
    dup _la_s1cog W@ cogx
    dup _la_s1cog W@ 1+ cogx
    dup _la_s1cog W@ 2+ cogx
    _la_s1cog W@ x3 + cogx
    _la_s1time L@ x8000 + 0 waitcnt drop
    _la_s1size W@
;

\ _la_sample ( baseaddr numsamples samplecycle triggerbefore triggerafter triggermask -- numSamples )
: _la_sample
    x4 ST@ x310 >=
    if
        x3 ST@ x29 >=
        if
            _la_asample41+
        else

            x3 ST@ x12 >=
            if
                _la_asample18+
            else

                x3 ST@ x4 =
                if
                    _la_asample4
                else

                    x3 ST@ 1 =
                    if
                        _la_sample1
                    else

                        3drop 3drop 0
                    then
                then
            then
        then
    else
        3drop 3drop 0
    then
;

\ _la_asample41+ ( baseaddr numsamples samplecycle triggerbefore triggerafter triggermask -- numsamples)
build_BootOpt :rasm
\ trigger mask
    mov $C_treg6 , $C_stTOS
    spop
\ trigger after
    mov $C_treg5 , $C_stTOS
    spop
\ trigger before
    mov $C_treg4 , $C_stTOS
    spop

\ sample cycle
    mov $C_treg3 , $C_stTOS
    spop

\ num samples
    mov $C_treg2 , $C_stTOS
    spop

\ base address - $C_treg1
    mov $C_treg1 , $C_stTOS

    mov $C_stTOS , $C_treg2
\    
\ $C_treg1 - baseaddr
\ $C_treg2 - numsamples
\ $C_treg3 - samplecycle
\ $C_treg4 - triggerbefore
\ $C_treg5 - triggerafter
\ $C_treg6 - triggermask
\
\
\ wait for trigger
\
    waitpeq $C_treg4 , $C_treg6
    waitpeq $C_treg5 , $C_treg6
\
\ get the sample and set up the count for the next sample 
\
\
\                                               t = 0
    mov $C_treg6 , ina
\                                               t = 4
    mov $C_treg5 , cnt
\                                               t = 8
    add $C_treg5 , $C_treg3
\
\ $C_treg1 - baseaddr
\ $C_treg2 - numsamples
\ $C_treg3 - samplecycle
\ $C_treg4 - triggerbefore
\ $C_treg5 - nextcounttosample
\ $C_treg6 - current sample
\
__1
\
\ write out the sample
\
\                                               t = 12
    wrlong  $C_treg6 , $C_treg1
\
\ wait for the next sample time
\
\                                               t = 20 - 35
    waitcnt $C_treg5 , $C_treg3
\                                               t = 26 - 41
\                                                            t = 0
    mov $C_treg6 , ina  
\                                                            t = 4
    add $C_treg1 , # 4
\                                                            t = 8
    djnz    $C_treg2 , # __1
\                                                            t = 12

\ we are done
    jexit
\
;asm _la_asample41+

\ _la_asample18+ ( baseaddr numsamples samplecycle triggerbefore triggerafter triggermask -- numsamples )
build_BootOpt :rasm
\ trigger mask
    mov $C_treg6 , $C_stTOS
    spop
\ trigger after
    mov $C_treg5 , $C_stTOS
    spop
\ trigger before
    mov $C_treg4 , $C_stTOS
    spop

\ sample cycle
    mov $C_treg3 , $C_stTOS
    spop

\ num samples
\   mov $C_treg2 , $C_stTOS
    spop

\ base address - $C_treg1
    mov $C_treg1 , $C_stTOS

    mov $C_treg2 , # par
    sub $C_treg2 , # __buffer
    mov $C_stTOS , $C_treg2
\    
\ $C_treg1 - baseaddr
\ $C_treg2 - numsamples
\ $C_treg3 - samplecycle
\ $C_treg4 - triggerbefore
\ $C_treg5 - triggerafter
\ $C_treg6 - triggermask
\
\
\ wait for trigger
\
    waitpeq $C_treg4 , $C_treg6
    waitpeq $C_treg5 , $C_treg6
\
\ get the sample and set up the count for the next sample 
\
\
\                                               t = 0
    mov __buffer , ina
\                                               t = 4
    mov $C_treg5 , cnt
\                                               t = 8
    add $C_treg5 , $C_treg3
\
\ $C_treg1 - baseaddr
\ $C_treg2 - numsamples
\ $C_treg3 - samplecycle
\ $C_treg4 - triggerbefore
\ $C_treg5 - nextcounttosample
\ $C_treg6 - current sample
\
__1
\
\ wait for the next sample time
\
\                                               t = 12
    waitcnt $C_treg5 , $C_treg3
\                                               t = 18
\                                                            t = 0
__2
    mov __buffer1 , ina 
\                                                            t = 4
    add __2 , $C_fDestInc
\                                                            t = 8
    djnz    $C_treg2 , # __1
\                                                            t = 12

    mov $C_treg2 , $C_stTOS
__3
    wrlong  __buffer , $C_treg1

    add __3 , $C_fDestInc
    add $C_treg1 , # 4

    djnz    $C_treg2 , # __3

\ we are done
    jexit
__buffer
 0
__buffer1
 0
\
;asm _la_asample18+

\ _la_asample4 ( baseaddr numsamples samplecycle triggerbefore triggerafter triggermask -- numsamples )
build_BootOpt :rasm
\ trigger mask
    mov $C_treg6 , $C_stTOS
    spop
\ trigger after
    mov $C_treg5 , $C_stTOS
    spop
\ trigger before
    mov $C_treg4 , $C_stTOS
    spop

\ sample cycle
    spop

\ num samples
    spop

\ base address - $C_treg1
    mov $C_treg1 , $C_stTOS

    mov $C_treg2 , # par
    sub $C_treg2 , # 1
    movd    __3 , $C_treg2

    sub $C_treg2 , # __buffer
    mov $C_stTOS , $C_treg2

__1
    mov __buffer , __inainst
__2
    movd    __buffer , # __buffer
    add __1 , $C_fDestInc
    add __2 , $C_fDestInc
    add __2 , # 1
    djnz    $C_treg2 , # __1

    movs    __jmpinst , # __4
__3
    mov 0 , __jmpinst

    jmp # __sample

__4

    mov $C_treg2 , $C_stTOS
__5
    wrlong  __buffer , $C_treg1

    add __5 , $C_fDestInc
    add $C_treg1 , # 4

    djnz    $C_treg2 , # __5

\ we are done
    jexit

__inainst
    xA0BC01F2
__jmpinst
    x5C7C0000
\
\ wait for trigger
\
__sample
    waitpeq $C_treg4 , $C_treg6
    waitpeq $C_treg5 , $C_treg6
__buffer
 0

\
;asm _la_asample4

\ _la_asample1 ( baseaddr startcount -- numsamples )
build_BootOpt :rasm
\ startcount
    mov $C_treg6 , $C_stTOS
    spop

\ base address
    mov $C_treg1 , $C_stTOS

    mov $C_treg2 , # par
    sub $C_treg2 , # 1
    movd    __3 , $C_treg2

    sub $C_treg2 , # __buffer
    mov $C_stTOS , $C_treg2

__1
    mov __buffer , __inainst
__2
    movd    __buffer , # __buffer
    add __1 , $C_fDestInc
    add __2 , $C_fDestInc
    add __2 , # 1
    djnz    $C_treg2 , # __1

    movs    __jmpinst , # __4
__3
    mov 0 , __jmpinst

    jmp # __sample

__4

    mov $C_treg2 , $C_stTOS
__5
    wrlong  __buffer , $C_treg1

    add __5 , $C_fDestInc
    add $C_treg1 , # x10

    djnz    $C_treg2 , # __5

\ we are done
    jexit

__inainst
    xA0BC01F2
__jmpinst
    x5C7C0000
\
\ wait for trigger
\
__sample
    waitcnt $C_treg6 , # 0
__buffer
 0

\
;asm _la_asample1

Original comment by salsa...@gmail.com on 12 Oct 2011 at 6:31

GoogleCodeExporter commented 8 years ago
Mostly cosmetic stuff:

_la_asample1:
    mov $C_treg2 , # par
    sub $C_treg2 , # 1

can be combined as well as

    add __2 , $C_fDestInc
    add __2 , # 1

If you don't have space for a $C_fDestIncSrcInc then make sure carry is set and 
use

    addx    __2 , $C_fDestInc

This can be achieved by adding wc here (__inainst has s[31] set)

__1
    mov __buffer , __inainst wc

Some (if not all) of the above also applies to _la_asample4 (instruction 
combining).
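
To see why the wc/addx trick works, the __inainst literal can be picked apart by hand. The snippet below is only an illustration: it assumes $C_fDestInc holds 0x200 (a 1 in the destination field, bits 17..9), which the name suggests but this thread does not state, and the helper names are invented.

```python
INA_INST = 0xA0BC01F2   # the __inainst literal from _la_asample1
DEST_INC = 0x200        # ASSUMED value of $C_fDestInc: 1 in the destination field


def fields(instr):
    """Split a 32-bit Propeller instruction into opcode, destination and source fields."""
    return {
        "opcode": instr >> 26,
        "dest": (instr >> 9) & 0x1FF,
        "src": instr & 0x1FF,
    }


def addx(d, s, carry):
    """Model of addx: d + s + C, truncated to 32 bits."""
    return (d + s + carry) & 0xFFFFFFFF


carry = INA_INST >> 31   # mov ... wc copies bit 31 of the source value into C
print(fields(INA_INST), carry)
print(hex(addx(0, DEST_INC, carry)))   # same as add DEST_INC followed by add # 1
```

The literal decodes to a mov (opcode %101000) with destination 0 and source $1F2 (ina), and its top bit is 1, so with wc on that mov a single addx bumps both the destination and the source field of __2.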

Original comment by marko.lu...@kyi.biglobe.ne.jp on 14 Oct 2011 at 2:04

GoogleCodeExporter commented 8 years ago
41+ and 18+ look OK.

Original comment by marko.lu...@kyi.biglobe.ne.jp on 14 Oct 2011 at 2:07

GoogleCodeExporter commented 8 years ago
Thanks for the review, will do a cleanup pass for release. fDestInc is used by 
the forth core interpreter.

Will do a combine pass, as for 18+ and 41+ every instruction occupies space 
that could hold a sample.

Original comment by salsa...@gmail.com on 14 Oct 2011 at 1:31

GoogleCodeExporter commented 8 years ago
SPECS CHANGED after code was written, 
This fixed something that would have been lost

Original comment by prof.bra...@gmail.com on 21 Dec 2011 at 4:55