faster lock, etc - Githubissues

IDC-Dragon commented 1 year ago

No issue, just abusing this for discussion / kicking ideas. Or is there another forum?

First, let me marvel your code contribution. Pico was definitely missing this. I've been loosely considering myself writing an S/PDIF input, but it would have been my first Pico let alone PIO project. Looking at the complexity of yours, a recipe for failure. I still need to understand the PIO code and the sync mechanism. The things I looked at are rather two-pass, first capture 2 bits or duration, then work out the sync and decode. I still lack to understand how you got away with single pass, having the payload directly available, and still detect the biphase violations for sync. (More documentation appreciated.) Here was some discussion: https://forums.raspberrypi.com/viewtopic.php?t=318134

About the idea part:

The current code takes quite a while to sync, trying all sample rates and polarities. (BTW, what time to give per sample rate?) How about having something dedicated for the detection? First, measuring the frequency (if any) of some kind of software PLL/FFT/shortest pulses can give a quick sample rate guess. Not sure about detecting the polarity. Sync has 3 steady bits, but depending on previous bit they can have any polarity?

Edit: Perhaps different PIO code per sample rate is not necessary, if clocked slower by a factor of 2/4? Each state machine even has a fractional divider, so could be fine-tuned.

Edit2: I think the 2nd PIO code for the inverted variant is not necessary, if the input pin polarity can be changed by GPIO setting.

Less latency than a full block of 192 samples would be nice for realtime applications. With 44kHz, that's already >4ms.

In general, if and what further plans do you have with this project? From the commit frequency, it is work in progress?

elehobica commented 1 year ago

Thanks. no problem to discuss here. All I referred to when I started this to know how to synchronize with bit stream was _manchesterrx. (Edit: revised to correct link)

And after 3 steady bits detection was done, manual push when sync detected in stead of auto push allowed the alignment in 32bit output, which led one-pass processing. (before doing this, I've also touched two-pass. I've tried to detect sub frame in second pass and found it was too much expectation against PIO instruction sets.)

It's a good remark to get faster sync. What application do you expect for faster frequency detection? Anyway, I consider fast detection should come with safe detection.

Yes, clock divider 1/2, 1/4 should work to apply 192MHz program to 96MHz and 48MHz if we don't pursue the center timing of symbol to be sampled in lower sampling frequencies. It could be the discussion point because now lower frequency has more timing precision. btw, I found the current _pio_sm_setclkdiv() doesn't have any effect because I located it at wrong position.

Do you know how to invert GPIO for PIO input? If possible, yes, inverted detection code is not needed.

As a first step, I'd like to try just sampling the data pin by dedicated PIO program. I hope it should give enough information to do several analysis like shortest pulse width. Detecting polarity seems something more to do.

IDC-Dragon commented 1 year ago

Hi, thank you for the answer.

Input pin polarity inversion seems not possible, the pad has no such bit. I thought that would be kind of standard...

For the detection, I have a proof of concept working. Measuring pulses with PWM timer and DMA, that's even more accurate than PIO. Inspired by this: https://forums.raspberrypi.com/viewtopic.php?t=337644#p2021641 The rising edges give DREQ for a DMA capturing a free running full speed timer. The difference between the captured 16bit words is the time between rising edges (not the pulse width).

I can also determine the polarity. The trick is that only the positive M preamble has such a pattern: (0)11100010(1). The two bold '1' are 6 clocks apart, the furthest rising edges we can get. Inverted, the maximum is 5, across all preamble situations. (I made an Excel sheet with bit patterns to review it.)

Edit: Having measured the bit rate rather accurately this way, and fractional scaling the PIO-clock, the PIO reception could be generic and even flexible for non-standard sample rates. Write the fastest PIO code with perfect center-of-the-eye sampling, regardless of what bit rate that would achieve. (I think that would be cy=4.) Then scale the PIO clock to slow it down accordingly.

We should add a timeout interrupt to the PIO DMA ISR, to detect sync loss and re-trigger a pulse measurement. Then wrap the flow into a state machine. Add some optional notification for the user. This way, the S/PDIF reception can run unattended, without a main loop polling for sync.

elehobica commented 1 year ago

Thanks for your reply.

Indeed, detecting edge to kick DMA to read from timer looks suitable for precise width measurement. On the other hand, I would like be cautious about introducing additional resource as PWM/Timer in this library. In my understanding, if targeting to detect one out of standard frequencies, sampling by PIO would be enough to distinguish each other. I'm not positive to expand support to non-standard frequencies when considering range of application generally supposed.

Your idea to determine polarity by Sync code M seems great. Using this method, the PIO program to be applied should be determined with information about minimum pulse width and which rising/falling edge interval has 6 stride interval. I will be able to implement this scheme with PIO sampling soon.

The sample timing accuracy depends on two factors. One is PIO sampling speed (interval) and another is accuracy of PIO frequency to input sampling frequency. Shorter sampling interval can minimize the delay between edge detection timing by wait instruction and actual signal's edge. Accurate PIO frequency can suppress the accumulation of sampling timing error. Current PIO program can reset the accumulated timing error by wait within relatively short symbols, therefore, I'd rather put more weight for PIO sampling speed (=125MHz) aspect.

Addition: Measure-Pulse-by-PIO is the implementation.

IDC-Dragon commented 1 year ago

Oh, quite some change! I see you wrote a 1-bit logic analyzer. ;-)

I thought PIO code would be less accurate in timing the changes, but with just sampling it's full speed. And you can offline analyze both rise and fall from one capture. So this is indeed better than my PWM timer capture.

The new measurement code doesn't work for me. Enabling the printf(), minimum is jumping all over the range, can even be 0. Need to check in more detail on another day, I don't understand the analysis loop yet.

Just a remark what I found in my experiments: There is a potential pitfall because of data dependency. Only when viewing with the right "polarity glasses" on you are guaranteed to see the minimum edge distance as 2 cycles and the maximum as 6 cycles. When looking the other way round, the preambles themselves don't yield extrema, it depends on the data. Minimum may be 3 cycles (if no subsequent 1's in the data), maximum may be 4 or 5, depending on the parity (last bit before preamble).

elehobica commented 1 year ago

Thanks. If minimum happens to be 0, the incoming data includes some noise. Wave width check should be inevitably severe because all samples should be checked. With timer, much more maybe.

There is a potential pitfall because of data dependency.

I don't think so. Minimum is guaranteed by sync codes and sample words should include at least 4 sync codes. In my environment, it works quite stable for all frequencies. However if it makes difference in yours, there could be something.

Add: I understand what you're saying exactly. However, supposing that Sync M Code is always included, minimum width should appear in both polarity.

0 -> 1 1 1 0 0 0 1 0 -> 1 1 -> 0 0 0 1 1 1 0 1 -> 0

IDC-Dragon commented 1 year ago

However, supposing that Sync M Code is always included, minimum width should appear in both polarity.

0 -> 1 1 1 0 0 0 1 0 -> 1 1 -> 0 0 0 1 1 1 0 1 -> 0

You will safely catch it in one of the computed minima, but not both. There are 4 combinations to consider, not 2: rising/falling minimum, signal being normal/inverted.

IDC-Dragon commented 1 year ago

That said, it would be enough to maintain just one minimum. I still don't know why the detection misbehaves for me. Meanwhile I studied your code and understand the intention. ;-) Some thoughts:

ARM has a CLZ instruction, should be utilized by __builtin_clz(). This could be useful and replace the de Bruijn bithack. You'd have to shift the other way, MSB first. Data also reads more nicely then, when printed with "%032b" format.
Bit timing checks should not be too strict. Electrically speaking, rise and fall times may be different and also depend on past data.
Not sure if the PIO program should block at the beginning, waiting for an edge. (What does the X counter do?) How about just free running and wait for the DMA to finish? Saves the sleep_us() and would be more ready towards an interrupt-driven state machine. It would be one instruction then, small enough to permanently have it in another slice of the same PIO.

elehobica commented 1 year ago

Hello.

The new measurement code doesn't work for me. Enabling the printf(), minimum is jumping all over the range, can even be 0.

I think I found the situation which you've encountered. If the analysis of first time has failed due to signal with noise (thus it pass through wait blocking), it looks unreasonably hard to get succeeded after 2nd time even though incoming signal is stable. After I inserted FIFO clear code at several points, it gets much better. (could be still not complete yet). I hope this will work to improve your situation.

elehobica commented 1 year ago

ARM has a CLZ instruction, should be utilized by __builtin_clz(). This could be useful and replace the de Bruijn bithack. You'd have to shift the other way, MSB first. Data also reads more nicely then, when printed with "%032b" format.

Thanks for this. Indeed, processing time looks same. I've just replaced with __builtin_ctz().

IDC-Dragon commented 1 year ago

There is no CTZ instruction, only CLZ. You really have to work from the beginning, MSB first. That's why you saw no improvement. Here is some test code to walk a buffer and scan for edges:

uint32_t buffer[] = {
    0b00000000000000000000000000000001,
    0b11111111111111111111111111111111,
    0b10000000000000000000000000000000,
};

const int buf_size = sizeof(buffer) / sizeof(buffer[0]);
int buf_index = 0; // incremented index within buffer
int edge_pos = 0; // accumulation for edge position
uint32_t shiftreg = buffer[buf_index++]; // working shift register
uint32_t polarity = 0 - (shiftreg>>31); // all bits set if MSB==1
shiftreg <<= 1; // consume the bit which went into polarity
int bits_left = 31; // shift register fill level
while (true) {
    if (shiftreg ^ polarity) {
        int pos = __builtin_clz(shiftreg ^ polarity); // index of first change
        if (pos < bits_left) { // no false alert because of (inverted) pulled zeros?
            shiftreg <<= pos; // avoid shift by 32 (pos+1)
            shiftreg <<= 1; // shift away the change bit
            edge_pos += pos + 1;
            bits_left -= pos + 1;
            polarity = ~polarity; // now search for other edge
            printf("%s edge at %d\n", polarity ? "rising" : "falling", edge_pos);
            continue; // don't fetch new data, we may have more transitions
        }
    }
    // fetch new data, if available
    if (buf_index < buf_size) {
        shiftreg = buffer[buf_index++];
        edge_pos += bits_left;
        bits_left = 32;
    } else {
        break; // exhausted, exit the loop
    }
}

I think this is more easy to follow and most likely faster. Memory is only read once and not modified. It is flexible to ignore a heading train of 0 or 1, only triggers after the first change.

elehobica commented 1 year ago

Okay. I revised the code for read once. __builtin_clz() should be done later. Thanks.

elehobica commented 1 year ago

Next, replaced __builtin_ctz() with __builtin_clz(), but I can't see speed improvement for this part.

Add: Can't find clz from instruction set of cortex-m0+.

IDC-Dragon commented 1 year ago

Oh, how embarrassing! Now learned that not all ARM are alike, my bad, very sorry for abducting you. In that case, the de Bruijn method working from the right should be faster? I found no similar bithack for leading zeros, what could __builtin_clz() do to be on par?

elehobica commented 1 year ago

No worries. there should be some differences, but slight. in that case, I'd prefer simplest and easy-to-look one (%032b in case), thus __builtin_clz() is no bad choice.

IDC-Dragon commented 1 year ago

Hi, I had to neglect this for a while. Your FIFO clearing commit indeed fixed the bouncy measurement for me. Then I had to accomodate my main() for not just checking for != SPDIF_RX_STATUS_STABLE, it was again kicking spdif_rx_search() while in state SPDIF_RX_STATUS_WAITING_STABLE already...

Meanwhile I rigged up on test tooling, made a script to synthesize an S/PDIF frame for my arb generator. So now I could test with any payload/polarity/samplerate, artificial jitter, bad slopes, etc. All that because I was originally too lazy to solder in an inverter.

About the detection code: for the minimum you're considering the smallest pulse width, not the distance between equal edges. IMHO this has disadvantages, the number is smaller by a factor of 2 (losing resolution) and most of all mixing edges. I would only compare edges of the same direction, since slope and trigger level may be different. And one minimum would be enough. The tolerance of the detection is very small, just +/-1 digit absolute, no relative component. Why not rather base it on what the PIO reception would be able to sync? The limit values could be part of the samp_freq_array table, so they don't need to be recalculated at run time. We're lucky that __builtin_clz(0) returns 32, by definition it is undefined.

For 192 kHz, I sometimes get the reading: min0 = 256, max_edge0 = 0, min1 = 3, max_edge1 = 0

IDC-Dragon commented 1 year ago

I tested a same-edge variant and raised a PR for that, please consider. The outliers are gone.

elehobica commented 1 year ago

That's so good that you managed to eliminate outlier analysis values by using only edge intervals. As I commented in MR, I agree that measuring edge interval will be safer than pulse width. However it is only the case as long as we focus on analysis part. Speaking of decode part, PIO program relies on both edges and estimate the center timing from the edge. That means illegal pulse width could spoil the proper timing to be sampled and thus affect the stability of decoding. I think to secure certain amount of minimum pulse width is unavoidable. So my question is what kind of input circuit you are using before the rp2040 input? Which do you use, Coaxial or Toslink? As far as I use the schematic shown in Readme, I've not encountered such illegal pulse width. This is just from my curiosity.

IDC-Dragon commented 1 year ago

Currently the Pico setup is only on my bench, nearly perfect conditions. Input has been a TOSLINK receiver (DLR2180, the best I know for up to 192 kHz) behing 3m of fiber. Or lately the arb generator, with which I could simulate any condition. In real world (using "normal" hardware, no Pico) I've been struggling with reception stability. 192 kHz is marginal after 15m of fiber or coax. For fiber, I've been retrofitting my gear with above receivers and DLT2180 transmitters, got it barely working. For coax, I had to try various cables. RG159 works for shorter distances. The best I found for the longer distance is antenna cable. Behind cables and pulse transformers S/PDIF may look like triangles or sawtooth... For building receivers, I tried MAX3280 and SM712 protection diodes. But that part is slower than the datasheet tells, 192 kHz was not achieved. Now I'm (ab)using an M-LVDS receiver with hysteresis, SN65MLVD2. That has enough headroom.

elehobica commented 1 year ago

Thanks for the info. This includes more than just the reference. It sounds you have so gorgeous testing environment now. Maybe I should try testing with longer cable (so far maximum was 1m).

If we pursue the robustness in decoding, ideally following would need to be done:

handle only single (rising) edge of normal polarity, not "relying on both edges"
take care of polarity inversion outside (because PIO doesn't permit gpio inversion nor equivalent handling of jmp pin instruction for both polarity.)

It is the trade-off against simple peripheral circuit. For the time being, I'd like to see how much the current PIO program is practical for actual usage.

IDC-Dragon commented 1 year ago

Oh, it turns out the input pin polarity is programmable. Try this: gpio_set_inover(pin, GPIO_OVERRIDE_INVERT) Just tested, it indeed works. That removes the necessity for the inverted PIO code. Sorry for all the inverted thinking that went into it...

elehobica commented 1 year ago

Thanks for this. now PIO program for inverted is gone with my meaningless effort. haha.

IDC-Dragon commented 1 year ago

Oh, that was quick. It really simplifies things. And quite some API change, it now can freely run and resync without foreground attention, nice.

For the reception, yesterday I spent some time with your PIO code, think I got the idea, although I'd find it hard to write and debug such. The assignment of scratch registers confused me. You use Y for zeros although you could write in null, 1, no need for a register and intuitive to read. Then OSR is (ab)used as a scratch register for ones, although you would have Y left, it took me a while to understand that there is no special function behind it. Only two 1 bits are populated, I would have ecpected to load it with 0xFFFFFFFF as a generic source of ones.

I was also thinking about a way to only wait/sync for 0->1 edges and sample the rest in a timed way. But it requires more state, which in PIO world is code lines.

Still I'm pregnant with the idea of having only one PIO code and scale the clock according to the detected rate. Not sure if we understood each other fully. The raw PIO code doesn't have to be tuned for a particular rate, rather as fast as possible with the instructions inbetween the jmp pin or wait in sample points. Carefully lay out the code to have the sampling instructions evenly spaced, whatever the minimum interval may be (I think 4 clocks/instructions). The fractional clock scaling does the rest.

elehobica commented 1 year ago

Thanks for telling "in null" usage. now two registers are free, waiting to be used.

Add: I've also changed OSR value to 0xFFFFFFFF.

Abusing ISR (OSR) as configuration value when PIO plays only output (input) role is the (standard?) technique to spare X and Y registers, which is referred from rp2040 document, although those registers are still not used yet actually.

IDC-Dragon commented 1 year ago

Congrats to the release, and real world tolerances. Thank you for PIO clarification. Now I fully understand the code. There is a dead part, the wait0pre label and subsequent instruction is not used. The preamble decoding is nicely solved, M and W continue with regular bits, B just skips the zeros and syncs with the next block. I'm not sure if the push instruction really should be blocking. In the (unlikely) event of DMA overflow, you'd lose sync if you block. If not blocking, you'd just lose data. Question about the detection PIO code: you upfront wait for 31 rising edges, is that still intended?

elehobica commented 1 year ago

Hi, thanks for your contributions, too, not only for the pull requests but also for the remarks you made here.

Yes, your understanding for Sync codes handling is exact.

wait0pre part is the garbage, I should delete it, thanks.

As for push instruction:

basically, blocking by FIFO full is not expected under normal situation, however, if it could happen, to lose sync is intended, because it can indicate sub frame data could be incomplete.
Under the assumption no FIFO full blocking happens and incoming bitstream is correctly decoded as each sub frame payload (32bits), push instruction play a role to align sync 4 bits on LSB of 32bits even though decoding starts from the middle of the sub frame.

32 sets of waiting egdes before caturing are intended. It depends on frontend circuit, but they can block capturing when no signal incoming, thus it doesn't generate unnecessary DMA data flow and the analysis is not needed, either. No meaning of 32, it's just the maximum loop iteration which we can get by set instruction.

elehobica commented 1 year ago

I understand your viewpoint. If push is not blocking, PIO will not lost its position in the input bitstream context even though output data is collapsed. It seems true. However I'm not for sure if it's clearly better for PIO to keep tracing bitstream when FIFO is full.

IDC-Dragon commented 1 year ago

Hi, I just made a PR with 2 little improvements to ISR runtime.

Maybe a few comments about C-bit handling. Doing that as 32 bits is of course more efficient, but be aware of the endianess. The example code gives the same results nevertheless, because spdif_rx_get_c_bits() is called with a uint32_t destination. However, the API is byte oriented, to avoid misunderstandings you may want to consider adjusting that to 32 bits, too. (The current implementation of spdif_rx_get_c_bits() is a hand-made memcpy()...)

And another about parity checking. This is still quite expensive. I have in mind to potentially let the PIO code generate a comparison bit. That would require to e.g. have register X count with every emitted 1, and before the push write one bit of X, and reset it somewhere. The produced preamble part needs to be shortened to 3 bits to make room in the result word (subsequent data also will appear shifted). Care must be taken about M and W continuation, they also emit a 1. So when detecting B, increment as well, to have that all the same. Just a sketched idea, actual implementation may vary. (Did I describe this comprehensible? )

elehobica commented 1 year ago

Hello.

I put the comment on PR side for the part corresponding to PR.

As for the parity, if what I understand is correct, we don't need to calculate parity either in C code or in PIO program actually as long as current PIO program is used. If the parity is wrong, it affects the sync polarity of next sub frame. however, current PIO program doesn't accept the inversion of sync polarity in the consistent bitstream context, therefore it simply results to lose its decoding position in incoming bitstream, which will lead to lose block alignment. I've never tried to confirm this in realtime environment. If you have arbitrary SPDIF signal generator, I hope you'll be able to confirm it.

If we could see the possibility for output bits to be contaminated after PIO decoding, only C side parity checking will be able to do this.

Therefore, one option to eliminate the calculation time of parity would be to change _fullcheck behavior to have new mode to skip only P (include C bit processing) by its level.

Just speaking of the technical side of calculating parity in PIO program, I think it would be simple to replace P bit output with the result of calculated parity. I've tried it loosely before and given up due to the limitation of the steps of PIO program.

IDC-Dragon commented 1 year ago

I think there is a misunderstanding about parity: The parity bit is encoded always correct as the transmitter, else that won't be a valid bitstream. So you will not see the polarity flip within one encoding. Parity makes sure the encoder reaches the same polarity for each subframe, since the number of '1' bits is made even. Parity allows the receiver to check if it got the bits correct, more specifically if there was a single bit error (or even more specifically, if the number of bit errors is odd). One flipped bit can be detected, 2 flipped bits would unfortunately not, 3 would be again be detected, and so forth. For low probability bit errors this is fine.

elehobica commented 1 year ago

I don't think you and I are saying different. If the parity is calculated into one bit in PIO side, do you see it can have the same function with the parity checking calculated by C side?

IDC-Dragon commented 1 year ago

Yes, I think PIO code could have the same effect. Don't know if my idea is the best way: The position of the calculated bit would have to be after the received P output bit, so after we got everything. The C loop can then access both received and calculated bit. To make room for that extra bit would require to shorten the preamble and would make the payload appear shifted. Your idea was to replace the last bit, hmm, we received it already...

Edit: But i is indeed worth a thought if the PIO code would be able to produce output with single bit errors. A misinterpretation of one bit would make it look for different polarity of the next bit, we'd lose sync. This is different to a hardware-type PLL receiver.

elehobica commented 1 year ago

hmm, so this is the point which I haven't understood quite well. What I understood from your idea is:

PIO transmits: sync (should be shrinken) + data + original P + calculated bit
C compares: original P with calculated bit

I don't have your point why comparing original P with calculated bit could have same meaning with checking parity with actual data in receiver C side.

IDC-Dragon commented 1 year ago

Yes, that's what I propose to output. When the parity calculation includes the original P (which would be easier in PIO code) it would even be enough to let the C code check the new bit for being zero. The original received P bit would be just "for reference", or because it simplifies the PIO code.

elehobica / pico_spdif_rx

faster lock, etc #3