Severson-Group / AMDC-Firmware

Embedded system code (C and Verilog) which runs the AMDC Hardware
http://docs.amdc.dev/firmware
BSD 3-Clause "New" or "Revised" License
31 stars 5 forks source link

Problem with new Timing Manager sensor mode #399

Closed codecubepi closed 3 months ago

codecubepi commented 4 months ago

There is some frequency jitter when using the new timing manager mode.

The theory is that the problem lies in the truncation of the elapsed time value returned from the timing_manager_get_tick_delta() function which is called in scheduler_run().

The possible solution is to track all times in double.

We should also add a function to get the expected tick delta based on the enabled sensors and user ratio, other timing manager configurations, etc.

Latest updates (20 Jun) here to minimize scrolling:

codecubepi commented 3 months ago

I tried switching all the usec times in the scheduler to from uint32_t to double type, but I'm still seeing jitter in the frequencies of my 3-phase currents as previously observed. Also, now when I am running in the new mode with all sensors disabled, my currents all become constant (ie switching has frozen).

npetersen2 commented 3 months ago

Hmm. Ok.

The currents will stay frozen if the control task stops running, and so the FPGA holds the previous duty cycle outputs.

For debugging the jitter: can you use the approach in the advanced tutorial on task profiling to directly measure the callback loop time? Store it to a log variable and make a plot of it during operation. This will be very enlightening as to why the output fundamental currents are behaving so strangely.

codecubepi commented 3 months ago

When the new mode is operating correctly (with no sensors enabled or just the ADC/Encoder), I logged the following plots of loop time and run time:

image

image

It looks good, the loop time is hovering right around 100us, or a control frequency of 10 kHz.

When the new mode is operating correctly (with the AMDS enabled), I logged these plots of loop time and run time:

image

image

The loop time is now jumping between 100us and 200us. The currents I see on the scope are also 300 Hz instead of the commanded 600 Hz.

codecubepi commented 3 months ago

Additionally, after it's been running well in the new mode, I can still freeze the control task by disabling the sensors. I've been looking through the Timing Manager FPGA code, and it seems like it should still be trying to generate the interrupt, so I'm guessing this problem is also somewhere in the C code.

npetersen2 commented 3 months ago

Hmm interesting. The 300 Hz vs 600 Hz case makes sense: since the control code is running at about 2x slower in the second plots, the theta variable is updated 2x slower, so the output frequency becomes 1/2.

However, the question is still: why does enabling the AMDS cause it to "skip" a time slice for the task.

For next steps, I recommend looking into the scheduler and trying to understand why the skip is happening. I would log variables from the scheduler itself. For standard 10 kHz / 100 kHz settings, you should find that every time the scheduler is "un-paused" via the timing manager ISR, the elapsed_usec should increment by 100 usec and trigger the task to run again. It looks like it increments by 99.999 microseconds, which does not pass the check... so it skips the task and then runs it the next slice, so 200 usec. Eventually, it wraps around and then runs the task as desired (due to the phasing), so you see the 100 usec delta.

Can you log this from the scheduler and show this?

npetersen2 commented 3 months ago

So, really, you want to dump this function call output into a global variable for logging:

https://github.com/Severson-Group/AMDC-Firmware/blob/dc0fdd0ce74fe83c6990ae3c7bc2b5b64751a369/sdk/app_cpu1/common/sys/scheduler.c#L62

I bet the usec it increments is 99.9999 usec, instead of 100.

codecubepi commented 3 months ago

It looks like you're correct. I'm logging the tick_delta within scheduler_tick(), and these are the results I get:

image

Adding the following subtraction to the check in scheduler_run() removes the jitter we were seeing on the scope, since the task is being scheduled for execution at the correct loop frequency now...

image

image

I'm not sure that this "naive" subtraction solution is the best though. I'm open to other suggestions to fix.

npetersen2 commented 3 months ago

Wow nice work! It looks like this is really a floating point rounding issue / FPGA timer resolution issue. I like your solution, but would implement it something like:

if (is_close(usec_since_last_run, t->interval_usec, SCHEDULER_INTERVAL_THRESHOLD)) {
   // ... run task
}

where is_close(a,b,tol) implements the standard floating point "near-ness" comparison, i.e., see https://numpy.org/doc/stable/reference/generated/numpy.isclose.html

per the docs above: absolute(a - b) <= (atol + rtol * absolute(b)) which includes both absolute and relative tolerances.

I like that it works with a 50ns tolerance... I think this is totally acceptable to have this amount of tolerance.

TODO: we need to justify this 50ns value... why not 60ns? 40ns? 100ns? I think this can come down to the resolution of the FPGA timer, i.e., 5ns, and a factor of X. We just need to pick X (like your X=10 for 50ns), then run some tests over range of scheduler settings to ensure it works well. Maybe add a check in the tick function to make sure the issue isn't occurring.

In the data you collected, what is the max deviation from the expected 100 usec value? In your screenshot, it looks like 20ns. But, I would want to quantify this over many more samples... :)

codecubepi commented 3 months ago

@npetersen2

Yeah when I look for the max deviation over all my samples, I get 20ns.

So... our implementation of is_close() should have both an absolute and relative tolerance? Or just the SCHEDULER_INTERVAL_THRESHOLD?

Also, the version of is_close() in numpy will check for only return true, pass the check, and run the task if the usec_since_last_run and task interval are close (obviously). In the current implementation, the task will still run even if the usec_since_last_run is much greater than the interval. Is this an okay change to the current behavior?

npetersen2 commented 3 months ago

Ah good point. I think it should just be an absolute tolerance of some factor of 20ns, maybe 60ns. And also run tasks with greater than the interval time (like the current code)… so this is exactly the code you wrote above :)

Please just make the code very clear about this and explain why we are adding this fudge factor. And make the factor a define or const etc, no magic numbers

codecubepi commented 3 months ago

@npetersen2

The jitter issue seems fixed by adding the timing tolerance. Let me know if you think these comments are good enough:

https://github.com/Severson-Group/AMDC-Firmware/blob/7518479cf7ef6994b676f00023b28e2e181b9ad8/sdk/app_cpu1/common/sys/scheduler.c#L197-L201

https://github.com/Severson-Group/AMDC-Firmware/blob/7518479cf7ef6994b676f00023b28e2e181b9ad8/sdk/app_cpu1/common/sys/scheduler.h#L16-L19

However, I am still dealing with the currents freezing if I disable the sensors in the timing manager while running in the new mode.

Stepping through the debugger, I can tell that the control task callback is being executed, and the duty ratios are being updated and written via the pwm_set_duty_raw() function. So I am really not sure why I see no change on the scope.

npetersen2 commented 3 months ago

Does the PWM duty cycle flush the output correctly when in the new mode, and when no sensors are enabled?

It sounds like everything appears to work correctly, but the duty cycle update gets lost after writing to the fpga registers?

codecubepi commented 3 months ago

No.. it seems like the new mode refuses to update the duty ratios regardless of whether or not sensors are enabled. The blink task still works. And yeah, everything with the scheduler, control callback, and register write appear to work. Just that the duty cycle update gets lost after writing to the fpga registers.

codecubepi commented 3 months ago

And I've discovered a new problem... sometimes the logging command interface will fail with INVALID_ARGUMENTS, and sometimes it works just fine:

image

image

I'm not sure if this is also because of the conversion of times from uint32_t to double or what.

npetersen2 commented 3 months ago

Yea, I have seen this a lot myself. I am not sure it is just with the log commands---I think it might be with the entire command processing pipeline... some lurking bug that, every 50-100 commands, it fails with invalid arguments... I have not ever bothered to debug it... but would be fairly easy to spam the AMDC with commands in python and just put a breakpoint in the command handler return of invalid, then take a look at the argv/argc variables and see why/how they look funny. Or, just ignore it for now

codecubepi commented 3 months ago

Some results from lab debugging:

The problem with the duty ratios no longer updating is due to the following faulty logic:

The solution was to remove all_done's reliance on sensors enabled. This can be done safely now that all sensors are done upon reset.

Some Plots Gathered

Control task running with new timing manager mode (which runs the scheduler ISR when all sensors are done).

For the first half of the time, no sensors are enabled (which is the same as the "legacy" behaviour). Halfway through, the AMDS is enabled in the timing manger.

The plot shows loop time for 100 kHz switching and 10 kHz control. When the AMDS is enabled, the loop time jumps from 100us to 110us for a single cycle because the timing manager is now waiting for the AMDS to complete sampling. All future loop times return to the expected 100us.

image

We also found this weird plot of the control task run time decreasing when the AMDS is enabled in the timing manager:

image

codecubepi commented 3 months ago

Plots of latest issue: disabling and re-enabling sensors during operation leads to inconsistent scheduler_tick() loop time:

image

image

anirudhupadhyaya commented 3 months ago

Hey @codecubepi , the fix for logging was in common/main.c. The timing_manager_init() needs to come before user_apps_init(), so that any timing settings set by the user (PWM sw frew, sync. TM ratio, etc) are not overwritten by the defaults.

codecubepi commented 3 months ago

Thanks @anirudhupadhyaya !

codecubepi commented 3 months ago

Wrapping up this issue's conversation...

With the control task jitter and general scheduler/logger bugs resolved, I'm going to close this issue.

For the problem with the sensor enabling/disabling leading to inconsistent loop time, see #402. I think the scope of that is big enough to warrant its own issue. And I'm tired of scrolling through this conversation.