mhv-shared commented 7 years ago

Do you plan to add support for LCD + encoder? Or will control always be done through octoprint?

marcio-ao commented 6 years ago

Anyone know Chinese? I'm curious what Klipper's first words were...

KevinOConnor commented 6 years ago

On Wed, Jan 17, 2018 at 08:50:35AM -0800, Marcio Teixeira wrote:

@KevinOConnor : That worked! I modified "lcd_st7920.py" to turn on the blinking text cursor. You can see it on the upper left. There probably still are a few timing issues as I am also getting some garbage Chinese characters. But hey, it's a first step! Where would I go to alter the timing from 72u to something longer?

That's in src/lcd_st7920.c:st7920_xmit() - change the timer_from_us(72) to something else. I doubt that's the issue though - I've probably messed up the protocol somewhere.

The demo "work_event" code was supposed to move hash marks around the display - I guess that's not happening?

-Kevin

KevinOConnor commented 6 years ago

On Wed, Jan 17, 2018 at 12:00:33PM -0500, Kevin O'Connor wrote:

On Wed, Jan 17, 2018 at 08:50:35AM -0800, Marcio Teixeira wrote:

@KevinOConnor : That worked! I modified "lcd_st7920.py" to turn on the blinking text cursor. You can see it on the upper left. There probably still are a few timing issues as I am also getting some garbage Chinese characters. But hey, it's a first step! Where would I go to alter the timing from 72u to something longer?

That's in src/lcd_st7920.c:st7920_xmit() - change the timer_from_us(72) to something else. I doubt that's the issue though - I've probably messed up the protocol somewhere.

Indeed - the SYNC_DATA bits were not correct. I pushed a change to the branch - it requires a reflash of the mcu code.

-Kevin

marcio-ao commented 6 years ago

@KevinOConnor: I wasn't getting anything when I tried your code.

I believe you may have misunderstood the protocol a bit. Writing data to the ST7920 is independent of setting the address. There is an address register on the chip and you use special commands to populate that address register. Once this is done, you can write one or more bytes using a separate write_data command and the chip automatically updates the address counter.

The only difference between a regular command and the "write_data" command is the "rs" bit. There is no sense in having the two being very different in implementation. At minimum, "send_data" should be just like "send_command" with the exception of the rs bit being set. Alternatively, you could have "send_data" take multiple bytes. This is the one-byte-at-a-time version:

send_command(self, cmd)    % Sends a command:  11111000xxxx0000xxxx0000
send_data(self, data)             % Sends a byte:           11111010xxxx0000xxxx0000

Writing multiple bytes at an address would look like this:

   send_command(self, 0b10000000 | (addr & 0b00111111))   # set ddram address
   send_data(self, 0xFF);                                                          # write a data byte
   send_data(self, 0x00);                                                          # write a data byte
   send_data(self, 0xFF);                                                          # write a data byte
   send_data(self, 0x00);                                                          # write a data byte

Alternatively, if send_data could take multiple bytes, then it could look like this:

   send_command(self, 0b10000000 | (addr & 0b00111111))   # set ddram address
   send_data(self, [0xFF,0x00,0xFF,0x00]);                               # write multiple bytes

I think it is important that the writing of the address be independent of writing of the bytes because the graphics mode uses a peculiar syntax where you have to write the y and x addresses in sequence. It looks like this:

    def lcd_set_gdram_address(self, x, y):
        self.send_command(0b10000000 | (y    & 0b01111111))
        self.send_command(0b10000000 | (x    & 0b00001111))

marcio-ao commented 6 years ago

@KevinOConnor: Ahhh! Nice! I just discovered something that is not in the datasheet at all. It is possible to send multiple bytes in a row without the sync bits. The sync, rs and rw bits are only needed at the start of the transmission. So, if we want to send a collection of bytes aaaaaaaa, bbbbbbbb and cccccccc as data, it looks like this:

11111010aaaa0000aaaa0000[delay 72u]bbbb0000bbbb0000[delay 72u]cccc0000cccc0000...

I'm thinking it may also be possible to do it for commands as well. This will increase the max throughput by a little bit.

KevinOConnor commented 6 years ago

On Wed, Jan 17, 2018 at 09:20:29AM -0800, Marcio Teixeira wrote:

@KevinOConnor: I wasn't getting anything when I tried your code.

Okay, thanks. It still doesn't work with the SYNC_DATA fix?

[...]

Alternatively, if send_data could take multiple bytes, then it could look like this:
   send_command(self, 0b10000000 | (addr & 0b00111111))   # set ddram address
   send_data(self, [0xFF,0x00,0xFF,0x00]);                               # write multiple bytes

That's what send_data() tries to do now - if multiple send_data() commands are sent in succession with incremental addresses, then send_data() will automatically combine them into a single burst of data. The idea was to try and reduce the amount of serial bandwidth for the common case where multiple character cells are written at once.

I think it is important that the writing of the address be independent of writing of the bytes because the graphics mode uses a peculiar syntax where you have to write the y and x addresses in sequence. It looks like this:
    def lcd_set_gdram_address(self, x, y):
        self.send_command(0b10000000 | (y    & 0b01111111))
        self.send_command(0b10000000 | (x    & 0b00001111))

I was thinking that could be done with something like:

def send_graphics_data(self, x, y, data): if y == self.last_gfx_y and x == self.last_gfx_x + len(self.pending_data)//2: self.pending_data.extend([(data >> 8) & 0xff, data & 0xff]) return self.flush_data()

Send gfx position update

self.send_command(... extended mode ...)
self.send_command(... y position ...)
self.send_command(... x position ...)
self.send_command(... basic mode ...)
self.pending_data.extend([(data >> 8) & 0xff, data & 0xff])

That is, instead of having the drawing code try to batch up writes, just have it write out to arbitrary locations and have the send_x_code() batch it up on transmit.

Granted, the current st7920_burst mcu command likely needs more work before it will work well with the above.

-Kevin

KevinOConnor commented 6 years ago

On Wed, Jan 17, 2018 at 09:38:34AM -0800, Marcio Teixeira wrote:

@KevinOConnor: Ahhh! Nice! I just discovered something that is not in the datasheet at all. It is possible to send multiple bytes in a row without the sync bits. The sync, rs and rw bits are only needed at the start of the transmission. So, if we want to send a collection of bytes aaaaaaaa, bbbbbbbb and cccccccc as data, it looks like this:

11111010aaaa0000aaaa0000[delay 72u]bbbb0000bbbb0000[delay 72u]cccc0000cccc0000...

I'm thinking it may also be possible to do it for commands as well. This will increase the max throughput by a little bit.

Interesting.

Two other questions I'd have:

The chip needs 72us to run the command, but since the command takes over 14us to get to the chip (or 9 if no sync), could the transmission start earlier - thus making the delay 58us (or 63us)?
Does the chip really need to go back to "basic instruction set" before writing to the graphics ram? Or, could we just do: set_gfx_address(y), set_gfx_address(x), write_data(data)?

-Kevin

marcio-ao commented 6 years ago

@KevinOConnor: At this point, I'm not sure. I think I have to do more experimentation. The datasheet is quite poor and leaves out some vital information.

It appears as if the 72u delay is necessary in some cases (the datasheet implies it is needed after all commands), but it appears to be possible to write multiple bytes following a single sync and this isn't mentioned in the datasheet at all. I think I need to do more experimentation with this to get a better handle on what is really necessary before I ask you to make any more changes :)

marcio-ao commented 6 years ago

@KevinOConnor: Okay, I seem to have wrapped my head around this. It looks like commands need to be separated by 72u, but memory writes do not. It is best to think of the sync and the data as two separate actions:

void lcd_sync(bool rs, bool rw) {
  LCD_SEND(1); // Sync 1
  LCD_SEND(1); // Sync 2
  LCD_SEND(1); // Sync 3
  LCD_SEND(1); // Sync 4
  LCD_SEND(1); // Sync 5
  LCD_SEND(rw);
  LCD_SEND(rs);
  LCD_SEND(0);
}

void lcd_data(uint8_t data) {
  LCD_SEND(data & 0b10000000);
  LCD_SEND(data & 0b01000000);
  LCD_SEND(data & 0b00100000);
  LCD_SEND(data & 0b00010000);
  LCD_SEND(0);
  LCD_SEND(0);
  LCD_SEND(0);
  LCD_SEND(0);
  LCD_SEND(data & 0b00001000);
  LCD_SEND(data & 0b00000100);
  LCD_SEND(data & 0b00000010);
  LCD_SEND(data & 0b00000001);
  LCD_SEND(0);
  LCD_SEND(0);
  LCD_SEND(0);
  LCD_SEND(0);
}

A command is sync(rs=0)+data and they must be separated by 72u:

void lcd_cmd(uint8_t data) {
  static unsigned long last_command = 0;
  while(micros() - last_command < 72);
  lcd_sync(0, 0);
  lcd_data(data);
  last_command = micros();
}

A write consists of a sync(rs=1)+data+data+...+data and there is no 72u waiting requirement. I chose to implement it in my Arduino code as a "write_begin", followed by "write_byte" or "write_word":

void lcd_write_begin() {
  lcd_sync(1,0);
}

void lcd_write_byte(uint8_t w) {
  lcd_data(w & 0xFF);
}

void lcd_write_word(uint16_t w) {
  lcd_data((w >> 8) & 0xFF);
  lcd_data((w >> 0) & 0xFF);
}

This seems to work quite well. Now, I just need to see if I can understand how to implement this using the primitives you placed in Klipper.

marcio-ao commented 6 years ago

@KevinOConnor: I tried your original example code again and it's not working. I'm still not sure what is up.

I'm trying to understand how "st7920_task" works. As far as I can tell, you are always sending the first byte as a command, then subsequent bytes as data? Is this correct?

marcio-ao commented 6 years ago

As far as I can tell, the first byte, the command byte is being correctly delivered. But subsequent data bytes are not.

KevinOConnor commented 6 years ago

On Wed, Jan 17, 2018 at 07:49:38PM +0000, Marcio Teixeira wrote:

As far as I can tell, the first byte, the command byte is being correctly delivered. But subsequent data bytes are not.

Can you confirm you did a "git pull" and reflashed with the SYNC_DATA fix?

-Kevin

marcio-ao commented 6 years ago

@KevinOConnor : Yes

marcio-ao commented 6 years ago

@KevinOConnor: In "void st7920_task(void)", the following line seems weird to me:

s->next_pos = s->buffer[0] + 1;

It seems like you are adding the value of a character in the buffer to the index, this does not seem right.

marcio-ao commented 6 years ago

Nevermind, I think you are embedding length bytes in the buffer, in addition to characters. Makes a little more sense to me now.

KevinOConnor commented 6 years ago

On Wed, Jan 17, 2018 at 12:45:04PM -0800, Marcio Teixeira wrote:

@KevinOConnor: In "void st7920_task(void)", the following line seems weird to me:

s->next_pos = s->buffer[0] + 1;

It seems like you are adding the value of a character in the buffer to the index, this does not seem right.

That was intentional - the idea was to be able to store multiple st7920_burst commands in the local buffer. So, we need to know how many bytes is in the next burst.

I think I found the problem - I messed up with the C integer promotion rules (bleh). Can you do a "git pull" and retry?

FYI, the console.py tool may help with these types of unit tests. You can reset the mcu and run:

~/klippy-env/bin/python ~/klipper/klippy/console.py /dev/ttyACM0 250000

and then enter commands like the following:

allocate_oids count=1 config_st7920 oid=0 sclk_pin=PJ2 sid_pin=PG3 set_digital_out pin=PG4 value=1 finalize_config crc=0

st7920_burst cmd_then_data=802323 st7920_burst cmd_then_data=8123232323

The format of cmd_then_data is a hex string (two characters per byte). The first byte should go out as a command and any subsequent bytes should go out as data.

-Kevin

marcio-ao commented 6 years ago

@KevinOConnor: The console was helpful. Things are behaving weird, though. Sometimes I am able to write an address and a few characters to the display; but at other times, the whole command has no effect. Sometimes a lot less than I requested actually gets written.

Anyhow, I do feel we are getting close.

marcio-ao commented 6 years ago

Here's a thought: Is having the ability to enqueue multiple commands really necessary? That is adding a lot of complexity to the "lcd_st7920.c" code. My thoughts are if the Python code is running fast enough that it is sending a new command before the last one is done being sent to the LCD, then it's going to chew right through the 32 byte buffer in no time. I would suggest just buffering one command. That will simplify the code since you won't have to have the end_pos, cur_pos, next_pos indices.

marcio-ao commented 6 years ago

Also, if you do want to queue up multiple commands, a circular buffer might be a better idea. You wouldn't need to do "memcpy" to move data from the end of the buffer to the head of the buffer to free up space.

KevinOConnor commented 6 years ago

On Wed, Jan 17, 2018 at 09:41:42PM +0000, Marcio Teixeira wrote:

@KevinOConnor: The console was helpful. Things are behaving weird, though. Sometimes I am able to write an address and a few characters to the display; but at other times, the whole command has no effect. Sometimes a lot less than I requested actually gets written.

There was another error in the C code - I was setting the SID after setting SCLK high when it should be done before. I've updated the work-lcd-20180115 branch.

On Wed, Jan 17, 2018 at 09:53:44PM +0000, Marcio Teixeira wrote:

Here's a thought: Is having the ability to enqueue multiple commands really necessary? That is adding a lot of complexity to the "lcd_st7920.c" code. My thoughts are if the Python code is running fast enough that it is sending a new command before the last one is done being sent to the LCD, then it's going to chew right through the 32 byte buffer in no time. I would suggest just buffering one command. That will simplify the code since you won't have to have the end_pos, cur_pos, next_pos indices.

Yeah, I'd do things diffently now that I have a different understanding of the protocol. In particular..

On Wed, Jan 17, 2018 at 10:46:32AM -0800, Marcio Teixeira wrote:

@KevinOConnor: Okay, I seem to have wrapped my head around this. It looks like commands need to be separated by 72u, but memory writes do not. It is best to think of the sync and the write as two separate actions: [...] A write consists of a sync(rs=1)+data+data+...+data and there is no 72u waiting requirement. I chose to implement it in my Arduino code as a "write_begin", followed by "write_byte" or "write_word":

That's interesting as it means that data writes are very fast. Instead of a byte taking ~87us it only takes ~10us. That also means that it's probably not necessary to buffer pure data writes, as the command handler can probably just directly write them out (assuming there are no commands pending). It also means that the python code should probably do more bulk writing than pointer moving. (For example, if 3 bytes are updated on line 1 col 4 and 5 bytes are updated on line 1 col 10, then it's probably faster to send the whole line than to try and update just what's changed.)

It might make sense for the python code to create a "framebuffer", have the drawing code update the framebuffer, and then on each update check what's changed in the framebuffer and send those differences to the mcu. We could keep a framebuffer for both the text memory and for the graphics memory.

-Kevin

marcio-ao commented 6 years ago

@KevinOConnor : I think I now understand the code well enough that I might be able to try some things on my end. Maybe I'll try implementing the faster memory write.

I did run into one thing I wanted to ask you about:

static void
st7920_xmit(struct st7920 *s, uint32_t data)
{
    struct gpio_out sclk = s->sclk, sid = s->sid;
    uint8_t i, last_b = 0;
    for (i=0; i<3; i++) {
        uint8_t b = data >> 16, j;
        data <<= 8;
        for (j=0; j<8; j++) {
            gpio_out_toggle(sclk);
            if ((b ^ last_b) & 0x80)
                gpio_out_toggle(sid);
            gpio_out_toggle(sclk);
            last_b = b;
            b <<= 1;
        }
    }
    s->nexttime = timer_read_time() + timer_from_us(100);
    if (last_b & 0x80)
        gpio_out_toggle(sid);
}

I kind of did a double-take once I noticed you were going to the trouble of tracking of "last_b" and toggling the state of "sid" when needed, rather than simply setting the state to "on" or "off". On the Arduino, setting an output bit uses only one instruction (sbi/cbi) and from my experience the compiler does a fine job of optimizing something like PORTD = PORTD | 0b00000100 down to that instruction. What you have involves several comparisons and branches and thus is way more than one instruction. Digging down further, I found that in "gpio.c", "gpio_out_write" saves the IRQ state, once again, I am perplexed as to why this is needed. I think the code could be made a lot more readable if it was not necessary to use a toggle -- and it would be faster to boot!

marcio-ao commented 6 years ago

@KevinOConnor: There was another error in the C code - I was setting the SID after setting SCLK high when it should be done before.

Ah, yes. I see that now. Also, according to the datasheet, the transition period of SCLK can be no less than 312ns. This is roughly 5 Arduino instructions. Since you are doing a function call to toggle, it's probably meeting that requirement, but if that function were to be inlined, it may be necessary to insert NOPs. This is what I am using in my Arduino test code:

define DELAY_312ns asm("nop\n\tnop\n\tnop\n\tnop\n\tnop\n\t");

marcio-ao commented 6 years ago

@KevinOConnor : I think I have it working now. There was yet another undocumented behavior that was tripping me up (and which was tripping up your original test code). It turns out that you can write bytes to DRAM all you want, but the display will only refresh when you issue a non-write command (i.e. something with rs=0). Since my original demo all had an animation loop which intermixed RAM data writes and non-write commands, I was forcing refreshes all the time and I never noticed this issue.

Anyhow, now that I got the basics down, I'll try optimizing things a bit and adding a more interesting interface.

andreq commented 6 years ago

Just wanna say : Great freaking work guys. I didn't think it was worth it at first, but this was figured out quite nicely reading this thread :)

KevinOConnor commented 6 years ago

On Thu, Jan 18, 2018 at 06:40:11AM -0800, Marcio Teixeira wrote:

I did run into one thing I wanted to ask you about: [...] I kind of did a double-take once I noticed you were going to the trouble of tracking of "last_b" and toggling the state of "sid" when needed, rather than simply setting the state to "on" or "off". On the Arduino, setting an output bit uses only one instruction (sbi/cbi) and from my experience the compiler does a fine job of optimizing something like PORTD = PORTD | 0b00000100 down to that instruction. What you have involves several comparisons and branches and thus is way more than one instruction. Digging down further, I found that in "gpio.c", "gpio_out_write" saves the IRQ state, once again, I am perplexed as to why this is needed. I think the code could be made a lot more readable if it was not necessary to use a toggle -- and it would be faster to boot!

The sbi/cbi instruction only works if the pin is known at compile time and it only works for some pins. It's 1 cpu cycle in this case, otherwise it's a minimum of 5 cycles (but in practice likely 8+ to make it atomic).

Some AVR printer firmwares go to great lengths to compile in the pin definitions to get this 1 cycle optimization. In practice, though, doing so leads to a significant pessimization as the complexity of supporting it tends to eliminate significant high-level optimizations. (It's also a major pain for end-users as they then have to recompile the flash on any pin change.) Not bothering with this minor optimization is one way that Klipper is able to achieve much higher real-world performance.

On the AVR, a pin toggle is a minimum of 2 cycles. It's actually faster than doing a basic gpio_out_write() in the generic case. Note that gcc does an excellent job of inlining (with -fwhole-program), so gpio_out_toggle() is not a function call. I organized the st7920_xmit() code, so that in practice, the gcc generated code is at least 6 cycles per clock toggle. That is, as a trick, I'm updating the last_b in the time that would otherwise be nops.

-Kevin

marcio-ao commented 6 years ago

@KevinOConnor: Sounds like you've thought this through well from a performance standpoint. From a readability standpoint, however, it seems like it could be improved. Since there is 300ns available between the the SCLK, wouldn't gpio_out_write() be an option be a decent candidate? It doesn't seem to me like disabling interrupts is strictly necessary in this case, so maybe even a version of gpio_out_write() that allowed interrupts?

On another topic, on an earlier comment you said you didn't have hardware to test the LCD display. If it would save both of us time in the back-and-forth, my company would be able to provide with some hardware for testing. If you are interested, please write me at marcio@alephobjects.com and we can work out the details.

KevinOConnor commented 6 years ago

On Thu, Jan 18, 2018 at 05:29:52PM +0000, Marcio Teixeira wrote:

@KevinOConnor: Sounds like you've thought this through well from a performance standpoint. From a readability standpoint, however, it seems like it could be improved. Since there is 300ns available between the the SCLK, wouldn't gpio_out_write() be an option be a decent candidate? It doesn't seem to me like disabling interrupts is strictly necessary in this case, so maybe even a version of gpio_out_write() that allowed interrupts?

FYI, you definitely need to disable irqs for the generic AVR gpio write - the write involves a read, update, write process - for example to set PA5 would be: reg=PORTA; reg|=0x20; PORTA=reg;

If interrupts are not disabled, then there is a possibility that an interrupt could occur after the register read (eg, reg=PORTA) and change an unrelated pin (eg, PA3) which would cause the register write (eg, PORTA=reg) to incorrectly alter this other pin.

That aside, I agree the st7920 code could be improved. It's just demo code.

-Kevin

marcio-ao commented 6 years ago

That aside, I agree the st7920 code could be improved. It's just demo code.

Ah. Good point. I'm sorry for being overly critical. I do appreciate your help with this and your willingness to explain your choices.

marcio-ao commented 6 years ago

@KevinOConnor : Just a small update. I got the basic graphics display to show up. I made a few modifications to "lcd_st7920.c" so that up to 15 data bytes are streamed at a time after a single SYNC_DATA byte. I also modified things so it is possible to send data bytes without an associated command byte by setting the command to zero. This is useful since now the Python code no longer needs to keep and update a shadow address counter. This simplifies things a lot.

I would like to add encoder support so I can make a basic menu interface. The way it would work is the FW would poll three input lines (two for the two encoder gray-code bits, one for the push button), count pulses and keep track of the encoder position in an uint8_t variable. For an encoder of 32 steps per rotation, assuming a maximum rotation of 60 RPM, this would be 1920 pulses per second, so the FW task would need to run every ~500us.

Could you add an "lcd_encoder_gray_code.c" to the FW and set it up with the following additional configuration block:

[encoder_gray_code] lcd_en_gray_code_1 = J1 lcd_en_gray_code_2 = J2 lcd_en_btn = H6

I would also need to have the ability to call a C function from python to retrieve the current value of the uint8_t, such as "encoder_get_position".

Now, I could do all this in the existing "lcd_st7920.c", but I think it may make sense to have it separate "encoder" FW entity since not all LCD modules have an encoder. And there may be other LCD modules out there which are not the ST7920 which may need an encoder.

As for the Python code, I think "lcd_st7920" could initialize the encoder if it was configured, so I think there is no need for a corresponding "encoder.py"

Thoughts?

-- Marcio

KevinOConnor commented 6 years ago

On Fri, Jan 19, 2018 at 06:36:38AM -0800, Marcio Teixeira wrote:

@KevinOConnor : Just a small update. I got the basic graphics display to show up. I made a few modifications to "lcd_st7920.c" so that up to 15 data bytes are streamed at a time after a single SYNC_DATA byte. I also modified things so it is possible to send data bytes without an associated command byte by setting the command to zero. This is useful since now the Python code no longer needs to keep and update a shadow address counter. This simplifies things a lot.

Great!

I would like to add encoder support so I can make a basic menu interface. The way it would work is the FW would poll three input lines (two for the two encoder gray-code bits, one for the push button), count pulses and keep track of the encoder position in an uint8_t variable. For an encoder of 32 steps per rotation, assuming a maximum rotation of 60 RPM, this would be 1920 pulses per second, so the FW task would need to run every ~500us.

Is it necessary to do button de-bouncing in the software?

Does Marlin (and similar) regularly poll the gpios like the above, or is some other mechanism (like irqs) used?

Could you add an "lcd_encoder.c" to the FW and set it up with the following additional configuration block:

Sure. I don't think I'll be able to look at it until the end of next week though.

-Kevin

marcio-ao commented 6 years ago

@KevinOConnor: I wanted to send over to you what I have. It already draws the full LCD status screen. Could you maybe look into merging this in and populating the following fields with live data?

        self.extruder_count      = 1   # Can be 1 or 2
        self.extruder_1_temp     = 100
        self.extruder_1_target   = 210
        self.extruder_2_temp     = 178
        self.extruder_2_target   = 205
        self.feedrate_percentage = 100
        self.bed_temp            = 0
        self.bed_target          = 110
        self.print_progress      = 0
        self.print_time_hrs      = 0
        self.print_time_min      = 0
        self.fan_percentage      = 90
        self.position_x          = 0
        self.position_y          = 0
        self.position_z          = 0

Attached are all the files I modified.

lcd_st7920.c.txt lcd_st7920.py.txt lulzbot-taz6-2017.cfg.txt

Also, for the moment, if you look "lcd_st7920.c", you will see I have a variable called QUEUE_MULTIPLE_COMMANDS. This was something I was working on, but so far it was causing Klipper to crash (with a "shutdown" message), so right now I am only sending one command at a time. This works well and I am able to draw the entire interface.

marcio-ao commented 6 years ago

Sure. I don't think I'll be able to look at it until the end of next week though.

Okay. I need to take a break from Klipper anyway, as I've been asked to get the new interface integrated into Marlin so we can have it for our next project launch. For now, the code above does work, if anyone else wants to play with it.

marciot commented 6 years ago

@KevinOConnor : Pardon me jumping in after-hours here (from a different account), but I was thinking about what you said earlier about how in Klipper setting IO pins is a non-atomic operation. If I understood correctly your explanation, this was because Klipper uses run-time configuration of pins, whereas other FW achieve atomic pin setting by using C macros and compile-time configuration.

Well, I just had an idea that may combine the best of both worlds. I wanted to run it by you before I forget. What if I/O operations in Klipper were done like this:

void set_io_pin(uint8_t pin) {
   switch (pin) {
      case 0: PORTA |= PORTA & (0b000000001); break;
      case 1: PORTA |= PORTA & (0b000000010); break;
      case 2: PORTA |= PORTA & (0b000000100); break;
         ...
      case 8: PORTB |= PORTB & (0b000000001); break;
      case 9: PORTB |= PORTB & (0b000000010); break;
         ...
      case 16: PORTC |= PORTC & (0b000000001); break;
      case 17: PORTC |= PORTC & (0b000000010); break;
      ...
   }
}

The idea here is you can still have run-time configuration of pins and yet have atomic access to the GPIO pins. The downside, of course, is that this function is enormous and is thus non-inlineable, but even with the function call overhead, I suspect it may be quite efficient. I don't know enough Atmel assembly to actually count instructions, but here are some assumptions:

4-7 instructions cycles for function call, mainly pushing address onto stack, jumping to routine and doing the reverse on exit.
1-2 instructions for the switch statement, which would simply be a jump table.
1 instruction for setting the pin.

So, maybe 10 instructions to set a pin in a run-time configurable, atomic fashion. Not bad. And the cost is just a few hundred instructions in Flash, but Flash is quite plentiful and Klipper uses very little of it.

Thoughts?

-- Marcio

KevinOConnor commented 6 years ago

On Fri, Jan 19, 2018 at 08:17:57PM -0800, Marcio T. wrote:

@KevinOConnor : Pardon me jumping in after-hours here (from a different account), but I was thinking about what you said earlier about how in Klipper setting IO pins is a non-atomic operation. If I understand correctly your explanation, this was because Klipper uses run-time configuration of pins, whereas other FW achieve atomic pin setting by using C macros and compile-time configuration.

Well, I just had an idea that may combine the best of both worlds and I wanted to run it by you before I forget. What if I/O operations in Klipper were done like this: [...] The idea here is you can still have run-time configuration of pins and yet have atomic access to the GPIO pins. The downside, of course, is that this function is enormous and is thus non-inlineable, but even with the function call overhead, I suspect it may be quite efficient. I don't know enough Atmel assembly to actually count instructions, but here are some assumptions:

4-7 instructions cycles for function call, mainly pushing address onto stack, jumping to routine and doing the reverse on exit.

1-2 instructions for the switch statement, which would simply be a jump table.

The switch statement is going to be significantly higher on the AVR - the AVR is really slow at these types of things. I'd guess around 20 instructions for it. Also, keep in mind that the sbi/cbi trick only works for some pins - in particular, there is no way to atomically update PORTH-PORTL on the atmega2560.

If you look at the generated code (avr-objdump -d out/klipper.elf | less) and find the code for gpio_out_write() you can count the clock cycles and see it uses 12 instructions (not including call/ret overhead).

One thing to keep in mind, is that the only performance sensitive part of the micro-controller code is stepper_event() (and the timer code paths that lead to it). Everything else is largely irrelevant. So, for the gpio writing case, the only thing that really matters for performance is the cost of the step pulse. In that case, the gpio_out_toggle() is actually more convenient than gpio_out_write(), because using the toggle avoids having to code for whether or not the step pin is trigger on rising edge or falling edge.

Cheers, -Kevin

marciot commented 6 years ago

@KevinOConnor : Fascinating! It does appear that the switch statement is far less efficient than I had presumed. I found a document online that gave the following code:

clr r0
ldi ZL, low(jumpTable) 
ldi ZH, high(jumpTable) // Z now points at jumpTable
add ZL, r16 
adc ZH, r0 // add value of r16 to Z
ijmp ; 2 cycles
. . .
jumpTable:
rjmp isZero ; 2 cycles
rjmp isOne
rjmp isTwo

There is presumably one additional rjmp to get out of the isOne, isTwo... cases, so that brings the count up to 11 cycles for the switch statement. It looks then that this would be exactly a break even with "gpio_out_write" for the I/O registers that allowed a single-cycle sbi/cbi -- and a big penalty for the ones that did not. Doh!

So the problem with the Atmel instruction set is that they provide an indirect jump with a 16-bit absolute base address which exactly the opposite of the indirect jump with an 8-bit relative offset which would have made the switch hyper-efficient. Add to this the fact that all JMP instructions are two cycles and it's pretty much a bust.

the gpio_out_toggle() is actually more convenient than gpio_out_write(), because using the toggle avoids having to code for whether or not the step pin is trigger on rising edge or falling edge.

That's clever, I concur that for SCLK or steps, gpio_out_toggle() is the best solution. The only place I was really troubled by it was in "st7920_xmit" for "sid" where it is necessary to keep track of the last state of the bit. But I now understand there are good reasons why gpio_write_out() is slower and why switching to it would wipe out the small benefit of not having to keep track of the last state. This is certainly one of this cases where a local optimization becomes a global pessimisation.

-- Marcio

marcio-ao commented 6 years ago

@KevinOConnor:

Is it necessary to do button de-bouncing in the software?

Yes, it will be necessary.

Does Marlin (and similar) regularly poll the gpios like the above, or is some other mechanism (like irqs) used?

It polls. Apparently not all controller boards put the encoder on pins that are interrupt capable.

This week I will be focusing on other tasks, so no rush to get this integrated into Klipper. I'll pick up again once you have had a chance to integrate the changes I made to the lcd code.

Thanks!

-- Marcio

KevinOConnor commented 6 years ago

On Fri, Jan 19, 2018 at 11:20:43PM +0000, Marcio Teixeira wrote:

@KevinOConnor: I wanted to send over to you what I have. It already draws the full LCD status screen. Could you maybe look into merging this is and populating the following fields with live data?

Hi Marcio. I got a chance to look through the code you posted. Some comments below.

        self.extruder_count      = 1   # Can be 1 or 2
        self.extruder_1_temp     = 100
        self.extruder_1_target   = 210
        self.extruder_2_temp     = 178
        self.extruder_2_target   = 205
        self.feedrate_percentage = 100
        self.bed_temp            = 0
        self.bed_target          = 110
        self.print_progress      = 0
        self.print_time_hrs      = 0
        self.print_time_min      = 0
        self.fan_percentage      = 90
        self.position_x          = 0
        self.position_y          = 0
        self.position_z          = 0

Most of these are pretty easy to access from the python code. Each module is initialized with a "printer" parameter. This variable stores references to all the other modules. So, for example, to get the extruder temp it would look something like: self.printer.lookup_module('extruder0').get_heater().get_temp()

(Technically, the above would be on the work-probe-20170609 branch, which I hope to merge soon.)

One difficulty, though, is the print_progress/print_time variables. Klipper doesn't have this info - it's only in OctoPrint (or whatever else feeds g-code to Klipper).

Attached are all the files I modified.

Some comments:

the host code can't implement hardware delays. One could think of the mcu as a separate thread with a large queue between it and the main processing thread in the host code. A delay in the host thread likely wont translate to a delay in the mcu thread.
if one really wanted to sleep in the host code (for reasons other than a hardware delay) then the reactor.pause() method is the way to do that.
thinking about it a bit further, I think it's probably best for the lcd command code to just dispatch each lcd cmd and data request directly (no handoff to the task code). With data bursts being the norm and being relatively fast (~10us per byte) the simplicity is likely a bigger win than a minor delay to other tasks. For the occasional command that is received before the previous one finishes, the command handler can just spin until it's ready.
given that commands aren't always tied to data, I'd consider replacing "st7920_burst cmd_then_data=%s" with two commands: "st7920_send_commands cmds=%s" and "st7920_send_data data=%*s".
have you given any thought to an implementation of a "framebuffer" in the python code that checks for differences between two screens and sends the differences to the hardware? Could make the drawing code simpler and easier to reuse with other displays.

Cheers, -Kevin

marcio-ao commented 6 years ago

One difficulty, though, is the print_progress/print_time variables.

Those can certainly be removed. I just put them in to match what Marlin had.

the host code can't implement hardware delays.

Yes, I know. The only reason I put delays in host is to make sure the Python code does not get ahead of the FW and send too much data for the buffers.

thinking about it a bit further, I think it's probably best for the lcd command code to just dispatch each lcd cmd and data request directly

The command bytes still need to be separated by 72u. This is why I was trying to have the task handle it. It appears as if Python can only delay with a granularity of 1ms or so, so my idea is that Python would queue up multiple commands that added up to roughly 1ms, then wait 1ms for them to complete.

have you given any thought to an implementation of a "framebuffer"

Well, I taking heavy advantage of the built-in character generator in the ST7920, so it already fairly optimized. The only thing I draw as bits is the progress bar outline. Everything else is text.

But one problem with the current implementation is that it won't support the other 128x64 displays out there which are bitmap-only displays and don't have the character generation capabilities of the ST7920. However, I was thinking that it would be fairly straightforward to write some Python code to emulate the ST7920 and present bitmap data to devices that require it. This would be the opposite approach that Marlin uses -- they use U8G to treat all devices as bitmap devices, then ignore the special character generation capabilities present in the ST7920.

KevinOConnor commented 6 years ago

On Wed, Jan 24, 2018 at 10:26:35PM +0000, Marcio Teixeira wrote:

One difficulty, though, is the print_progress/print_time variables.

Those can certainly be removed. I just put them in to match what Marlin had.

the host code can't implement hardware delays.

Yes, I know. The only reason I put delays in host is to make sure the Python code does not get ahead of the FW send too much data for the buffers.

That wont work. There's a big queue between the mcu thread and the host thread (potentially with retransmits and other large dynamic latency). Timing has to be done in the mcu, or extreme measures have to be done in the host (like querying the mcu for when its done and then implementing the delay only after the query response is received).

thinking about it a bit further, I think it's probably best for the lcd command code to just dispatch each lcd cmd and data request directly

The command bytes still need to be separated by 72u. This is why I was trying to have the task handle it. It appears as if Python can only delay with a granularity of 1ms or so, so my idea is that Python would queue up multiple commands that added up to roughly 1ms, then wait 1ms for them to complete.

I was thinking just wait the 72us in the command dispatch code. I'll try to prototype it (but probably not until next week).

have you given any thought to an implementation of a "framebuffer"

Well, I taking heavy advantage of the built-in character generator in the ST7920, so it already fairly optimized. The only thing I draw as bits is the progress bar outline. Everything else is text.

Oh, I was thinking use a framebuffer for the text too. That is, a python bytearray(64) could store all text, and updates could be done with something like:

def write_str(x, y, s): pos = y*16 + x self.text_framebuffer[pos:pos+len(s)] = s

and then periodically the code could walk through each position to find the changes - something like:

for i in range(64):
    if self.previous_text_framebuffer[i] != self.text_framebuffer[i]:
        self.send_ddram(i, self.text_framebuffer[i])
self.previous_text_framebuffer = bytearray(self.text_framebuffer)

-Kevin

wizhippo commented 6 years ago

@marcio-ao It looks like you are using the reprap discount full graphic smart display. Is PSB pull high by default to be in 8/4 bit mode and did you mod your board so it uses serial instead?

marcio-ao commented 6 years ago

@wizhippo: Our printers drive the display in serial mode and this is how the Mini Rambo and Rambo are configued. I'm not sure if there are other printers out there that drive it in parallel mode. My code currently only works in serial mode, but could probably be modified for parallel mode if that's what you need.

KevinOConnor commented 6 years ago

FYI, I updated the work-lcd-20180115 branch: cd ~/klipper ; get fetch ; git checkout origin/work-lcd-20180115

This has demo code for reading button presses from the mcu (see the example-extras.cfg file for an example buttons section). I've also pulled in Marcio's python code and tried to merge it with some updated mcu xmit code. The branch is now on top of the probe-20170609 branch, which has some improvements to how the python modules are structured.

Separately, it looks like the "2004" display I bought a few months ago has a similar interface to the "12864" display. I might try to see if I can get some basic output from the 2004 display.

KevinOConnor commented 6 years ago

FYI, I was able to bring up the "2004" display. Demo code available on the work-lcd-20180115 branch.

petriska commented 6 years ago

testing 2004 display, Demo has to be running "#" symbol ?

KevinOConnor commented 6 years ago

On Mon, Jan 29, 2018 at 12:05:32PM -0800, petriska wrote:

testing 2004 display, Demo has to be running "#" symbol ?

Yes - the demo just writes '#' to the screen. The demo code shows that it is possible to write to the screen, it does not write anything useful.

-Kevin

marcio-ao commented 6 years ago

@KevinOConnor: Is the 2004 a character-only display?

I notice one screenshot that shows a graphical element, although it is not clear whether they are using custom character glyphs for this (i.e. CGRAM):

marcio-ao commented 6 years ago

The data sheets for the 2004 are universally bad. Here are a few I found:

https://www.beta-estore.com/download/rk/RK-10290_410.pdf https://cdn-shop.adafruit.com/datasheets/TC2004A-01.pdf

It looks like the best way to decode this display might be to start with the relatively more in-depth ST7920 datasheet (https://www.crystalfontz.com/controllers/Sitronix/ST7920/323/) and try figuring out what happens when you write to the various memory areas.

marcio-ao commented 6 years ago

Ah, I think this might be it. It looks like the controller is SPLC780D1 which is similar to the ST7920, but not exactly:

https://www.crystalfontz.com/controllers/OriseTech/SPLC780D1/417/

KevinOConnor commented 6 years ago

On Tue, Jan 30, 2018 at 05:04:06PM +0000, Marcio Teixeira wrote:

@KevinOConnor: Is the 2004 a character-only display?

I notice one screenshot that shows a graphical element, although it is not clear whether they are using custom character glyphs for this (i.e. CGRAM):

As far as I know, it is text only. But up to 8 characters may be user defined via the CGRAM mechanism.

My display has 5x8 lcd cells that are separated by some blank space on each side, so I don't believe the picture you posted could be done on my display.

On Tue, Jan 30, 2018 at 05:16:01PM +0000, Marcio Teixeira wrote:

The data sheets for the 2004 are universally bad. Here are a few I found:

https://www.beta-estore.com/download/rk/RK-10290_410.pdf https://cdn-shop.adafruit.com/datasheets/TC2004A-01.pdf

It looks like the best way to decode this display might be to start with the relatively more in-depth ST7920 datasheet (https://www.crystalfontz.com/controllers/Sitronix/ST7920/323/) and try figuring out what happens when you write to the various memory areas.

On Tue, Jan 30, 2018 at 09:20:20AM -0800, Marcio Teixeira wrote:

Ah, I think this might be it. It looks like the controller is SPLC780D1 which is similar to the ST7920, but not exactly:

https://www.crystalfontz.com/controllers/OriseTech/SPLC780D1/417/

I used this datasheet when writing the code:

https://www.sparkfun.com/datasheets/LCD/HD44780.pdf

I don't really know if this is the chip or not - I get the impression this is the chip that a bunch of other chips emulate.

A delay on each data write was necessary on my display (only delaying on command writes was not sufficient). I am now able to write arbitrary text to each visible data cell.

-Kevin

rafaljot commented 6 years ago

Nice, but now I think Arduino Uno with LCD like this is smart $20 solution https://botland.com.pl/arduino-shield-klawiatury-i-wyswietlacze/2729-dfrobot-lcd-keypad-shield-wyswietlacz-dla-arduino.html

I ordered one to check what can I do with it and additional UART or i2c.

Klipper3d / klipper

Physical UI support? #2

Send gfx position update

define DELAY_312ns asm("nop\n\tnop\n\tnop\n\tnop\n\tnop\n\t");