cfry / dde

Dexter Development Environment

Serial encoding issues with binary data. aka the FD problem. #79

Open JamesNewton opened 2 years ago

JamesNewton commented 2 years ago

When communicating with a serial device that needs to receive, or will send, binary data (meaning byte values over 127), you may find that those values get converted to 'fd' for no apparent reason.

The reason is the bane of my existence: UTF-8.

UTF-8 came from the Unicode effort, which was designed to support character sets for languages other than English. English uses only 26 letters, a-z, but there are upper and lower case versions of those, so 52 characters. Of course we also need to be able to present the Arabic numerals 0-9, so 62. And then spaces, punctuation marks, etc... perhaps 90 total printable symbols. But don't forget control characters, and then there are Latin symbols and accent marks (one for each letter that might need one). We end up pretty much filling up 128 characters. That fits in 7 bits, and ASCII (American Standard Code for Information Interchange) was born. This uses only 7 of the basic 8 bits computers work on, so there are actually another 128 codes available, from 128 to 255, which were used for all sorts of tricky things: line drawing, graphics, and so on. And for many years, that was good enough.

But there are other countries and other languages, and they should be supported. Many other languages have a large number of characters, so more than 8 bits are needed. When JavaScript was written, it settled on 16 bits and the UTF-16 set. https://developer.mozilla.org/en-US/docs/Web/API/DOMString/Binary "In JavaScript strings are represented using the UTF-16 character encoding: in this encoding, strings are represented as a sequence of 16-bit (2 byte) units. Every ASCII character fits into the first byte of one of these units, but many other characters don't."

So when you set e.g. var myStr = "ABC", what is actually stored in memory is (in hex) 00 41 00 42 00 43, where 41 is the ASCII code for A (again, in hex). In other words, every character in the string is a 16 bit word where the bottom 8 bits are the value you expected and the top 8 bits are zero. If you need to encode a line of Kanji, or CCCII (a Chinese counterpart to ASCII), no problem, you have those extra 65536 - 256 code points to do it in.
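A minimal sketch of that, runnable in Node (which DDE runs on); myStr is just an illustrative variable name:

var myStr = "ABC"
myStr.charCodeAt(0).toString(16)  // "41" -- each character is a 16 bit code unit holding the ASCII value
Buffer.from(myStr, "utf16le")     // <Buffer 41 00 42 00 43 00> -- the same pairs, stored low byte first (little-endian)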

But now, think about this: you write that string to a file. Then you open that file in a standard text editor. And it has nulls between every letter. Oops. They could just write out only the low byte, and that would work perfectly for ASCII strings. But what about characters beyond ASCII? Hmmm... They could write those out in UTF-16 just as stored in RAM, but then how would a program know which format was used to write the file when you wanted to read it back in?

So someone had the brilliant idea of creating UTF-8. This uses the same single bytes as ASCII for all values between 0 and 127. But characters above 127 get written as multiple bytes: a lead byte whose top bits say how many bytes follow, then one or more continuation bytes. Characters from 0080 to 07FF (hex) take two bytes, and the rest of the 16 bit range takes three. For example, if you want to write the actual character C2 (hex), UTF-8 writes C3 82; when reading the file back, the C3 tells the decoder "the next byte is the rest of a two byte character" and it reassembles the original C2. In this way, every possible 16 bit character always makes the round trip from RAM to file and back correctly, and ASCII data still looks like ASCII... as long as it doesn't use the top bit. And really, ASCII isn't supposed to use that top bit. The flip side: a lone byte like FD (hex) is not a valid UTF-8 sequence at all, so a UTF-8 decoder replaces it with the "replacement character" U+FFFD, which is where the 'fd' in the title comes from.
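Both halves of that can be checked in a couple of lines of Node (plain Buffer calls, nothing DDE-specific):

Buffer.from("\u00C2", "utf8")               // <Buffer c3 82> -- the character C2 becomes two bytes
Buffer.from([0xC3, 0x82]).toString("utf8")  // "Â" -- and those two bytes decode back to the original character
Buffer.from([0xFD]).toString("utf8")        // "\uFFFD" -- a lone high byte is invalid UTF-8, so it becomes the replacement character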

Win-win, done and dusted, right?

Well... no. Because what about binary data? E.g. straight 8 bit data which isn't ASCII, but is still stored in a string? Well, first, you aren't supposed to use Strings for binary data; that's what Buffers are for. But learning to do things with a new datatype? Pffft. DDE sends and receives data over the serial port as Strings, like god intended. And that is probably the right design, because most users will send and receive ASCII data and expect to work with Strings.
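For completeness, the Buffer route would look something like this (a sketch only; node-serialport's write() accepts Buffers as well as Strings, and port_info.port is the same port object used in the example below):

var request = Buffer.from([0x68, 0x04, 0x00, 0x04, 0x08])  //raw bytes, no character encoding involved
port_info.port.write(request)                              //the bytes go out exactly as given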

But what about when you need to send / receive binary data? Well, the npm SerialPort library sets the encoding to UTF-8 by default. Not sure why... for files, sure, but for serial data? Luckily, we can change that: https://nodejs.org/api/buffer.html#buffer_buffers_and_character_encodings

var port_path = "/dev/ttyUSB0" // "COM4" //last(serial_devices()).comName 

var acc_data="" //a variable to accumulate data into. Actually need one for each port. 
var data_pause //the setTimeout ID

function myOnReceiveCallback_low_level(data, port) {
    acc_data += data
    clearTimeout(data_pause)
    data_pause = setTimeout(
        function() {
            //dump whatever accumulated as space-separated hex byte values
            out(acc_data.split('').reduce((s, c) => s + ("0" + c.charCodeAt(0).toString(16)).slice(-2) + " ", ''))
            acc_data = ''
        }
        , 100
        )
}

serial_connect_low_level(port_path
    , {baudRate: 115200}
    , capture_n_items=0
    , item_delimiter=""
    , trim_whitespace = false
    , parse_items=false
    , capture_extras=false
    , callback=myOnReceiveCallback_low_level
    //, error_callback=onReceiveErrorCallback_low_level
    //, open_callback=onOpenCallback_low_level
    //, close_callback=onCloseCallback_low_level
    ) 
var port_info = serial_port_path_to_info_map[port_path]

//Need binary encoding. Node's "binary" is an alias for "latin1": bytes 0-255 map 1:1 to character codes.
port_info.port.setEncoding("binary")

//Make a "binary" string from HEX data
var dmi820get = ""
"68 04 00 04 08".split(" ").forEach(str => {
  dmi820get += String.fromCharCode(parseInt(str,16))
})
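//Sanity check, if you want it: the char codes should match the hex above
//out(dmi820get.split('').map(c => c.charCodeAt(0).toString(16)).join(" "))  //prints "68 4 0 4 8"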

//If we use the built in send it dumps binary data to the Output panel
//serial_send_low_level(ard_path, dmi820get) 
//but we can replace it:

function dmi820_get_data(port_info) {
    if (port_info) {
        //can't rely on the write callback getting called before onReceive,
        //so mostly pretend it isn't called, except in error cases
        port_info.port.write(dmi820get,
            function(error) {
                if (error) {
                    dde_error("In serial_send callback to port_path: " + port_path +
                              " got the error: " + error.message)
                }
                else {
                    //out("serial write just sent: " + dmi820get)
                }
            })
    }
}
dmi820_get_data(port_info)

Long story only slightly longer: this cost me a day. I see absolutely no reason for the encoding of a serial data transfer to be UTF-8. I would propose that in DDE, every serial port opened be set to "binary" encoding via .setEncoding("binary"), as shown in the sample. It shouldn't mess up standard ASCII data (that should be tested), and it will allow us to communicate with binary devices.
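A minimal sketch of the proposed change, assuming DDE keeps opened ports in serial_port_path_to_info_map the way the example above uses it (exactly where in DDE's serial open code this would go is an assumption on my part):

//right after a successful open, force byte-transparent decoding
var info = serial_port_path_to_info_map[port_path]
if (info && info.port) {
    info.port.setEncoding("binary")  //"binary" is Node's alias for "latin1": bytes 0-255 pass through 1:1
}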