argtable / argtable3

A single-file, ANSI C, command-line parsing library that parses GNU-style command-line options.
http://www.argtable.org
Other
372 stars 65 forks source link

Typo concerning Unicode functions #26

Closed computerquip-streamlabs closed 5 years ago

computerquip-streamlabs commented 5 years ago

While debugging some warnings given by argtable, I found that none of the wide character functions were used. I found this: https://github.com/argtable/argtable3/blob/master/argtable3.c#L2977

When I corrected it, the library no longer works unfortunately. I'm currently preparing a patch but in particular, the privhdr structure uses a hard-coded type of const char* which is then passed to a function that takes TRex. If _UNICODE is defined, that's defined as unsigned short.

Looking at the original T-Rex implementation, it's been like that since the beginning, as seen here: https://github.com/kimperator/T-Rex/blob/master/trex.c#L8

surreylabs commented 5 years ago

@computerquip-streamlabs When dealing with Unicode, I only plan to support the UTF-8 encoding, because it is compatible with most libraries. May I know if UTF-8 is working for you?

computerquip-streamlabs commented 5 years ago

It does appear to. That said, the warnings given are valid. These would be incorrect in that case since they assume a 2-byte wchar_t type:

#define TRexChar unsigned short
#define MAX_CHAR 0xFFFF
#define _TREXC(c) L##c
#define trex_strlen wcslen
#define trex_printf wprintf

All functions that use TRexChar are then apt to some weird behavior since it's treated as a 2-byte string rather than a 1-byte string.

I need to try some more complicated UTF-8 codepoints as well that actually require more than one byte a codepoint... this is kinda worrying if I'm honest.

computerquip-streamlabs commented 5 years ago

It seems the trex code is only used for regex-based argument matching (and isn't used in general). I've just made sure none of my code uses that as I'm sure there's edge cases there. As a result though, I'm not affected and those warnings are little more than a nuisance now.

The original Trex code was meant for just ASCII and then Unicode support through wide-character support. That code would probably need to be modified to make sure it works with UTF-8 though I'm not sure. Some distinctive testing for UTF-8 with the regex argument parsing should probably be done.

tomghuang commented 5 years ago

@computerquip-streamlabs I've removed the UNICODE support from the embedded TRex code. I've also fixed some Unicode-related issues so that even when you compile with UNICODE definition in Visual Studio, the library should be compiled successfully.