Open adishavit opened 7 years ago
Windows programs have their arguments stored in the Process Environment Block (PEB) structure, more specifically in ProcessParameters->CommandLine, which is a UNICODE_STRING. The GetCommandLineW() function gives direct access to the UNICODE_STRING buffer. This is true for both Windows GUI and console programs. The buffer contains the command line arguments as a single wide string, exactly as entered, for example like this:
"path_to_program" -arg1 /f -abc file1 file2
The C library startup code parses the string, then allocates and assigns argc and argv before calling main.
So, wouldn't it be easier to pass the unaltered command line string as is to argh? Splitting the string into arguments using space as delimiter would be straightforward, I think.
GetCommandLineA() will give a copy of the string converted to the system ANSI codepage. This means that, even if you develop a UTF8 console program, the most efficient approach would be to parse the arguments as UTF16, retrieved by GetCommandLineW(), and ignore main's parameters.
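To make the splitting idea concrete, here is a minimal sketch of a splitter for a raw wide command line. The name split_command_line is hypothetical and not part of argh. It separates on spaces and tabs and keeps double-quoted runs together, since a plain split on spaces would break a quoted program path that contains spaces. It is a deliberate simplification: backslash escapes are not interpreted, so its result can differ from what the C runtime produces. On Windows, CommandLineToArgvW() is the system function that applies the full quoting rules.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not part of argh): split a raw command line into
// arguments on spaces and tabs, treating a double-quoted run as one piece.
// Simplification: backslash escapes such as \" are NOT interpreted.
std::vector<std::wstring> split_command_line(const std::wstring& cmd) {
    std::vector<std::wstring> args;
    std::wstring current;
    bool in_quotes = false;
    bool have_token = false;  // true once a token has started, even an empty ""

    for (wchar_t ch : cmd) {
        if (ch == L'"') {
            in_quotes = !in_quotes;  // quotes delimit but are not copied
            have_token = true;
        } else if (!in_quotes && (ch == L' ' || ch == L'\t')) {
            if (have_token) {        // a separator ends the current token
                args.push_back(current);
                current.clear();
                have_token = false;
            }
        } else {
            current += ch;
            have_token = true;
        }
    }
    if (have_token) args.push_back(current);  // flush the last token
    return args;
}
```

On Windows the input would come from GetCommandLineW(), for example split_command_line(GetCommandLineW()). The resulting wide strings would still need converting if the parser only accepts narrow strings.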
Thanks for your enthusiasm for Argh! A few comments about this.
add_param(). I guess the ctor would have to be written such that a later call to parse() would just overwrite anything done before.
I don't have a lot of experience with parsing Unicode in general, nor with Unicode variants on Windows in particular, so it is hard for me to comment. Would you like to make a pull request?
Thanks for answering. Just FYI, getting direct access to the command line buffer in the PEB via GetCommandLineW() is the usual way, at least when it comes to Windows GUI programs. It is possible to manipulate the string and even overwrite it without any consequences for the program. After all, it is just a string. My intention was more of a fancy: to mention an idea without any concrete method for implementing it in argh. None of the available option parsers seems to have considered this possibility, I guess because the subject is very much UNIX centred.
I’m not a Unicode expert but I know the way Windows handles Unicode is messed up. I took the liberty to consult some of the more knowledgeable Unicode/C++ experts on Twitter (where else 😆). The lively discussion is here.
Seems like the only sane and portable thing to do is UTF8 only. As @cor3ntin says:
I would immediately convert each argument to utf8 with WideCharToMultiByte, keep the rest of the code as it & assume utf8 ... DO NOT let wchar_t invade your project. Smallest possible conversion layer.
Essentially converting anything to UTF8 as shown here:
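#include <windows.h>
#include <algorithm>
#include <cwchar>    // wcslen
#include <iterator>  // std::back_inserter
#include <string>
#include <vector>

// Convert a null-terminated UTF-16 string to UTF-8.
std::string convert(const wchar_t* wstr) {
    const int len = (int)wcslen(wstr);
    if (len == 0) return {};  // WideCharToMultiByte fails on zero-length input
    // First call: ask how many bytes the UTF-8 result needs.
    const int s = WideCharToMultiByte(CP_UTF8, 0, wstr, len, NULL, 0, NULL, NULL);
    std::string str(s, '\0');
    // Second call: write the UTF-8 bytes into the string's buffer.
    WideCharToMultiByte(CP_UTF8, 0, wstr, len, &str[0], s, NULL, NULL);
    return str;
}

int wmain(int argc, wchar_t** argv) {
    std::vector<std::string> vec;
    vec.reserve(argc);
    std::transform(argv, argv + argc, std::back_inserter(vec), convert);
    // parse(vec)
}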
If you need the data in some other encoding (e.g. for passing to Win32 components), do it on the other end.
Thank you for your effort. However, your answer doesn't address the subject. My point was about using the PEB command line buffer for option parsing. As I said, this could either be a direct wide UTF16 string or a copied and converted narrow ANSI string, and I strongly prefer parsing the wide buffer directly. I also made a suggestion on how this could be achieved (see below). I'm sure your Twitter friends are all skilled people with valid points in their arguments, but how do they apply to this subject?
I'm reluctant to engage in heated UTF8 vs UTF16 vs UTF32 discussions, but in practice, Unicode on Windows is UTF16 and has been so for about 30 years. It's not a matter of preference or flavour, it's just how things are. The whole underlying system is coded in UTF16. Also in practice, coding with UTF8 is more or less unsupported. You deal with it when handling data, as in manipulating a webpage for example, but the API expects UTF16 when called. It's all due to the UTF16 'W' API variants vs the legacy 'A' (as in 'ANSI') API variants. Mixing narrow UTF8 and ANSI is likely to be troublesome. As I understand it, this is about to change when it comes to Windows UWP apps, where UTF8 is encouraged to facilitate web-centric code. Anyway, none of this should be a bother for you in this library. Especially if you consider my suggestion in the thread
On Windows there is an extension to get the args automatically without passing them in, as in:
Things to note:
__argv et al. perform wildcard expansion, which may or may not be desirable.
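For readers unfamiliar with the extension, the following is a rough sketch of how the Microsoft CRT globals __argc and __argv (declared in <stdlib.h>, with __wargv as the wide counterpart) could be read without going through main's parameters. It is not the snippet originally posted in this thread. The helper names collect_args and collect_process_args are made up for illustration, and the Microsoft-specific part is guarded so the rest still compiles elsewhere.

```cpp
#include <cassert>
#include <string>
#include <vector>
#ifdef _MSC_VER
#include <stdlib.h>  // Microsoft CRT: declares __argc, __argv and __wargv
#endif

// Hypothetical helper: copy an argc/argv pair into a vector of strings.
std::vector<std::string> collect_args(int argc, const char* const* argv) {
    assert(argc >= 0);
    return std::vector<std::string>(argv, argv + argc);
}

#ifdef _MSC_VER
// Microsoft-specific: read the CRT globals instead of main's parameters.
// Guard against a null __argv, since the narrow array may not be populated
// when the program uses a wide entry point.
std::vector<std::string> collect_process_args() {
    if (__argv == nullptr) return {};
    return collect_args(__argc, __argv);
}
#endif
```

With something like this, a parser could be constructed anywhere in the program from collect_process_args(), at the cost of being tied to the Microsoft runtime.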