adishavit / argh

Argh! A minimalist argument handler.
BSD 3-Clause "New" or "Revised" License

Windows only: Consider using __argc and __argv #15

Open adishavit opened 7 years ago

adishavit commented 7 years ago

On Windows there is an extension to get the args automatically without passing them in, as in:

int main()
{
   argh::parser cmdl;
   // ...
}

Things to note:

BeErikk commented 4 years ago

Windows programs have their arguments stored in the Process Environment Block (PEB) structure, more specifically in ProcessParameters->CommandLine, which is a UNICODE_STRING. The GetCommandLineW() function gives direct access to the UNICODE_STRING buffer. This is true for both Windows GUI and console programs. The buffer contains the command line as a single wide string, exactly as entered, for example:

"path_to_program" -arg1 /f -abc file1 file2

The C library startup code parses this string, then allocates and assigns argc and argv before calling main.

So, wouldn't it be easier to pass the unaltered command line string as is to argh? Splitting the string into arguments using space as delimiter would be straightforward, I think.
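A minimal sketch of such a split, in portable C++ so it can be tried without the Windows headers. Note that a plain split on spaces would break the quoted program path in the example above, so the sketch keeps a double-quoted run together as one argument. It is deliberately simplified: the backslash-escape rules used by the real startup code are not handled, and `split_command_line` is a hypothetical name, not part of argh.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of argh): split a raw command line into
// arguments on spaces/tabs, treating a double-quoted run as one argument.
// Simplified on purpose: backslash escapes are NOT handled.
std::vector<std::wstring> split_command_line(const std::wstring& line) {
    std::vector<std::wstring> args;
    std::wstring current;
    bool in_quotes = false;
    bool has_token = false;  // true once the current argument has started
    for (wchar_t ch : line) {
        if (ch == L'"') {
            in_quotes = !in_quotes;  // quotes delimit, they are not copied
            has_token = true;        // so "" still yields an empty argument
        } else if (!in_quotes && (ch == L' ' || ch == L'\t')) {
            if (has_token) {         // whitespace outside quotes ends an argument
                args.push_back(current);
                current.clear();
                has_token = false;
            }
        } else {
            current += ch;
            has_token = true;
        }
    }
    if (has_token) args.push_back(current);  // flush the last argument
    return args;
}
```

On Windows the input would presumably come from GetCommandLineW(); the result would then still need converting or adapting before it reaches a narrow-string parser.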

GetCommandLineA() returns a copy of the string converted to the system ANSI code page. So even if you develop a UTF8 console program, the most efficient approach would be to parse the arguments as UTF16, retrieved with GetCommandLineW(), and ignore main's parameters.

adishavit commented 4 years ago

Thanks for your enthusiasm for Argh! A few comments about this.

  1. Argh! tries to be standard C++ conforming and cross platform. It may be possible to add extra Windows functionality, but it must not alter the standard API.
  2. Pulling the command line "out of the air" via the default ctor could be an issue if the user then uses other pre-parsing methods like add_param(). I guess the ctor would have to be written such that a later call to parse() would just overwrite anything done before.
  3. Unicode support is problematic, confusing and inconsistent in C++ and on Windows specifically. There are standardization efforts for both better Unicode support and more modern ways to get the command line arguments. Maybe we will have a C++23 branch for argh that will support these when they portably arrive.

I don't have a lot of experience with parsing Unicode in general, nor with Unicode variants on Windows in particular, so it is hard for me to comment. Would you like to make a pull request?

BeErikk commented 4 years ago

Thanks for answering. Just FYI, getting direct access to the command line buffer in the PEB via GetCommandLineW() is the usual way, at least when it comes to Windows GUI programs. It is possible to manipulate the string and even overwrite it without any consequences for the program. After all, it is just a string. My intention was more of a fancy: to mention an idea without any concrete method for implementing it in argh. None of the option parser solutions available have considered this possibility, I guess because the subject is very much UNIX centred.

adishavit commented 4 years ago

I’m not a Unicode expert but I know the way Windows handles Unicode is messed up. I took the liberty to consult some of the more knowledgeable Unicode/C++ experts on Twitter (where else 😆). The lively discussion is here.

adishavit commented 4 years ago

Seems like the only sane and portable thing to do is UTF8 only. As @cor3ntin says:

I would immediately convert each argument to utf8 with WideCharToMultiByte, keep the rest of the code as is & assume utf8 ... DO NOT let wchar_t invade your project. Smallest possible conversion layer.

Essentially converting anything to UTF8 as shown here:

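#include <windows.h>
#include <algorithm>
#include <cwchar>
#include <iterator>
#include <string>
#include <vector>

// Convert a null-terminated UTF-16 string to UTF-8.
std::string convert(const wchar_t* wstr) {
    const int len = (int)wcslen(wstr);
    if (len == 0) return {};  // WideCharToMultiByte fails on zero-length input
    // First call: ask how many bytes the UTF-8 result needs.
    const int s = WideCharToMultiByte(CP_UTF8, 0, wstr, len, NULL, 0, NULL, NULL);
    std::string str(s, '\0');
    // Second call: write the UTF-8 bytes into the string's buffer.
    WideCharToMultiByte(CP_UTF8, 0, wstr, len, &str[0], s, NULL, NULL);
    return str;
}

int wmain(int argc, wchar_t** argv) {
    std::vector<std::string> vec;
    vec.reserve(argc);
    // Convert every wide argument up front; the rest of the code sees only UTF-8.
    std::transform(argv, argv + argc, std::back_inserter(vec), convert);
    // parse(vec)
    return 0;
}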

If you need the data in some other encoding (e.g. for passing to WIN32 components) do it on the other end.
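The `// parse(vec)` placeholder in the snippet above still has to hand the UTF-8 strings to a parser. As far as I know argh's entry points take an `argc`/`argv`-style array rather than a vector of strings, so one option is to build an array of pointers into the vector. This is only a sketch under that assumption; `make_argv` is a hypothetical helper, not part of argh.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of argh): build an argv-style array of
// pointers into `args`. The pointers are valid only while `args` is alive
// and unmodified. The trailing nullptr mirrors the argv[argc] convention.
std::vector<const char*> make_argv(const std::vector<std::string>& args) {
    std::vector<const char*> argv;
    argv.reserve(args.size() + 1);
    for (const std::string& a : args) {
        argv.push_back(a.c_str());
    }
    argv.push_back(nullptr);
    return argv;
}
```

With the vector from wmain this would be used roughly as `auto ptrs = make_argv(vec);` followed by something like `cmdl.parse(static_cast<int>(vec.size()), ptrs.data());`, assuming a `parse(int, const char* const[])` overload is available.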

BeErikk commented 4 years ago

Thank you for your effort. However, your answer doesn't address the subject. My point was about using the PEB command line buffer for option parsing. As I said, this could be either the direct wide UTF16 string or a copied and converted narrow ANSI string. I have a strong preference for parsing the wide buffer directly. I also made a suggestion on how this could be achieved (see below). I'm sure your Twitter friends are all skilled people with valid points in their arguments, but where do they apply to this subject?

I'm reluctant to engage in flaming UTF8 vs UTF16 vs UTF32 discussions, but in practice, on Windows Unicode is UTF16 and has been so for about 30 years. It's not a matter of preference or flavour, it's just how things are. The whole underlying system is coded in UTF16. Also in practice, coding with UTF8 is more or less unsupported. You deal with it when handling data, as in manipulating a webpage for example, but the API expects UTF16 when called. It's all due to the UTF16 'W' API variants vs the legacy 'A' (as in 'ANSI') API variants. Mixing narrow UTF8 and ANSI is likely to be troublesome. As I understand it, this is about to change for Windows UWP apps, where UTF8 is encouraged to facilitate web-centric code. Anyway, none of this should be a bother for you in this library, especially if you consider my suggestion in the thread

https://github.com/adishavit/argh/issues/8