afritz1 / OpenTESArena

Open-source re-implementation of The Elder Scrolls: Arena.
MIT License
988 stars 68 forks source link

Re-implement fnmatch.h/.c to keep from needing GPL2 license #171

Closed afritz1 closed 4 years ago

afritz1 commented 4 years ago

It's the only GPL2 code in the codebase, everything else is MIT. I didn't even add it in the first place; it was added when the virtual filesystem was being integrated.

It also messes with GitHub's license identification because of both LICENSE.txt and COPYING.txt.

Replace it with a new C++17 version. I plan on doing this myself soon.

Reasons:

afritz1 commented 4 years ago

Commit 19573105e8a34a9ae2c070d3484a7aa58197bed1.

Digital-Monk commented 4 years ago

Looking at the commit, I worry that there might be some complaint about the new version being copied-and-modified (thus derived) from the original. Can't imagine anyone seriously coming after you for that, but to remove the GPL restrictions, you need to clean-room re-engineer the implementation based only on the documented interface (and I'm assuming you wouldn't have the 'goto' at that point).

Since you're using C++17, how hard would it be to redesign that function in terms of <regex>?

afritz1 commented 4 years ago

Never really used regex before. It's worth a try. I didn't really like how my fnmatch changes ended up but I just wanted to move on and do more interesting things.

I don't know how to redesign it in terms of regex. Any ideas?

Digital-Monk commented 4 years ago

Not a lot of experience myself. Starting out with regex can give a bit of a headache, keeping track of which characters are special and need escaping. But it shouldn't be too bad in this case. It will need a function that translates from fnmatch patterns to ECMAscript style regex. I think that just means:

So, block*.blk would translate into block.*\.blk (or, if you need that last step above, [^block.*\.blk$]).

I'm not sure how heavily fnmatch gets used in your code, and how complete it needs to be. Most people only ever used the ? and * wildcards, and didn't even know that [set of possible chars] was an option. If you know that the sets aren't used in your code, then this gets much easier...

Once you have the translated pattern string:

#include <regex>
...
std::regex pattern(translatedPatternString);
// and then, approximately:
return std::regex_match(str, pattern);

You'll want to use regex_match instead of regex_search so that the entirety of str has to match the pattern (nothing extra in front or behind).

Tutorials I used to get this far:

And that may be more hassle than you care to commit to for a minor thing that's already working. But if it ever bugs you, hopefully those pointers will speed the process...

afritz1 commented 4 years ago

Played around a bit with it tonight but struggled to really get anywhere. I tried using the grep regex constant instead of ECMAScript since that's closer to what fnmatch() uses I think? I'm still too much of a newbie with regular expressions to really steam through this though.

Digital-Monk commented 4 years ago

Experimenting presently. Do you know what patterns are given to fnmatch, or are they generated by other libraries?

Digital-Monk commented 4 years ago

Cheesy, but seems to work for basic test cases (just slap this in a main.cpp and run it to play around):

#include <cstdio>
#include <regex>
#include <list>
#include <string>

using namespace std;

string ECMAize(const string & fnstyle);
void replace(string & s, const string& original, const string& replacement);

int main()
{
    list<string> names
    {
        "bob.txt",
        "block1.blk",
        "fred.txt",
        "fred.dat",
        "assets/media/bing.wav",
        "assets/media/die.wav",
        "assets/media/win.avi",
        "assets/models/human.obj",
        "assets/models/orc.obj"
    };
    char buf[100];

    puts("The 'file system':");
    for (string name : names)
    {
        printf("\t%s\n", name.c_str());
    }

    while(true) {
        puts("\n\nTest pattern: ");
        gets(buf);

        string ecma = ECMAize(buf);
        printf("\tConverted regex : %s\n", ecma.c_str());

        regex matcher(ecma);

        puts("Results:");
        for (string name : names)
        {
            printf("\t%s : %s\n", name.c_str(), regex_match(name, matcher) ? "MATCH" : "no");
        }
    }

    return 0;
}

// fnmatch specializing flags that we need to match:
// (FNM_PATHNAME|FNM_NOESCAPE|FNM_PERIOD)
// from the man page:
//FNM_NOESCAPE
//       If this flag is set, treat backslash as an ordinary character,
//       instead of an escape character.
// (Comment: doable -- we just escape them ourselves before passing to regex)
//
//FNM_PATHNAME
//       If this flag is set, match a slash in string only with a slash
//       in pattern and not by an asterisk (*) or a question mark (?)
//       metacharacter, nor by a bracket expression ([]) containing a
//       slash.
// (Comment: bleah.  * and ? are easy enough, though messy.  Instead of
//              ? -> . and * -> .*
//           we have to do:
//              ? -> [^/] and * -> [^/]*
//           Handling bracket expressions is more irritating though...
//           We'd have to catch every [] set expression and then make
//           sure that it doesn't include /.  Which also means checking
//           any ranges that are present to ensure they don't include /,
//           which is a little weird.  For instance, [!-9] includes a lot
//           of symbols and all the digits, and we'd have to break it
//           into [!-\.0-9].  Again, unless you've got something using
//           [] set matching in your code, it's probably not worth the
//           headache...)
//
//FNM_PERIOD
//       If this flag is set, a leading period in string has to be
//       matched exactly by a period in pattern.  A period is
//       considered to be leading if it is the first character in
//       string, or if both FNM_PATHNAME is set and the period
//       immediately follows a slash.
// (Comment: Hmmm...  Misunderstood this at first, gonna have to
//           think about it more.
//           I keep running around in circles, and I think this would
//           turn into a HUGE headache to actually fix.
//           FORTUNATELY: it apparently doesn't matter, because the
//           parts of fnmatch that were needed by the rewrite never
//           reference '.' at all.  Ignoring it...)
string ECMAize(const string & fnstyle)
{
    string translated = fnstyle;

    // Remember that \ is an escape in regex, but also in
    // string literals, so to get a single \ we need "\\", and
    // to get a doubled \\ we need "\\\\"

    // The order of replacement is important...

    // FNM_NOESCAPE -- we need to handle any backslashes in the
    // original BEFORE we start doing our replacements, because
    // our replacements may need backslash escapes that need to
    // stay escapes.
    replace(translated, "\\", "\\\\");

    // Generic syntax change: fnmatch treats . as a literal,
    // but regex treats it as a single wildcard.  We must
    // escape any .s:
    replace(translated, ".", "\\.");

    // FNM_PATHNAME:  Can't just recognize single chars or
    // "any chars", we have to use a set of all chars EXCEPT
    // the slash.
    replace(translated, "?", "[^/]");
    replace(translated, "*", "[^/]*");

    // Finally, generic syntax change from fnmatch's use of
    // ! to negate a set to regex's use of ^ to negate a set.
    replace(translated, "[!", "[^");

    return translated;
}

void replace(string & s, const string& original, const string& replacement)
{
    size_t start = 0;
    while (true)
    {
        size_t pos = s.find(original, start);
        if (pos == string::npos)
        {
            // no more instances, we're done
            return;
        }

        // Fairly straightforward: chop out original and
        // replace it with replacement.
        s.replace(pos, original.length(), replacement);

        // Less obvious: move the start position forward
        // past the replacement.  If we don't do this,
        // we will go into infinite loop if replacement
        // contains original.
        // Relevant example:
        //      original = "*"
        //      replacement = "[^/]*"
        // If we don't jump past replacement, then our
        // next loop will try to replace the new * and
        // we'll have "[^/][^/]*", etc, etc, etc...
        start = pos + replacement.length();
    }
}
afritz1 commented 4 years ago

I added syntax highlighting to your code. Trying it out soon.

afritz1 commented 4 years ago

I'm comparing results between two fnmatch() implementations, one being the current stripped-down fnmatch() and the other using your regex stuff, and I'm a little confused, because I made the pattern be *.MIF in the hopes it would list only files with the .MIF extension, but it also prints files with .DAT, .VOC, and .TXT, and whatever else. Am I misunderstanding how patterns work?

int fnmatch_regex(const char *pattern, const char *str, int flags)
{
    if ((pattern == nullptr) || (str == nullptr))
    {
        DebugLogError("'pattern' or 'str' was null.");
        return FNM_FAILURE;
    }

    if (flags != 0)
    {
        DebugLogError("'flags' not supported in fnmatch() implementation.");
        return FNM_FAILURE;
    }

    static_cast<void>(flags);

    // Convert pattern to ECMAScript.
    auto patternToECMA = [](const char *pattern)
    {
        std::string translated = pattern;
        translated = String::replace(translated, "\\", "\\\\");
        translated = String::replace(translated, ".", "\\.");
        translated = String::replace(translated, "?", "[^/]");
        translated = String::replace(translated, "*", "[^/]*");
        translated = String::replace(translated, "[!", "[^");
        return translated;
    };

    const std::regex patternECMA(patternToECMA(pattern));
    return std::regex_match(str, patternECMA) ? FNM_SUCCESS : FNM_NOMATCH;
}

String::replace() definition: https://github.com/afritz1/OpenTESArena/blob/cf4198c1d1bd276253df44839d5d53004d9742d2/components/utilities/String.cpp#L231

Digital-Monk commented 4 years ago

Something screwy is going on with ... I compiled my sample at work with the gcc that came with my Qt compiler (don't remember version), and it worked fine. I just now compiled it at home with 7.4.0 and NOTHING matches. Exact patterns don't match (bob.txt doesn't match bob.txt). "Anything at all" patterns don't match (* doesn't match anything). Gotta dig some more...

[edit] Okay, this is rather old, but still, FFS... https://stackoverflow.com/questions/12530406/is-gcc-4-8-or-earlier-buggy-about-regular-expressions

afritz1 commented 4 years ago

The only GCC version I knew of that had regex bugs was 4.9.0.

Digital-Monk commented 4 years ago

Yeah, turned out to be a red herring. So... My g++ at home didn't just deprecate gets(), it removed it. So I switched to fgets(buf,sizeof(buf),stdin), which had a side effect I'd forgotten about -- in includes the newline at the end of input. So my patterns all had a newline in them, so they never matched any of the filenames.

Fixed that D'oh! moment (with the (only slightly) nicer cin >> buf;), and it's working as I would expect again. When I give it *.mif it matches the mif files in my "root" directory and nothing else.

Case sensitivity, by any chance? No, that would make it fail to match things, not match everything...

Hmmm... Can you log the pattern you're giving it, the regex it comes up with, and the filenames it's testing against (just a few)?

Sorry, this is turning into more of a headache for you than I ever intended.

Digital-Monk commented 4 years ago

Test results on this side:

Test pattern: 
*.mif
    Converted regex : [^/]*\.mif
Results:
bob.txt : 0
block1.blk : 0
fred.txt : 0
fred.dat : 0
orc.mif : 1
orc.dat : 0
orc.voc : 0
assets/media/bing.wav : 0
assets/media/die.wav : 0
assets/media/win.avi : 0
assets/models/human.obj : 0
assets/models/orc.obj : 0
Digital-Monk commented 4 years ago

I couldn't see any logical difference between your and my code, but just in case I was missing something subtle, I copied your String::replace code and then your patternToECMA function back into my hacky test code. Got the same results as above (ie, matched orc.mif like it should, matched nothing else). So the rewrite is good.

Now I'm even more confused. Hopefully logging the patterns will make some kind of sense out of this...

afritz1 commented 4 years ago

Case-sensitivity might be an issue.

The 'file system':
        bob.txt
        block1.blk
        bob.mif
        BOB.MIF
        fred.txt
        fred.dat
        assets/media/bing.wav
        assets/media/die.wav
        assets/media/win.avi
        assets/models/human.obj
        assets/models/orc.obj
        assets/models/orc.mif

Test pattern:
*.mif
        Converted regex : [^/]*\.mif
Results:
        bob.txt : no
        block1.blk : no
        bob.mif : MATCH
        BOB.MIF : no
        fred.txt : no
        fred.dat : no
        assets/media/bing.wav : no
        assets/media/die.wav : no
        assets/media/win.avi : no
        assets/models/human.obj : no
        assets/models/orc.obj : no
        assets/models/orc.mif : no
afritz1 commented 4 years ago

I was doing most of my testing with the engine's VFS manager list() function, and that calls fnmatch() behind the scenes, and it was returning hundreds of Arena filenames, both in the ARENA folder and inside GLOBAL.BSA that didn't appear to be matching correctly (i.e., given *.MIF, it would basically match *.MIF/*.VOC/etc.). The comment above is the first one with your code copy-pasted and a couple filenames added.

Digital-Monk commented 4 years ago
const std::regex patternECMA(patternToECMA(pattern), std::regex_constants::icase);

Should fix the case sensitivity issue (tested out OK here)

Digital-Monk commented 4 years ago

In the VFS system, how does it separate path components? Forward or backward slash? Just wondering, because a pattern of * won't traverse / paths, but it will traverse \ paths. Easy enough to fix, I was just taking the fnmatch man page docs literally, since they seemed specific about "slash" vs "backslash"... But then, fnmatch comes from unix-land, and Arena comes from DOS land, so I could see there being some mismatch in intent.

afritz1 commented 4 years ago

I don't even know. I didn't write the VFS code (that's part of the reason this is more complicated than it would be). Looks like it uses forward slashes.

std::vector<std::string> Manager::list(const char *pattern) const
{
    std::vector<std::string> files;

    std::for_each(gRootPaths.rbegin(), gRootPaths.rend(),
        [pattern, &files](const std::string &rootPath)
    {
        Manager::addDir(rootPath + '.', std::string(), pattern, files);
    });

    const std::vector<std::string> &bsaList = gGlobalBsa.list();

    if (pattern == nullptr)
    {
        std::copy(bsaList.begin(), bsaList.end(), std::back_inserter(files));
    }
    else
    {
        std::copy_if(bsaList.begin(), bsaList.end(), std::back_inserter(files),
            [pattern](const std::string &name)
        {
            const char *dirsep = std::strrchr(name.c_str(), '/');
            return fnmatch(pattern, (dirsep != nullptr) ? dirsep : name.c_str(), 0) == 0;
        });
    }

    return files;
}
afritz1 commented 4 years ago

Okay, so if I print out match/no-match directly from my fnmatch_regex() then it only matches *.mif with .mif files, so that's right. I don't know what Manager::list() is doing to get all those weird extras.

Digital-Monk commented 4 years ago

I don't like how they're using dirsep... It points to the last / in the name, so as a string, it will look something like /ORC.MIF, but I would expect it should be one past that, so that it just returned the filename ORC.MIF. Especially because if name doesn't contain any /s, then the name passed to fnmatch will also not have /s. Seems like that would cause name pattern problems, but maybe it's important to separate patterns in (any arbritrarily deep) subdir from patterns in the root.

I notice that Manager::addDir gets called for each root path. That could add a lot of stuff to files before we even get to the fnmatch part.

Also, if somehow pattern is nullptr on entry, it'll add every file. But that would have happened before, so that shouldn't be it.

The only other thing I can think of right now is that regex_match returns 1 on match and 0 on fail, but fnmatch returns 0 on match and 1 on fail (technically regex_match returns a bool, but, yeah, it's 1 and 0). I can't see anywhere that the flip could matter, though. In your fnmatch_regex, you ?: from the boolean to the official return values. And in Manager::list, the "match" condition is "==0" like it should be.

I dunno...

afritz1 commented 4 years ago

I'm going to leave it at commit 2766a1274f12538b1f0ac1a12ab2e8b1221695c3.

At this point I'm just going to assume that if the new fnmatch() is matching correctly then there's something wrong with Manager::list(), but since that code isn't called anywhere then I can close this issue.

I'm gonna assume that Manager::list() never worked in the first place with a non-null pattern. Different problem.