Genivia / ugrep

NEW ugrep 6.5: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more
https://ugrep.com
BSD 3-Clause "New" or "Revised" License

[FR/Q] How do I use `ugrep` as a library in C++? #345

Closed by NightMachinery 7 months ago

NightMachinery commented 7 months ago

I would like to add ugrep as an option for "fuzzy" finding in https://github.com/ahrm/sioyek/issues/567. The code currently does this:

int score = static_cast<int>(rapidfuzz::fuzz::partial_ratio(s1, s2));

return score > 50;

where s1 is the pattern and s2 is the string that I want to check for a match. I would like to do this matching with ugrep --bool --smart-case. Can I do this without calling a subprocess?

PS: Complete code snippet:

bool MySortFilterProxyModel::filterAcceptsRow(int source_row,
    const QModelIndex& source_parent) const
{
    if (FUZZY_SEARCHING) {

        QModelIndex source_index = sourceModel()->index(source_row, this->filterKeyColumn(), source_parent);
        if (source_index.isValid())
        {
            // check current index itself :

            QString key = sourceModel()->data(source_index, filterRole()).toString();
            if (filterString.size() == 0) return true;
            std::wstring s1 = filterString.toStdWString();
            std::wstring s2 = key.toStdWString();
            int score = static_cast<int>(rapidfuzz::fuzz::partial_ratio(s1, s2));

            return score > 50;
        }
        else {
            return false;
        }
    }
    else {
        return QSortFilterProxyModel::filterAcceptsRow(source_row, source_parent);
    }
}

void MySortFilterProxyModel::setFilterCustom(QString filterString) {
    if (FUZZY_SEARCHING) {
        this->filterString = filterString;
        this->setFilterFixedString(filterString);
        sort(0);
    }
    else {
        this->setFilterFixedString(filterString);
    }
}

bool MySortFilterProxyModel::lessThan(const QModelIndex& left,
    const QModelIndex& right) const
{
    if (FUZZY_SEARCHING) {

        QString leftData = sourceModel()->data(left).toString();
        QString rightData = sourceModel()->data(right).toString();

        int left_score = static_cast<int>(rapidfuzz::fuzz::partial_ratio(filterString.toStdWString(), leftData.toStdWString()));
        int right_score = static_cast<int>(rapidfuzz::fuzz::partial_ratio(filterString.toStdWString(), rightData.toStdWString()));
        return left_score > right_score;
    }
    else {
        return QSortFilterProxyModel::lessThan(left, right);
    }
}

NightMachinery commented 7 months ago

BTW, this application would benefit from being able to sort the results so that "better" matches come first. I don't think ugrep can expose such a score, can it?

int left_score = static_cast<int>(rapidfuzz::fuzz::partial_ratio(filterString.toStdWString(), leftData.toStdWString()));
int right_score = static_cast<int>(rapidfuzz::fuzz::partial_ratio(filterString.toStdWString(), rightData.toStdWString()));
return left_score > right_score;

genivia-inc commented 7 months ago

Please take a look at this project: FuzzyMatcher. The fuzzy matcher is also included with RE/flex, which in turn is included in ugrep. For case-insensitive matching and Unicode support, use the regex converter and pass the converted regex string to the fuzzy matcher. There are examples that show how to do that and how to use (?i)PATTERN to make PATTERN match case-insensitively. After a match, the fuzzy matcher's edits() method returns the edit distance of that match.
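To make "edit distance of a match" concrete, here is a minimal, library-free sketch of how such a score could drive both filtering and ranking. It is not the FuzzyMatcher implementation or its API: it uses a plain dynamic-programming approximate substring search instead, handles literal patterns rather than regexes, and folds ASCII case only. The names best_substring_edits, fuzzy_accepts and ranks_before are hypothetical and chosen for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Smallest number of single-character edits (insert, delete, substitute)
// needed to turn `pattern` into some substring of `text`, ignoring ASCII case.
// Assumption: literal patterns only, no regex syntax and no Unicode folding.
std::size_t best_substring_edits(const std::string& pattern, const std::string& text) {
    auto lower = [](char ch) {
        return static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
    };
    const std::size_t m = pattern.size();
    // col[i] = edits to align pattern[0..i) with a substring of text that
    // ends at the current text position.
    std::vector<std::size_t> col(m + 1);
    for (std::size_t i = 0; i <= m; ++i) col[i] = i;
    std::size_t best = col[m];
    for (char tc : text) {
        std::size_t diag = col[0];  // value at (row i-1, previous column)
        col[0] = 0;                 // a match may start at any text position
        for (std::size_t i = 1; i <= m; ++i) {
            const std::size_t prev = col[i];  // value at (row i, previous column)
            const std::size_t cost = lower(pattern[i - 1]) == lower(tc) ? 0 : 1;
            col[i] = std::min({diag + cost, prev + 1, col[i - 1] + 1});
            diag = prev;
        }
        best = std::min(best, col[m]);
    }
    return best;
}

// Filter: accept a candidate if it contains the pattern within max_edits edits.
bool fuzzy_accepts(const std::string& pattern, const std::string& text, std::size_t max_edits) {
    return best_substring_edits(pattern, text) <= max_edits;
}

// Ranking: a candidate with fewer edits sorts before one with more.
bool ranks_before(const std::string& pattern, const std::string& left, const std::string& right) {
    return best_substring_edits(pattern, left) < best_substring_edits(pattern, right);
}
```

In the Qt code above, fuzzy_accepts would play the role of the `score > 50` test in filterAcceptsRow and ranks_before the role of the comparison in lessThan, with the difference that a lower edit count means a better match.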

NightMachinery commented 7 months ago

@genivia-inc Do these libraries support --bool syntax?

genivia-inc commented 7 months ago

No. The --bool syntax is layered on top: ugrep normalizes the query to conjunctive normal form (CNF) in ugrep/src/cnf.cpp and uses multiple matchers, one for each term in the CNF. E.g. a|(b c) is normalized to (a|b) (a|c), for which two (fuzzy) matchers are created, one for pattern a|b and one for pattern a|c. Both patterns must match (on the same line, or anywhere in a file with --files).
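The layering described above can be sketched in a few lines. This is a rough illustration only, not the code in ugrep/src/cnf.cpp: it uses std::regex in place of ugrep's matchers, always matches case-insensitively rather than implementing --smart-case, and leaves out query parsing and NOT. The names Clause, CNF, term, cnf_and, cnf_or, clause_regex and line_matches are hypothetical.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// A clause is a list of alternative patterns (OR);
// a CNF query is a list of clauses that must all match (AND).
using Clause = std::vector<std::string>;
using CNF = std::vector<Clause>;

// A single search term as a one-clause, one-alternative CNF.
CNF term(const std::string& pattern) {
    return CNF{Clause{pattern}};
}

// AND of two CNFs: concatenate their clauses.
CNF cnf_and(const CNF& x, const CNF& y) {
    CNF out = x;
    out.insert(out.end(), y.begin(), y.end());
    return out;
}

// OR of two CNFs: distribute OR over AND by merging
// every clause of x with every clause of y.
CNF cnf_or(const CNF& x, const CNF& y) {
    CNF out;
    for (const Clause& cx : x) {
        for (const Clause& cy : y) {
            Clause merged = cx;
            merged.insert(merged.end(), cy.begin(), cy.end());
            out.push_back(merged);
        }
    }
    return out;
}

// Join a clause's alternatives into one regex, e.g. {"a", "b"} -> "a|b".
std::string clause_regex(const Clause& clause) {
    std::string re;
    for (std::size_t i = 0; i < clause.size(); ++i) {
        if (i != 0) re += '|';
        re += clause[i];
    }
    return re;
}

// One matcher per clause; the line is accepted only if every clause matches.
// Assumption: always case-insensitive; the regex is rebuilt on each call for brevity.
bool line_matches(const CNF& query, const std::string& line) {
    for (const Clause& clause : query) {
        const std::regex re(clause_regex(clause),
                            std::regex::ECMAScript | std::regex::icase);
        if (!std::regex_search(line, re)) return false;
    }
    return true;
}
```

With these helpers, cnf_or(term("a"), cnf_and(term("b"), term("c"))) yields the two clauses a|b and a|c from the example above; a line containing only b is rejected because the second clause has no match. A real integration would build each matcher once per query rather than once per line.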