Kyle-Kyle / top4grep

find relevant security papers published in the top-4 conferences (S&P, USENIX, CCS, NDSS)

Grep function incorrectly only finds perfect matches now #10

Open mahaloz opened 2 weeks ago

mahaloz commented 2 weeks ago

Since https://github.com/Kyle-Kyle/top4grep/commit/dbf03d20995c4b249471fc0404221ddd6a625d1b, top4grep no longer does partial matches. Here is an example demonstrating it:

top4grep -k dec
[Top4Grep][INFO]08-26 16:13 Grep based on the following keywords: dec
[Top4Grep][DEBUG]08-26 16:13 Found 0 papers

Do it again with the full word:

top4grep -k decompilation
[Top4Grep][INFO]08-26 16:13 Grep based on the following keywords: decompilation
[Top4Grep][DEBUG]08-26 16:13 Found 3 papers
2023: IEEE S&P - Pyfet: Forensically Equivalent Transformation for Python Binary Decompilation.
2015: NDSS     - No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring and Semantic-Preserving Transformations.
2013: USENIX   - Native x86 Decompilation Using Semantics-Preserving Structural Analysis and Iterative Control-Flow Structuring.

The Fix (hack)

I've fixed this by manually reverting the commit, resulting in the following diff:

diff --git a/top4grep/__main__.py b/top4grep/__main__.py
index af5b804..2d99357 100644
--- a/top4grep/__main__.py
+++ b/top4grep/__main__.py
@@ -53,15 +53,8 @@ def grep(keywords, abstract):
         constraints = [Paper.title.contains(x) for x in keywords]
         with Session() as session:
             papers = session.query(Paper).filter(*constraints).all()
-        #check whether whether nltk tokenizer data is downloaded
-        check_and_download_punkt()
-        #tokenize the title and filter out the substring matches
-        filter_paper = []
-        for paper in papers:
-            if all([stemmer.stem(x.lower()) in fuzzy_match(paper.title.lower()) for x in keywords]):
-                filter_paper.append(paper)
     # perform customized sorthing
-    papers = sorted(filter_paper, key=lambda paper: paper.year + CONFERENCES.index(paper.conference)/10, reverse=True)
+    papers = sorted(papers, key=lambda paper: paper.year + CONFERENCES.index(paper.conference)/10, reverse=True)
     return papers

Which fixes the above example:

top4grep -k dec
[Top4Grep][INFO]08-26 16:15 Grep based on the following keywords: dec
[Top4Grep][DEBUG]08-26 16:15 Found 107 papers
2023: USENIX   - Understand Users' Privacy Perception and Decision of V2X Communication in Connected Autonomous Vehicles.
2023: USENIX   - MobileAtlas: Geographically Decoupled Measurements in Cellular Networks for Security and Privacy Research.
2023: USENIX   - VeriZexe: Decentralized Private Computation with Universal Setup.
...

I've not created a PR for this fix because it changes this project's functionality. It's unclear whether the project now intends to match exact strings rather than substrings, since exact matching is what the code I reverted enforces. It's up to you, @Kyle-Kyle.
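For anyone following along, here is a minimal sketch of the two behaviours in plain Python. It is not the project's code: `substring_match` and `token_match` are hypothetical helpers, and a simple regex split stands in for the nltk tokenizer and stemmer used in the lines I reverted.

```python
import re

def substring_match(keyword, title):
    """Old behaviour: the keyword may appear anywhere inside the title."""
    return keyword.lower() in title.lower()

def token_match(keyword, title):
    """Stricter behaviour: the keyword must equal a whole word of the title."""
    # Assumption: words are runs of letters and digits; the real code uses nltk.
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return keyword.lower() in tokens

title = "Native x86 Decompilation Using Semantics-Preserving Structural Analysis"

print(substring_match("dec", title))        # True
print(token_match("dec", title))            # False
print(token_match("decompilation", title))  # True
```

With whole-word matching, `dec` is never a complete token of any title, which is consistent with the "Found 0 papers" output above.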

Kyle-Kyle commented 2 weeks ago

^ just reported the above user and comment. Analyzing the malware atm. Fun

mahaloz commented 2 weeks ago

Same man, same lol.

Kyle-Kyle commented 2 weeks ago

wow. So many opaque predicates. Hopefully it is not something similar to movbfuscation

Kyle-Kyle commented 2 weeks ago

Step 1: a self-implemented switch-statement that will goto 0x411E59

Kyle-Kyle commented 2 weeks ago

oh shit. this is a VM! those numbers are essentially PCs! And v370 is the PC register

mahaloz commented 2 weeks ago

Opaque predicates? VM? Time to get Ashwin in here.

Kyle-Kyle commented 2 weeks ago

just played with it for a bit. Seems boring. After doing some pattern-based patching, the control flow now is pretty flat. And they do this for all functions. Shrug.

Kyle-Kyle commented 2 weeks ago

@mahaloz about the real top4grep issue: dropping partial matches was intentional. Sometimes what we actually want is rust, but a partial match also returns trust, which is not great. Maybe we can add an option to turn partial matching back on.

ghost commented 2 weeks ago

Hey guys, don't click on the above link. I guess my account was compromised. For others' safety I'll delete my comment.

mahaloz commented 2 weeks ago

@Kyle-Kyle it is so funny that you used that example, because that is exactly what just screwed me. I was searching for a Rust paper and kept getting trust... HA. Yeah maybe -e for exact.

However, the question was more about what the default should be. Is it really grep anymore if it does exact-match by default?
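A rough sketch of what such a switch could look like, assuming argparse. The `-e`/`--exact` flag does not exist in top4grep today, `matches`, `build_parser`, and `run` are hypothetical names, and the two titles are made-up placeholders rather than real papers. The sketch defaults to partial matching only because the flag is opt-in; which default is right is still the open question.

```python
import argparse
import re

def matches(keyword, title, exact=False):
    """Return True if `keyword` matches `title`.

    exact=False: substring match, so "rust" also hits "trust".
    exact=True:  whole-word match only.
    """
    keyword, title = keyword.lower(), title.lower()
    if exact:
        return keyword in re.findall(r"[a-z0-9]+", title)
    return keyword in title

def build_parser():
    parser = argparse.ArgumentParser(prog="top4grep-sketch")
    parser.add_argument("-k", dest="keyword", required=True)
    # Hypothetical flag, not part of the current CLI.
    parser.add_argument("-e", "--exact", action="store_true",
                        help="match whole words only")
    return parser

# Made-up placeholder titles, for illustration only.
titles = ["A Rust Memory Model", "Trust Anchors in the Web PKI"]

def run(argv):
    args = build_parser().parse_args(argv)
    return [t for t in titles if matches(args.keyword, t, exact=args.exact)]

print(run(["-k", "rust"]))        # both titles match
print(run(["-k", "rust", "-e"]))  # only the Rust title matches
```

Flipping the default would only mean replacing `-e` with an opt-in flag for partial matches; the matching function stays the same.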

Kyle-Kyle commented 2 weeks ago

so, it is a dropper. it will create a d3d9x.dll during runtime, load it as a library and then jump to a shellcode, which I assume will use stuff from it. This is boring stuff. I expected malware to be harder to reverse than this. I'm not even a reverser. SMH. BTW, the main logic looks like this:

    sub_41CC80((int)v81);
    sub_40A8A0((int)"ntdll.dll");
    sub_40A8A0((int)"kernel32.dll");
    v146 = v81 + 8;
    sub_40C7B0(v81 + 8);
    v147 = sub_4098D0(*((struct _SECURITY_ATTRIBUTES **)v81 + 1), v81 + 8) & 1;
    if ( (v147 & 1) == 0 )
        return 1;
    v148 = v81 + 8;
    FileA = CreateFileA(v81 + 8, 0x80000000, 1u, 0, 3u, 0x80u, 0);
    *v82 = FileA;
    if ( *v82 == (HANDLE)-1 )
        return 1;
    FileSize = GetFileSize(*v82, 0);
    *v83 = FileSize;
    v4 = sub_4274DE(*v83);
    *v84 = v4;
    if ( ReadFile(*v82, *v84, *v83, lpNumberOfBytesRead, 0) )
    {
        sub_40C210(v81 + 8);
        CloseHandle(*v82);
        sub_40BF00(*v84, *v83);
        sub_40C7B0(v81 + 268);
        v5 = CreateFileA(v81 + 268, 0x40000000u, 0, 0, 2u, 0x80u, 0);
        *v86 = v5;
        if ( *v86 == (HANDLE)-1 )
        {
            v153 = *v84;
            if ( v153 )
                j_j__free(v153);
            return 1;
        }
        else
        {
            nNumberOfBytesToWrite = *v83;
            if ( WriteFile(*v86, *v84, nNumberOfBytesToWrite, lpNumberOfBytesWritten, 0) )
            {
                sub_40C210(v81 + 268);
                CloseHandle(*v86);
                v156 = *v84;
                v157 = v156 == 0;
                if ( v156 )
                    j_j__free(v156);
                LibraryA = LoadLibraryA(v81 + 268);
                *v88 = LibraryA;
                if ( *v88 == 0 )
                    return 0;
                sc = sub_40CD90(*v88, "ExitGame");
                *v89 = sc;
                if ( !*v89 )
                    return 0;
                *v90 = *v89;
                ((void (__cdecl *)(_BYTE *********, _BYTE ********, _BYTE *******, _BYTE ******, _BYTE *****, _BYTE ****, _BYTE ***, _BYTE **, _BYTE *))*v90)(
                    v8,
                    v9,
                    v10,
                    v11,
                    v12,
                    v13,
                    v14,
                    v15,
                    v16);
                hHandle = *v88;
                WaitForSingleObject(hHandle, 0xFFFFFFFF);
                CloseHandle(*v88);
                return 0;
            }
            else
            {
                CloseHandle(*v86);
                v155 = *v84;
                if ( v155 )
                    j_j__free(v155);
                return 1;
            }
        }
    }
    else
    {
        CloseHandle(*v82);
        Block = *v84;
        v152 = Block == 0;
        if ( Block )
            j_j__free(Block);
        return 1;
    }

DeviRule commented 2 weeks ago

@mahaloz I got annoyed by the rust example as well and tried to fix it in that PR. Also commit 6c94a0 should let you do some naive fuzzy match. For instance, if you search for 'patch,' it will return results that include both 'patch' and 'patching.'
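Roughly, the idea is to compare word stems instead of raw substrings. The sketch below is only an illustration: `naive_stem` is a toy suffix stripper standing in for the nltk stemmer the project uses, `fuzzy_title_match` is a hypothetical name, and the titles are made up.

```python
import re

SUFFIXES = ("ing", "es", "ed", "s")  # toy list, not a real stemming algorithm

def naive_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def fuzzy_title_match(keyword, title):
    """True if the stemmed keyword equals the stem of any word in the title."""
    stems = {naive_stem(w) for w in re.findall(r"[a-z0-9]+", title.lower())}
    return naive_stem(keyword.lower()) in stems

print(fuzzy_title_match("patch", "Automated Patching of Kernel Bugs"))  # True
print(fuzzy_title_match("patch", "A Patch for Every Bug"))              # True
print(fuzzy_title_match("rust", "Trust Anchors in the Web PKI"))        # False
```

Because the comparison is stem against stem, 'patch' finds 'patching', while 'rust' still does not find 'trust'. A prefix like 'dec' is not the stem of any word either, so it finds nothing.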