find relevant security papers published in the top-4 conferences (S&P, USENIX, CCS, NDSS)
Grep function incorrectly only finds perfect matches now #10

Open mahaloz opened 2 weeks ago

mahaloz commented 2 weeks ago

Since the commit reverted below, top4grep no longer does partial matches. Here is an example demonstrating it:

top4grep -k dec
[Top4Grep][INFO]08-26 16:13 Grep based on the following keywords: dec
[Top4Grep][DEBUG]08-26 16:13 Found 0 papers

Do it again with the full word:

top4grep -k decompilation
[Top4Grep][INFO]08-26 16:13 Grep based on the following keywords: decompilation
[Top4Grep][DEBUG]08-26 16:13 Found 3 papers
2023: IEEE S&P - Pyfet: Forensically Equivalent Transformation for Python Binary Decompilation.
2015: NDSS     - No More Gotos: Decompilation Using Pattern-Independent Control-Flow Structuring and Semantic-Preserving Transformations.
2013: USENIX   - Native x86 Decompilation Using Semantics-Preserving Structural Analysis and Iterative Control-Flow Structuring.

The Fix (hack)

I've fixed this by manually reverting the commit, resulting in the following diff:

diff --git a/top4grep/ b/top4grep/
index af5b804..2d99357 100644
--- a/top4grep/
+++ b/top4grep/
@@ -53,15 +53,8 @@ def grep(keywords, abstract):
         constraints = [Paper.title.contains(x) for x in keywords]
         with Session() as session:
             papers = session.query(Paper).filter(*constraints).all()
-        #check whether whether nltk tokenizer data is downloaded
-        check_and_download_punkt()
-        #tokenize the title and filter out the substring matches
-        filter_paper = []
-        for paper in papers:
-            if all([stemmer.stem(x.lower()) in fuzzy_match(paper.title.lower()) for x in keywords]):
-                filter_paper.append(paper)
     # perform customized sorthing
-    papers = sorted(filter_paper, key=lambda paper: paper.year + CONFERENCES.index(paper.conference)/10, reverse=True)
+    papers = sorted(papers, key=lambda paper: paper.year + CONFERENCES.index(paper.conference)/10, reverse=True)
     return papers

This fixes the above example:

top4grep -k dec
[Top4Grep][INFO]08-26 16:15 Grep based on the following keywords: dec
[Top4Grep][DEBUG]08-26 16:15 Found 107 papers
2023: USENIX   - Understand Users' Privacy Perception and Decision of V2X Communication in Connected Autonomous Vehicles.
2023: USENIX   - MobileAtlas: Geographically Decoupled Measurements in Cellular Networks for Security and Privacy Research.
2023: USENIX   - VeriZexe: Decentralized Private Computation with Universal Setup.

I haven't created a PR for this fix because it changes the project's functionality. It's unclear whether the project now intends to match exact words rather than substrings, which is what the code I reverted does. It's up to you, @Kyle-Kyle.
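
For reference, the difference between the two behaviours can be sketched in plain Python. This is only an illustration, not the project's code: `substring_match` and `whole_word_match` are hypothetical helpers, the regex tokenizer stands in for the NLTK tokenizer, stemming is left out, and lowercasing both sides is an assumption about how case is handled. The two titles are taken from the outputs above.

```python
import re

# Two titles copied from the outputs in this issue.
TITLES = [
    "Pyfet: Forensically Equivalent Transformation for Python Binary Decompilation.",
    "VeriZexe: Decentralized Private Computation with Universal Setup.",
]


def substring_match(keyword, title):
    """Behaviour after the revert: the keyword may sit anywhere, even inside a word."""
    return keyword.lower() in title.lower()


def whole_word_match(keyword, title):
    """Rough approximation of the new behaviour: the keyword must equal a whole token.

    Assumption: a simple regex split replaces the real tokenizer, and no stemming is done.
    """
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return keyword.lower() in tokens


for kw in ("dec", "decompilation"):
    sub = [t for t in TITLES if substring_match(kw, t)]
    word = [t for t in TITLES if whole_word_match(kw, t)]
    print(f"{kw!r}: substring={len(sub)}, whole-word={len(word)}")
```

With these two titles, `dec` matches both under substring matching and neither under whole-word matching, which mirrors the "Found 0 papers" result above.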

Kyle-Kyle commented 2 weeks ago

DeviRule commented 2 weeks ago

@mahaloz I got annoyed by the rust example as well and tried to fix it in that PR. Also, commit 6c94a0 should let you do some naive fuzzy matching. For instance, if you search for 'patch', it will return results that include both 'patch' and 'patching'.
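
A minimal sketch of that kind of stem-based matching, under stated assumptions: `crude_stem` is a toy suffix stripper standing in for a real stemmer, `stem_match` is a hypothetical helper rather than the function in the commit, and the example titles in the usage lines are made up for illustration.

```python
import re

# Assumption: a tiny suffix list, far cruder than a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip at most one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def title_stems(title):
    """Lowercase the title, split it into tokens, and stem each token."""
    return {crude_stem(tok) for tok in re.findall(r"[a-z0-9]+", title.lower())}


def stem_match(keyword, title):
    """True if the stemmed keyword equals the stem of some word in the title."""
    return crude_stem(keyword.lower()) in title_stems(title)


# Made-up titles, for illustration only.
print(stem_match("patch", "Automated Patching of Binaries"))  # 'patching' stems to 'patch'
print(stem_match("patch", "A Patch for Everything"))
print(stem_match("dec", "Binary Decompilation"))  # 'dec' is not the stem of any word here
```

Under this scheme 'patch' finds both 'patch' and 'patching', but a prefix such as 'dec' still finds nothing, since it is not the stem of any word in the title.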