bellingcat / EDGAR

Tool for the retrieval of corporate and financial data from the SEC
https://colab.research.google.com/github/bellingcat/EDGAR/blob/main/notebook/Bellingcat_EDGAR_Tool.ipynb
GNU General Public License v3.0
128 stars 15 forks

Enable searching for a mixture of exact and inexact keywords #13

GalenReich closed 7 months ago

GalenReich commented 7 months ago

Currently, keyword search is either exact (all keywords must match, in the given order) or inexact (keywords may match in any order).

# Join search keywords into a single string
keywords = " ".join(keywords)
keywords = f'"{keywords}"' if exact_search else keywords

It would be good if searches could use a mix of exact and inexact keyword matches:

e.g. "John Doe" Pharmaceuticals
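One way to picture the requested behaviour is a helper that quotes only the multi-word keywords and leaves single words unquoted. This is a minimal sketch, not code from the repository: `build_query` is a hypothetical name, and treating any keyword that contains whitespace as an exact phrase is an assumption made for illustration.

```python
def build_query(keywords):
    """Join keywords into one search string (hypothetical helper).

    Assumption: a keyword containing whitespace is meant as an exact
    phrase and gets wrapped in double quotes; single words stay inexact.
    """
    parts = []
    for kw in keywords:
        kw = kw.strip()
        if not kw:
            continue  # skip empty keywords
        already_quoted = kw.startswith('"') and kw.endswith('"')
        if any(ch.isspace() for ch in kw) and not already_quoted:
            kw = f'"{kw}"'
        parts.append(kw)
    return " ".join(parts)


# One exact phrase plus one inexact keyword
query = build_query(["John Doe", "Pharmaceuticals"])
print(query)
```

With this approach no separate flag is needed: the caller decides per keyword whether the match is exact by passing a phrase or a single word.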

wenlambdar commented 7 months ago

Great news: I just tested it, and it already works with the current implementation:

python3 main.py text_search \"John Doe\" Pharmaceuticals -o "results_test.csv" --min_wait 3.0 --max_wait 4.0 -r 3 -b "chrome"

This results in "John Doe" Pharmaceuticals being entered on the search page, so it's just a matter of escaping the quotes. That being said, should we consider removing the exact_search flag? We don't need it if we properly document this way of performing an exact search.
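To see why escaping the quotes is enough, it can help to look at what a POSIX shell hands to the program. The sketch below uses Python's standard `shlex.split` to mimic shell tokenisation. Two things are assumptions: that everything after `text_search` is collected as the keyword list, and that the keywords are joined with a single space as in the snippet quoted at the top of this issue. The other options from the command are left out for brevity.

```python
import shlex

# The command as typed; a raw string keeps the backslashes intact
command = r'python3 main.py text_search \"John Doe\" Pharmaceuticals'

# shlex.split follows POSIX shell rules: \" becomes a literal quote
# character, and the unquoted space still separates two arguments
argv = shlex.split(command)

# Assumption: the arguments after the subcommand are the keywords
keywords = argv[3:]
print(keywords)

# Joining with a space (as in the snippet from the issue) restores the phrase
search_text = " ".join(keywords)
print(search_text)
```

The literal quote characters survive as part of the first two tokens, so joining them back together reproduces the quoted phrase followed by the unquoted keyword.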

wenlambdar commented 7 months ago

Hey :) I'm working on this at the moment. I suggest removing the --exact_search flag and keeping only the inline syntax, e.g. \"John Doe\", because the two can interact in unexpected ways and produce less precise searches than the user intended. To reproduce, compare the search parameters shown on the page for:


# With inline exact search keywords and exact_search = False 
python3 main.py text_search \"John Doe\" Pharmaceuticals -o "results_test.csv" --min_wait 3.0 --max_wait 4.0 -r 2 -b "chrome" -h False --exact_search False

# With inline exact search keywords and exact_search = True 
python3 main.py text_search \"John Doe\" Pharmaceuticals -o "results_test.csv" --min_wait 3.0 --max_wait 4.0 -r 2 -b "chrome" -h False --exact_search

In the first case, the search text is as expected: "John Doe" Pharmaceuticals

However, when inline exact-search keywords and the exact_search flag are both used, an extra quote is added along the way, so the rendered search text is ""John Doe" Pharmaceuticals (I'm not sure why the closing quote is not added). This returns broader results (checked on a few test searches).
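The extra leading quote can be reproduced in isolation with the join logic quoted at the top of this issue. This is only a sketch: `join_keywords` is a hypothetical wrapper around those two lines, and the token list assumes the shell has already turned `\"` into literal quote characters. Note that this logic on its own also appends a closing quote, so the missing closing quote in the rendered text must come from a later step that this sketch does not cover.

```python
def join_keywords(keywords, exact_search):
    """Hypothetical wrapper around the join logic quoted in the issue."""
    joined = " ".join(keywords)
    return f'"{joined}"' if exact_search else joined


# Tokens as the program would receive them after shell unescaping (assumed)
tokens = ['"John', 'Doe"', 'Pharmaceuticals']

inline_only = join_keywords(tokens, exact_search=False)
inline_and_flag = join_keywords(tokens, exact_search=True)

print(inline_only)      # inline quotes only
print(inline_and_flag)  # inline quotes wrapped in a second pair of quotes
```

The second string starts with two consecutive quote characters, which matches the doubled opening quote seen in the search box.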

I've updated the README and proposed removing the exact_search flag in this PR: https://github.com/bellingcat/EDGAR/pull/16