bellingcat / EDGAR

Tool for the retrieval of corporate and financial data from the SEC
https://colab.research.google.com/github/bellingcat/EDGAR/blob/main/notebook/Bellingcat_EDGAR_Tool.ipynb
GNU General Public License v3.0
95 stars · 12 forks

Refactoring text search + RSS feed parsing + adding CLI #10

Closed · wenlambdar closed this 4 months ago

wenlambdar commented 4 months ago

Changes in this PR:

- Core
- Performance
- Portability
- Documentation

wenlambdar commented 4 months ago

Quick update:

Found a bug in the JSON export functionality while testing over 6 months of data for a large banking institution. Nothing too bad: it only arises for requests returning more than 10k results. I know the problem and will push a fix either this evening or tomorrow; it should be very quick.

GalenReich commented 4 months ago

When testing using

python main.py text_search synergy vibration quantum --start_date "2000-01-01"

The following Selenium error was thrown but did not break the running of the script:

Successfully fetched URL: https://www.sec.gov/edgar/search/#/q=synergy%2520vibration%2520quantum&dateRange=custom&startdt=2000-01-01&enddt=2024-02-21&page=1&page=1
Fetched 100 rows
Function <lambda> raised a NoSuchElementException error, returning None: Message: no such element: Unable to locate element: {"method":"tag name","selector":"a"}
  (Session info: chrome=121.0.6167.185); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
        GetHandleVerifier [0x00007FF75ED27012+3522402]
        (No symbol) [0x00007FF75E948352]
        (No symbol) [0x00007FF75E7F5ABB]
        (No symbol) [0x00007FF75E83BF0E]
        (No symbol) [0x00007FF75E83C08C]
        (No symbol) [0x00007FF75E8312AC]
        (No symbol) [0x00007FF75E85F09F]
        (No symbol) [0x00007FF75E83120A]
        (No symbol) [0x00007FF75E85F270]
        (No symbol) [0x00007FF75E87BDA3]
        (No symbol) [0x00007FF75E85EE03]
        (No symbol) [0x00007FF75E82F4D4]
        (No symbol) [0x00007FF75E8305F1]
        GetHandleVerifier [0x00007FF75ED59B9D+3730157]
        GetHandleVerifier [0x00007FF75EDAF02D+4079485]
        GetHandleVerifier [0x00007FF75EDA75D3+4048163]
        GetHandleVerifier [0x00007FF75EA7A649+718233]
        (No symbol) [0x00007FF75E954A3F]
        (No symbol) [0x00007FF75E94FA94]
        (No symbol) [0x00007FF75E94FBC2]
        (No symbol) [0x00007FF75E93F2E4]
        BaseThreadInitThunk [0x00007FFCC582257D+29]
        RtlUserThreadStart [0x00007FFCC6CAAA58+40]
        RtlUserThreadStart [0x00007FFCC6CAAA58+40]

Function <lambda> raised a AttributeError error, returning None: 'NoneType' object has no attribute 'get_attribute'
Function <lambda> raised a AttributeError error, returning None: 'NoneType' object has no attribute 'get_attribute'
wenlambdar commented 4 months ago

> When testing using
>
> python main.py text_search synergy vibration quantum --start_date "2000-01-01"
>
> The following Selenium error was thrown but did not break the running of the script:

Nice to see you try it out!

This is why I want to replace print calls with proper logging. These errors are handled as expected: they occur during parsing when one "column" is not found on a results row, which can happen depending on the type of document.

With a proper logging library these would be at DEBUG level, hence invisible unless the user asks for verbose output.

(PS: as you may already have seen, I have pushed the CSV/JSON export fixes :) I don't think there's anything else I want to push to this PR apart from addressing any review comments, WDYT? I'd say we can keep logging for another PR.)

GalenReich commented 4 months ago

I think my final bug is around the handling of newlines and commas in the response fields (particularly bad in csv output)

python main.py text_search Tsunami Hazards --start_date "2021-01-01" --end_date "2021-12-31" --entity-id "0000889169" --output "results.json"

And a snippet of the response:

"entity_name": "BNY Mellon Investment Grade Funds, Inc.\nDreyfus Institutional Reserves Funds\nBNY Mellon Sustainable U.S. Equity Fund, Inc.\nDREYFUS TAX EXEMPT CASH MANAGEMENT FUNDS\nDREYFUS GOVERNMENT CASH MANAGEMENT FUNDS\nDREYFUS CASH MANAGEMENT\nDreyfus Treasury Obligations Cash Management\nBNY MELLON SHORT-INTERMEDIATE MUNICIPAL BOND FUND\nDREYFUS TREASURY SECURITIES CASH MANAGEMENT\nDREYFUS AMT-FREE MUNICIPAL CASH MANAGEMENT PLUS\nDREYFUS AMT-FREE NEW YORK MUNICIPAL CASH MANAGEMENT\nDREYFUS INSTITUTIONAL PREFERRED MONEY MARKET FUNDS\nBNY MELLON OPPORTUNITY FUNDS\nCITIZENSSELECT FUNDS\nBNY Mellon Large Cap Securities Fund, Inc.\nDREYFUS LIQUID ASSETS, INC.",
"company_cik": "0000889169",
"place_of_business": "New York, NYNew York, NYNew York, NYNew York, NYNew York, NYNew York, NYNew York, NYNew York, NYNew York, NYNew York, NYNew York, NYNew York, NYNew York, NYNew York, NYNew York, NYNew York, NY",
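For reference, Python's `csv` module already handles embedded commas and newlines by quoting fields, so values like the multi-entity name above round-trip cleanly if rows are written with `csv.writer` rather than by joining strings manually. A generic illustration (not the PR's code; the field names just mirror the snippet above):

```python
import csv
import io

# A field containing both a comma and a newline, like the multi-entity rows
row = [
    "BNY Mellon Investment Grade Funds, Inc.\nDreyfus Institutional Reserves Funds",
    "0000889169",
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["entity_name", "company_cik"])
writer.writerow(row)  # fields with commas/newlines are quoted automatically

buf.seek(0)
parsed = list(csv.reader(buf))  # parsed[1] matches the original row exactly
```

The quoting means spreadsheet tools such as Google Sheets reimport the field as a single cell instead of splitting it across rows.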

wenlambdar commented 4 months ago

> I think my final bug is around the handling of newlines and commas in the response fields (particularly bad in csv output)

Oh I hadn't noticed such cases yet, let me reproduce and check it out.

GalenReich commented 4 months ago

This is fantastic work and really elevates the EDGAR tool. It will enable a much wider range of people to access it, and hopefully use it for interesting investigations! So thank you for all the effort you have put in šŸ™Œ

wenlambdar commented 4 months ago

Identified the bug while checking the request in-browser.

I hadn't encountered this specific case of multiple Entities/CIKs/Places in the same row before. It's not trivial to solve; I'll start working on the fix today.

wenlambdar commented 4 months ago

Encountering quite a few issues with stale DOM elements during parsing, occasionally causing entire pages to fail. It seems to occur more often since I fixed the parsing to deal with multiline rows.

--> Will extract the HTML as early as possible and move this ugly Selenium-based parsing to BeautifulSoup, which should let me remove some code complexity.

Sorry this is taking a bit of time, but the website is practically a scraping certification exam; it has edge cases all over 😁 We're very close to the end, we just need to fix this to make sure we're not getting DOM-related failures.
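The "extract HTML early, parse offline" idea avoids stale-element errors because, once the page source is captured as a string, parsing never touches the live DOM again. A minimal sketch with BeautifulSoup, assuming `bs4` is an acceptable dependency; the HTML, class names, and field names here are invented stand-ins, not EDGAR's actual markup:

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source, captured once while the page is live
html = """
<table><tr class="result">
  <td class="entity">Example Corp</td>
  <td class="cik">0000000000</td>
</tr></table>
"""

# From here on, no Selenium calls: a stale DOM can no longer break parsing
soup = BeautifulSoup(html, "html.parser")
rows = [
    {td["class"][0]: td.get_text(strip=True) for td in tr.find_all("td")}
    for tr in soup.find_all("tr", class_="result")
]
```

A side benefit is that the parsing logic becomes unit-testable against saved HTML fixtures, without a browser.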

GalenReich commented 4 months ago

> Sorry this is taking a bit of time, but the website is practically a scraping certification exam; it has edge cases all over 😁 We're very close to the end, we just need to fix this to make sure we're not getting DOM-related failures.

Absolutely no need to apologise, web scraping is a pain at the best of times, let alone when you have something that looks like it's a coherent data structure and actually isn't!

wenlambdar commented 4 months ago

Done! I ran it over 10 months of Goldman Sachs history (all document types included) to stress-test it. I got one of those weird stale-element stacktraces after execution finished, but no serious problem occurred during execution; it worked like a charm.

Exported the full file to Google Sheets to verify proper CSV formatting, worked fine (edgar_search_results_23022024_125817.csv).

Added better logging when a row fails: we'll know the exact search request URL and the row number, so users can easily locate the problem and re-run the request if needed. This is also useful for contributors to debug any parsing problem.

GalenReich commented 4 months ago

That's great, definitely clearer parsing!

I'm still getting funky line-breaks in the csv output, but I think it's easy to fix now. I suggest removing the "\n".join() throughout and just passing the bare lists or unpacking in the event of a single item. I wrote it as:

from typing import List, Union

def unpack_singleton_list(l: List) -> Union[str, List]:
    """Return the sole element of a single-item list, else the list itself."""
    return l if len(l) != 1 else l[0]

Which I can commit and push if you're happy?

GalenReich commented 4 months ago

Also, was there a reason for removing json output?

GalenReich commented 4 months ago

Finally, do you think the parsing for the RSS feed needs updating?

wenlambdar commented 4 months ago

> Also, was there a reason for removing json output?

My mistake, re-adding it to constants.py should suffice.

> I suggest removing the "\n".join() throughout and just passing the bare lists or unpacking in the event of a single item.

Sounds good to me :)

> Finally, do you think the parsing for the RSS feed needs updating?

I think it could be refactored with better row-level error handling and improved readability! I can do that!
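Row-level error handling for a feed usually means wrapping each entry in its own try/except so one malformed item is skipped and logged instead of aborting the whole parse. A simplified, self-contained sketch (not the PR's actual refactor; it uses stdlib `xml.etree` and generic RSS element names):

```python
import logging
import xml.etree.ElementTree as ET

logger = logging.getLogger("edgar_tool.rss")

def parse_rss_items(rss_text: str) -> list:
    """Parse RSS items one by one; a malformed item is skipped, not fatal."""
    root = ET.fromstring(rss_text)
    results = []
    for i, item in enumerate(root.iter("item")):
        try:
            results.append({
                "title": item.find("title").text,
                "link": item.find("link").text,
            })
        except AttributeError:  # a missing <title> or <link> in this item
            logger.warning("Skipping malformed RSS item %d", i)
    return results

rss = """<rss><channel>
  <item><title>10-K filing</title><link>https://example.com/1</link></item>
  <item><title>Broken item</title></item>
</channel></rss>"""
items = parse_rss_items(rss)  # the broken second item is skipped
```

The logged index makes it easy to locate the offending entry in the raw feed when debugging.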

wenlambdar commented 4 months ago

Refactor of RSS parsing done!

msramalho commented 4 months ago

Thanks for all the work going into this refactoring @wenlambdar @opskov 🄇

wenlambdar commented 4 months ago

Thanks @msramalho, the pleasure was mine, quite a fun exercise to be honest! :) (btw, the @opskov commits are just due to me misconfiguring my local git settings šŸ˜…)