bellingcat / EDGAR

Tool for the retrieval of corporate and financial data from the SEC
https://colab.research.google.com/github/bellingcat/EDGAR/blob/main/notebook/Bellingcat_EDGAR_Tool.ipynb
GNU General Public License v3.0

Write test suite #11

Open jordan-gillard opened 4 months ago

jordan-gillard commented 4 months ago

I need to preface this issue by saying I love the work you do and that I think this software is very cool.

I noticed that this repo doesn't have unit tests. As a result, breaking changes and regressions are difficult to catch. I know you guys want to publish EDGAR to PyPI, so this is really no bueno. Ideally we'd want a robust test suite that runs in CI for every PR. I think this is especially relevant to PR #10, since it involves major code changes and therefore a strong potential to break existing behavior.

I'm interested in helping Bellingcat, so I'd be happy to work on this or take on an advisory role.

GalenReich commented 4 months ago

Hi Jordan, thanks for opening this issue! We would really welcome a PR for a test suite 🙌 Hopefully the refactor in #10 (now merged) will help with that as there are now more atomic functions.

If you wanted to join, we have a channel in our Discord dedicated to development of this tool https://discord.gg/bellingcat

JackCollins91 commented 1 month ago

Hi @jordan-gillard and @GalenReich I'd be happy to write a starting test suite.

My idea would be to engineer it such that the test functions go as far as formulating the URLs of the API calls and then validate that the URLs are properly formed, but validate nothing about what is returned by those calls.

This way the test suite is not dependent on the EDGAR API to pass.
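For example, something like the sketch below - build_search_url here is just a placeholder for whatever URL-building function the refactored code ends up exposing, and the endpoint/parameters are illustrative, not the real edgar_tool API:

from urllib.parse import parse_qs, urlparse

# Placeholder standing in for edgar_tool's real URL-building function
def build_search_url(keywords: str, start_date: str) -> str:
    return (
        "https://efts.sec.gov/LATEST/search-index"
        f"?q={keywords}&startdt={start_date}"
    )

def test_search_url_is_well_formed():
    url = build_search_url("apple", "2020-01-01")
    parsed = urlparse(url)
    # Validate the URL's structure without ever contacting EDGAR
    assert parsed.scheme == "https"
    assert parsed.netloc == "efts.sec.gov"
    assert parse_qs(parsed.query)["q"] == ["apple"]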

Then I could introduce a test which does try the API, but only raises a warning if it doesn't work. This way if there's an issue with the EDGAR API, we know there's a problem, but it doesn't cause unit testing to outright fail.
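A rough sketch of what I mean, with an example endpoint (a custom pytest marker could also let CI deselect this kind of check entirely):

import warnings

import requests

def test_edgar_api_is_reachable():
    """Flag EDGAR API problems without failing the suite."""
    try:
        response = requests.get(
            "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany",
            timeout=10,
        )
        response.raise_for_status()
    except requests.RequestException as exc:
        # Warn rather than fail so an EDGAR outage doesn't break CI
        warnings.warn(f"EDGAR API check failed: {exc}")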

Let me know your thoughts. It is also fine to design the tests to go as far as hitting the API and retrieving results; we would just need to establish some queries that we know will always return a certain result.

rkiddy commented 1 week ago

@jordan-gillard Is there anything in progress on this? I am not seeing anything in your fork. Which mocking modules do you usually use? I can start something but would want to do something compatible with where you are going.

GalenReich commented 1 week ago

I'm going to unassign @jordan-gillard as we haven't heard from him in a little while, and it seems that there is some enthusiasm here!

Jordan, if you want to get back involved in an advisory role, just hop back in, we'd be grateful to have you!

For @JackCollins91 and @rkiddy, we have used pytest in other repos, so that might be the way to go here (though I don't have a strong opinion if you wanted to advocate for a different approach). I like Jack's idea of having something to test the response from the API - sudden changes to 3rd party APIs cause problems for us across our tools. Setting the start_date and end_date fields sufficiently in the past should give consistent results.
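For instance, something along these lines might work - note that the full-text search endpoint, its startdt/enddt parameters, and the response shape are my assumptions here, and the pinned count is illustrative and would need to be established once by hand:

import requests

def test_closed_date_window_is_stable():
    # Filings in a window entirely in the past shouldn't change, so any
    # drift in the result count signals an upstream change in the API
    response = requests.get(
        "https://efts.sec.gov/LATEST/search-index",
        params={"q": '"bellingcat"', "startdt": "2021-01-01", "enddt": "2021-01-31"},
        headers={"User-Agent": "edgar-tool test suite"},
        timeout=30,
    )
    response.raise_for_status()
    assert response.json()["hits"]["total"]["value"] == 3  # illustrative value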

I'm going to assign Jack to this issue as he's posted a bit of starting detail, but I'd encourage you both to collaborate on this. Though if you'd rather work independently, I'd also welcome a PR to add tests to ShadowFinder (https://github.com/bellingcat/ShadowFinder).

Let me know if you'd like to collaborate on this or be assigned elsewhere.

JackCollins91 commented 6 days ago

Hey @rkiddy, I'm happy to implement this, but wouldn't have time until about September (maybe sooner, but not certain). If you're enthusiastic and have time before then, you're welcome to take over this PR and I'd be happy to chat/code review or advise, etc.

If not, no problem - I'll still get to this as soon as I clear my present crunch time.

Cheers

jordan-gillard commented 4 days ago

Hi folks! So sorry for being MIA. I've been swamped with some new work responsibilities.

@JackCollins91 I initially saw your comment on my phone in the morning, and I thought it came from Discord! I couldn't find it (obviously) and then brushed it off 😅

@rkiddy In my experience, almost all Python testing can be accomplished with Python's built-in unittest library and pytest.

As for testing, I think anything is better than nothing. When adding tests to existing code, it's often easiest to start with integration tests. I definitely agree with @JackCollins91's assessment that we shouldn't use the actual EDGAR API. You could mock out network calls so that they always return a predetermined response. That's usually the way to go. This project uses the requests library, and mocking responses is pretty straightforward. I'd use a pytest fixture and the requests-mock library.

For example, if you were testing fetch_rss_feed in rss.py, you'd want to mock out the requests.get call here: https://github.com/bellingcat/EDGAR/blob/97b60fd6fa22d3987f08a0bc3134a4e8d9e276cc/edgar_tool/rss.py#L221

Your test file, let's call it test_rss.py, could accomplish this with a pytest fixture:

import pytest
import requests_mock

@pytest.fixture
def mock_requests_get():
    # Intercept every call made through the requests library for the test
    with requests_mock.Mocker() as m:
        yield m

and then the test would use the fixture and set the response you are testing against:

def test_fetch_rss_feed(mock_requests_get):
    """Tests that we do x when the api returns y
    yadda yadda yadda
    """
    # GIVEN
    sample_rss_feed_response = """
    <rss>
        <channel>
            <item>
                <title>Something something idk</title>
            </item>
        </channel>
    </rss>
    """
    mock_requests_get.get(RSS_FEED_URL, text=sample_rss_feed_response)

    # WHEN
    fetch_rss_feed(...)

    # THEN (though you'd surely want to test more than this)
    # requests_mock's Mocker doesn't have unittest.mock's assert_* helpers;
    # use its called_once / last_request attributes instead
    assert mock_requests_get.called_once
    assert mock_requests_get.last_request.url == RSS_FEED_URL

Please don't wait on me to add tests! I will work on it today but tbh I don't even remember what I was up to when I last checked out the repo 😆

rkiddy commented 18 hours ago

Sorry if this seems lame, but can someone demonstrate how to run something from the local source? I do a lot with Python, but mostly with Flask apps, and I am not familiar with this packaging style.

I tried:

% virtualenv .venv
% ./.venv/bin/python3 -m pip install edgar-tool

But then, how do I execute local code? I made a change to the help string in both edgar_tool/rss.py and in edgar_tool/.venv/lib/python3.10/site-packages/edgar_tool/rss.py, and running this shows no change:

./.venv/bin/edgar-tool rss --help

I am sure I am confused about something here. If I can run something local, I can write tests.

JackCollins91 commented 9 hours ago

Hey @rkiddy no worries, mate. I'm not sure what machine you're using, but here's what I did:

1) Download/install Miniconda
2) Create a virtual environment with Miniconda, not venv: conda create -n edgar then conda activate edgar
3) cd <your EDGAR repo path>
4) pip install the dependencies (and poetry if that's not included)
5) Now you can run the CLI in a local dev mode like: poetry run edgar-tool ...

Let me know if that works:)

GalenReich commented 8 hours ago

The project uses poetry to handle virtual environments, so even using conda or virtualenv isn't necessary (though to install poetry you're welcome to do what you like - poetry recommends using pipx for the install: pipx install poetry!).

As per the README:

# Install poetry if you haven't already
pip install poetry

# Install dependencies
poetry install

# Run the tool
poetry run edgar-tool --help

When you do poetry install it creates a virtual environment managed by poetry (if you use an IDE that lets you select a python interpreter it will often autodetect the poetry environment). All dependencies will be installed in the poetry virtual environment. Then you can run commands within the virtual environment with poetry run .... Hope that helps! 🎉
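And once a test suite from this issue lands, the same pattern will run it - a sketch, assuming pytest and requests-mock get added as dev dependencies (the --group flag needs Poetry 1.2+):

# Add the test dependencies (Poetry 1.2+ group syntax)
poetry add --group dev pytest requests-mock

# Run the suite inside the poetry-managed environment
poetry run pytest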