bellingcat / EDGAR

Tool for the retrieval of corporate and financial data from the SEC
https://colab.research.google.com/github/bellingcat/EDGAR/blob/main/notebook/Bellingcat_EDGAR_Tool.ipynb
GNU General Public License v3.0

Fix result duplication #30

Closed GalenReich closed 3 months ago

GalenReich commented 3 months ago

As @NauelSerraino pointed out in #26, the tool returns a large number of duplicate results. The cause is a mistake in how the search URLs are encoded, which leaves multiple page arguments in each URL:

&page=1&page=2

As a result, the first page of results was requested over and over 🤦‍♀️

This PR removes the leading page=1 from the URL, so the final page argument is used instead.
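For illustration, here is a minimal sketch of one way to guarantee a single page argument in a search URL. It is not the code from this PR: the helper name `set_page` and the `example.invalid` URL are hypothetical, and only the Python standard library is used.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def set_page(url: str, page: int) -> str:
    """Return `url` with exactly one `page` argument (hypothetical helper)."""
    parts = urlsplit(url)
    # Drop every existing `page` argument, keeping all other parameters in order.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "page"
    ]
    # Append the single page argument we actually want.
    params.append(("page", str(page)))
    return urlunsplit(parts._replace(query=urlencode(params)))


# Placeholder URL, not the real SEC endpoint.
base = "https://example.invalid/search?q=abc&page=1"
print(set_page(base, 2))
```

Stripping any existing page argument before appending a new one avoids depending on which duplicate the server honours.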

Closes #26

GalenReich commented 3 months ago

Gah, I just double-checked before merging, and this isn't sufficient to get the correct behaviour. It looks like the from argument is also needed.

The SEC web interface queries &page=2&from=100, &page=3&from=200, etc.
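The pattern above can be sketched as follows. This is an assumption-laden illustration, not the tool's actual code: the page size of 100 is inferred from the from=100 and from=200 values, the idea that page 1 maps to from=0 is a guess, and the names `PAGE_SIZE`, `paging_params` and `paging_query` are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: 100 results per page, inferred from the from=100 / from=200 pattern.
PAGE_SIZE = 100


def paging_params(page: int, page_size: int = PAGE_SIZE) -> dict:
    """Return matching `page` and `from` arguments for a 1-indexed page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # Assumption: page 1 corresponds to an offset of 0.
    return {"page": page, "from": (page - 1) * page_size}


def paging_query(page: int) -> str:
    """Encode the paging arguments as a query-string fragment."""
    return urlencode(paging_params(page))


for page in (2, 3):
    print(paging_query(page))
```

Deriving both arguments from one page number keeps them consistent with each other on every request.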

I'm not entirely sure why both arguments are needed, but the response is slightly different if 'page' is dropped... the plot thickens.

GalenReich commented 3 months ago

It's just the score that is affected, so as long as we are consistent this shouldn't be a problem. Have fixed and will merge.