bellingcat / EDGAR

Tool for the retrieval of corporate and financial data from the SEC
https://colab.research.google.com/github/bellingcat/EDGAR/blob/main/notebook/Bellingcat_EDGAR_Tool.ipynb
GNU General Public License v3.0

CSV output contains duplicates #26

Closed NauelSerraino closed 3 months ago

NauelSerraino commented 4 months ago

I noticed that the output contains a ton of duplicates. Is this behaviour expected?

Reference:

import pandas as pd

CSV = r"C:\EDGAR\test\edgar_volcano_monitoring.csv"  # Output of text_search
df = pd.read_csv(CSV)

cols = df.columns.to_list()
df_no_dup = df.drop_duplicates(subset=cols)

print(f"Shape of df: {df.shape}")
print(f"Shape of df_no_dup: {df_no_dup.shape}")

>>> Shape of df: (1400, 16)
>>> Shape of df_no_dup: (101, 16)

Reproducibility:

poetry run edgar-tool text_search Volcano Monitoring

If this behaviour is not expected, I would be happy to work on it.

GalenReich commented 4 months ago

This is well spotted, and is a huge bug! It looks like every 100 rows of the output are duplicated - this is definitely not the right behaviour!

The key thing to establish is whether there are only roughly 100 unique results, or whether there are thousands and we are just duplicating the first 100 🤦
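One way to tell those two cases apart is to check whether later 100-row blocks of the CSV are exact copies of the first block. The following is a minimal sketch, not the project's own code: the helper name `count_repeated_blocks`, the block size of 100, and the tiny synthetic DataFrame standing in for the real CSV are all assumptions made for illustration.

```python
import pandas as pd


def count_repeated_blocks(df: pd.DataFrame, block_size: int = 100) -> int:
    """Count later blocks of `block_size` rows that exactly repeat the first block.

    Hypothetical helper for diagnosing the duplication; not part of edgar-tool.
    """
    first = df.iloc[:block_size].reset_index(drop=True)
    repeats = 0
    for start in range(block_size, len(df), block_size):
        block = df.iloc[start:start + block_size].reset_index(drop=True)
        # DataFrame.equals treats NaNs in matching positions as equal,
        # and returns False for a shorter trailing block.
        if block.equals(first):
            repeats += 1
    return repeats


# Synthetic stand-in for the CSV: one 3-row "page" written four times.
page = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
demo = pd.concat([page] * 4, ignore_index=True)

print(count_repeated_blocks(demo, block_size=3))
```

On the real file you would pass the DataFrame loaded with `pd.read_csv` and keep `block_size=100`. If nearly every later block matches the first, that suggests the first page of results is being written repeatedly; if none does, the duplicates come from somewhere else.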

NauelSerraino commented 4 months ago

I think it is the latter, but I'll check once I can work on it! :)

GalenReich commented 3 months ago

Hi @NauelSerraino, I just wanted to check in and see if you'd still like to work on this? I've been curious as to why this duplication has been happening, so I might have a look and share what I find here.

GalenReich commented 3 months ago

Apologies - as I was looking into this, I found the problem, and it's a straightforward fix!

NauelSerraino commented 3 months ago

@GalenReich great! Good to know. I'm sorry, but it was quite a busy period and I didn't have time to check :)

GalenReich commented 3 months ago

Just reopening so the PR can close!