Open katie-lamb opened 2 months ago
In the EDGAR database there may be duplicate filings for a company in the same year quarter. This is not a problem with the archiver; the company may have refiled or resubmitted. What we've done with FERC is take the most recent filing for each company and year quarter, and this might be what we do for the 10Ks too.
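For illustration, here is a minimal sketch of the "keep the most recent filing per company and year quarter" approach. The field names (`cik`, `year_quarter`, `date_filed`, `filename`) and the sample rows are made up for the example; they are not the archive's actual schema or data.

```python
from datetime import date

# Hypothetical metadata rows. Field names, dates, and filenames are
# assumptions for this sketch, not real archive contents.
filings = [
    {"cik": "0001405332", "year_quarter": "2016q1",
     "date_filed": date(2016, 1, 15), "filename": "a.txt"},
    {"cik": "0001405332", "year_quarter": "2016q1",
     "date_filed": date(2016, 3, 1), "filename": "b.txt"},
    {"cik": "0000000042", "year_quarter": "2016q1",
     "date_filed": date(2016, 2, 10), "filename": "c.txt"},
]


def keep_most_recent(rows):
    """Keep one row per (cik, year_quarter): the one with the latest date_filed."""
    latest = {}
    for row in rows:
        key = (row["cik"], row["year_quarter"])
        if key not in latest or row["date_filed"] > latest[key]["date_filed"]:
            latest[key] = row
    return list(latest.values())


deduped = keep_most_recent(filings)
```

If two filings share the same filing date, this sketch keeps whichever comes first in the input, so a real version would probably want an explicit tie-breaker.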
Seems like the only real issue is to make the CIK a string instead of an int.
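One reason a string is a safer type here: EDGAR CIKs are commonly written as 10-digit, zero-padded identifiers, and an int drops the leading zeros. A small sketch of a normalizing helper follows; `normalize_cik` is a hypothetical name, not something that exists in the archiver.

```python
def normalize_cik(cik):
    """Return the CIK as a zero-padded, 10-character string.

    Accepts an int or a numeric string (hypothetical helper for illustration).
    """
    return str(int(cik)).zfill(10)


padded = normalize_cik(1405332)
```

Going through `int()` first means that `1405332`, `"1405332"`, and `"0001405332"` all normalize to the same value.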
I'm in the process of creating a training dataset and am realizing that it would be nice to have a primary key for the SEC 10K filing archive that refers to each filing uniquely. It seems like CIK is just the ID for a filing company, not a primary key for the filings. A few questions:
- `GCSArchive().get_metadata()`: CIK should probably be a string in the metadata? This might avoid conflicting CIKs for companies with very different names (see below).
- `CIK = 1405332` has duplicate filings for `2016q1`.
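A quick way to check whether a candidate key such as (CIK, year quarter) identifies filings uniquely is to count how often each key occurs. The sketch below assumes the metadata can be read as a list of dicts with `cik` and `year_quarter` fields; those names and the sample rows are hypothetical.

```python
from collections import Counter


def find_duplicate_keys(rows):
    """Return {(cik, year_quarter): count} for every key that occurs more than once."""
    counts = Counter((row["cik"], row["year_quarter"]) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Made-up rows for illustration only.
sample = [
    {"cik": "0001405332", "year_quarter": "2016q1"},
    {"cik": "0001405332", "year_quarter": "2016q1"},
    {"cik": "0001405332", "year_quarter": "2016q2"},
]
duplicates = find_duplicate_keys(sample)
```

An empty result would mean the pair works as a primary key for the rows checked; any entries show which company and quarter combinations need deduplicating or an extra key column.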