dgunning / edgartools

Python library for working with SEC Edgar
MIT License

ability to load previously saved filings #36

Closed: ilias-ant closed this issue 3 months ago

ilias-ant commented 3 months ago

In the spirit of ETL, it would be beneficial for downstream applications of edgartools to be able to separate the persistence of company filings (e.g. in a storage backend like S3) from their parsing (e.g. extracting the balance sheet data), in order to engineer more flexible data pipelines.

The library currently provides the ability to persist a company filing (e.g. through .full_text_submission() on a filing or through .download() on attachments) but, based on my experience as an application user, there doesn't seem to be a straightforward way to load a saved company filing and continue the parsing (e.g. filing.obj()) from there.

I suspect that this might not be the vision and philosophy of edgartools (and I totally respect that); I am just pitching the angle in order to discuss whether this resonates with you and the community.

P.S. I really, really like how the library is designed! Kudos for the effort, this is monumental work! <3

P.S. 2: Happy to help with the design / implementation of said feature (if it first passes feedback, of course).

dgunning commented 3 months ago

Thanks for this well written issue. I think it is a valid use case and I would like to make it more straightforward.

Do you mean

Is this just for a Filing, or for Filings too?

Let me know and we can collaborate on this.

ilias-ant commented 3 months ago

Hello!

Yes, this is what I had in mind.

I think Filing should suffice, since it's the only object (afaik) that is part of obj's signature.

def obj(sec_filing: Filing) -> Optional[object]:
    ...

So, it comes down to sensible serialization / deserialization of the Filing object, right?

dgunning commented 3 months ago

A couple of considerations:

  1. Protocol. The options are JSON and pickle. Pickle might be better.
  2. Filename. The accession_number is unique per filing, so use a name like 0000000000-24-000001.pkl.
  3. Directory. It probably does not make sense to save in the current directory, so require a directory path to be passed in.
  4. Cloud. Would saving to a directory cover cloud serialization?
  5. Saving the Filings. In addition to saving a Filing, if you save and restore Filings you can retrieve a collection of filings, which would be good for ETL.

Thoughts?
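
To make the considerations above concrete, here is a minimal sketch of a pickle round trip into a required directory, with the file named after the accession number. StubFiling, save_filing, load_filing and load_filings are hypothetical names invented for this illustration; they are not the edgartools API, and the real Filing object carries far more state.

```python
import pickle
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class StubFiling:
    """Hypothetical stand-in for a Filing; NOT the edgartools class."""
    accession_number: str
    form: str
    company: str


def save_filing(filing: StubFiling, directory: Path) -> Path:
    """Pickle one filing to <directory>/<accession_number>.pkl."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{filing.accession_number}.pkl"
    with path.open("wb") as f:
        pickle.dump(filing, f)
    return path


def load_filing(path: Path) -> StubFiling:
    """Restore one filing. Only unpickle files you wrote yourself."""
    with path.open("rb") as f:
        return pickle.load(f)


def load_filings(directory: Path) -> list[StubFiling]:
    """Restore every pickled filing in a directory, sorted by file name."""
    return [load_filing(p) for p in sorted(directory.glob("*.pkl"))]


# Round trip inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "filings"
    original = StubFiling("0000000000-24-000001", "10-K", "Example Corp")
    saved_path = save_filing(original, store)
    restored = load_filing(saved_path)
    restored_all = load_filings(store)
```

Because the accession number is unique per filing, it doubles as a natural key: a downstream ETL step can glob the directory and rebuild the whole collection without a separate index. Cloud storage is not covered here; a local directory that gets synced or mounted is one possible answer to point 4.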

dgunning commented 3 months ago

Released 2.12.0 with filing.save(). You can now save to a directory, with the file automatically named by its accession number (e.g. 0000000000-24-000000.pkl), or you can provide your own file name.
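
The "directory or explicit file name" behaviour described above could be resolved along the following lines. This is only a sketch under stated assumptions: resolve_save_path is a hypothetical helper, and the rule it uses (a target without a file extension is treated as a directory) is a guess made for illustration, not how edgartools decides.

```python
from pathlib import Path


def resolve_save_path(target: str, accession_number: str) -> Path:
    """Hypothetical helper: choose the file a filing would be written to."""
    path = Path(target)
    # Assumption: a target with no file extension is meant as a directory,
    # so the file name is derived from the accession number.
    if path.suffix == "":
        return path / f"{accession_number}.pkl"
    # Otherwise the caller supplied an explicit file name; use it unchanged.
    return path


auto_named = resolve_save_path("filings", "0000000000-24-000000")
custom_named = resolve_save_path("filings/my-10k.pkl", "0000000000-24-000000")
```

A real implementation would more likely check whether the target already exists as a directory, which avoids misreading directory names that happen to contain a dot.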