dgunning / edgartools

Navigate SEC Edgar data in Python
MIT License

Getting the below issue when using filing.text() #39

Closed Su-Ku-2000 closed 5 months ago

Su-Ku-2000 commented 5 months ago

2024-04-14 15:25:01,042 - root - INFO - Attachment for 0001047469-02-007674.txt -> EX-99.1.txt downloaded.
Traceback (most recent call last):
  File "/Users/test.py", line 94, in <module>
    download_filings_and_attachments(fillings10K, dir_path_10K)
  File "/Users/test.py", line 57, in download_filings_and_attachments
    f.write(filing.text())
  File "/Users/sumithkumars/Library/Python/3.9/lib/python/site-packages/edgar/_filings.py", line 1671, in text
    return HtmlDocument.from_html(html_content).text
  File "/Users/sumithkumars/Library/Python/3.9/lib/python/site-packages/edgar/documents.py", line 422, in from_html
    root: Tag = cls.get_root(html)
  File "/Users/sumithkumars/Library/Python/3.9/lib/python/site-packages/edgar/documents.py", line 412, in get_root
    if "" in html[:500]:
TypeError: a bytes-like object is required, not 'str'

Su-Ku-2000 commented 5 months ago

Some examples of filings where we are getting this error:

2024-04-14 16:23:43,658 - root - INFO - a bytes-like object is required, not 'str' occurred for the filing --> 0000912057-00-023442
2024-04-14 16:23:43,821 - root - INFO - a bytes-like object is required, not 'str' occurred for the filing --> 0000912057-00-003201
2024-04-14 16:23:45,927 - root - INFO - a bytes-like object is required, not 'str' occurred for the filing --> 0000320193-97-000002
2024-04-14 16:23:46,135 - root - INFO - a bytes-like object is required, not 'str' occurred for the filing --> 0000320193-96-000018
2024-04-14 16:23:45,325 - root - INFO - a bytes-like object is required, not 'str' occurred for the filing --> 0000320193-98-000003

dgunning commented 5 months ago

Thanks for reporting. I've looked into this and we have to add special handling for old filings.
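
For anyone hitting the same message: Python raises this TypeError when a str is tested for membership in a bytes object, which suggests the content of these old filings reaches the check as bytes. The snippet below is a minimal sketch of that failure and one way to guard against it; it is not the fix that went into edgartools. The `ensure_text` helper, the sample content, and the `"<TYPE>"` marker are all assumptions made for illustration (the actual string checked in `get_root` is not visible in the traceback).

```python
def ensure_text(content, encoding="utf-8"):
    """Hypothetical helper: return content as str, decoding bytes if needed."""
    if isinstance(content, bytes):
        # errors="replace" keeps going when an old filing has odd bytes
        return content.decode(encoding, errors="replace")
    return content


# Assumed sample: the start of a plain-text filing, received as bytes
raw = b"<DOCUMENT>\n<TYPE>10-K\nplain text body of an old filing"

# Testing a str against bytes reproduces the error from the traceback
try:
    "<TYPE>" in raw[:500]
except TypeError as exc:
    error_message = str(exc)

# After normalising to str, the same membership test works
has_type_tag = "<TYPE>" in ensure_text(raw)[:500]

print(error_message)
print(has_type_tag)
```

Until you can upgrade, wrapping the `filing.text()` call in a try/except for TypeError, as the log lines above suggest was already done, lets a batch download continue past the affected filings.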

Su-Ku-2000 commented 5 months ago

Hey @dgunning, thanks for checking this. Will the fix be implemented in the near future? I have one more request: could you also expose a method that takes HTML and returns its text content, so that we can get the text of the linked exhibits as well?

dgunning commented 5 months ago

Fixed in 2.18.0.

Also see html_to_text, which you can import with: from edgar.documents import html_to_text
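
If you cannot upgrade yet, or want to see roughly what an HTML-to-text step involves, here is a minimal standard-library sketch. It is not the edgartools implementation and does not reproduce the behaviour of `html_to_text`; the names `_TextExtractor` and `extract_text` are hypothetical, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def extract_text(html):
    """Hypothetical helper: return the visible text of an HTML string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><body><h1>Exhibit 99.1</h1>"
    "<style>p{}</style><p>Press &amp; release</p></body></html>"
)
print(extract_text(sample))
```

This sketch puts every text node on its own line, so inline markup such as bold text inside a paragraph gets split across lines. For real exhibits the library function is the better choice.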

dgunning commented 5 months ago

Fixed