dgunning / edgartools

Navigate SEC Edgar data in Python
MIT License
528 stars 105 forks source link

Accessing specific section of the financial statement/Foot Notes #98

Closed david08-08 closed 2 months ago

david08-08 commented 3 months ago

Good afternoon I would like to know if it is possible to import the tables to the footnotes into excel. As an example the EarningsPerShare statement I have included a screen shot of how it is coming into excel.

Screenshot 2024-08-30 at 4 24 33 PM

The code I am using is below. I am trying to figure this out any help would be greatly appreciated. Thank you.

from edgar import from edgar.financials import from edgar.xbrl import * from pandas import ExcelWriter from pathlib import Path import pandas as pd from bs4 import BeautifulSoup

def extract_and_clean_tables(html_string): """ Function to extract and clean tables from the HTML content. """ soup = BeautifulSoup(html_string, 'html.parser') tables = soup.find_all('table') cleaned_tables = []

for table in tables:
    rows = []
    for row in table.find_all('tr'):
        cells = row.find_all(['td', 'th'])
        row_data = [cell.get_text(strip=True) for cell in cells]
        rows.append(row_data)

    if rows:
        df = pd.DataFrame(rows)
        cleaned_tables.append(df)

return cleaned_tables

def write_statements(xbrl: XBRLData): output_path = Path('/Users/Documents/Apple Financials Test.xlsx') with output_path.open('wb') as f: with ExcelWriter(f, engine='xlsxwriter') as writer: workbook = writer.book number_format = workbook.add_format({'num_format': '0'}) # or '0.00' for two decimal places

        for index, statement in enumerate(xbrl.statements):
            df = statement.data.reset_index()

            if df is not None and not df.empty:
                # Remove HTML tags from the DataFrame
                df = df.applymap(lambda x: BeautifulSoup(str(x), "html.parser").get_text(separator=' ') if isinstance(x, str) else x)

                # Convert columns that should be numeric to the appropriate type
                df = df.apply(pd.to_numeric, errors='ignore')

                # Ensure the 'Year' columns are sorted in ascending order
                year_columns = sorted([col for col in df.columns if isinstance(col, str) and col.isdigit()])
                other_columns = [col for col in df.columns if col not in year_columns]
                df = df[other_columns + year_columns]  # Reorder DataFrame

                # Write the DataFrame to Excel
                df.to_excel(writer, sheet_name=statement.name[:30], index=False)

                # Apply the number format to all columns
                worksheet = writer.sheets[statement.name[:30]]
                for col_num, col in enumerate(df.columns):
                    worksheet.set_column(col_num, col_num, None, number_format)

                # Handle potential HTML content within the statement
                if hasattr(statement, 'html'):
                    html_string = statement.html
                    tables = extract_and_clean_tables(html_string)

                    for i, df_html in enumerate(tables):
                        if df_html is not None and not df_html.empty:
                            sheet_name = f'{statement.name[:30]}_html_{i}'
                            df_html.to_excel(writer, sheet_name=sheet_name, index=False)
                            worksheet_html = writer.sheets[sheet_name]
                            for col_num, col in enumerate(df_html.columns):
                                worksheet_html.set_column(col_num, col_num, None, number_format)

                # Print the DataFrame in IPython
                print(f"Statement: {statement.name}")
                display(df)

Example usage:

if name == 'main': filing = Company("AAPL").get_filings(form="10-K").latest(1) xbrl = filing.xbrl() write_statements(xbrl)

dgunning commented 3 months ago

It is hard to code this in a standard way so you will have to treat this a s one-off for Apple.

For example the name is "EarningsPerShareTables" for Apple but it will be different name for other companies. Also I found the location of the html inside the statement at nested_table = earnings_per_share_df.iloc[1,1]

This code tries two ways to extract the html .. neither of which is perfect. This is exected - HTML is a display technology not a data container.

from io import StringIO
import pandas as pd
from edgar.documents import HtmlDocument

# Earnings per share
earnings_per_share = xbrl.statements.get('EarningsPerShareTables')
earnings_per_share_df = earnings_per_share.get_dataframe().reset_index()
earnings_per_share_df

# Nested table 
nested_table = earnings_per_share_df.iloc[1,1]

def get_tables(text:str):
    # Use pandas read_html
    return pd.read_html(StringIO(text))

def get_tables2(text):
    # Use HtmlDocument
    doc = HtmlDocument.from_html(text)
    return [
        block.to_dataframe()
        for block in
        doc.get_table_blocks()
    ]

# Use pandas read_html
get_tables(nested_table)[0]

# Use HtmlDocument
get_tables2(nested_table)[0]
david08-08 commented 3 months ago

Thank you, Dwight. After running the code you provided it is still coming over like this. Have you tried this code on your end as well? If so did it work for you? Thank you.

Screenshot 2024-09-01 at 7 35 32 PM
david08-08 commented 3 months ago

Hi Dwight, sorry for the confusion this worked. Thank you for help. Just one question why do you think SegmentInfoandGeographi is not working? I assumed any tables I would be able to pull with the code provided. Is that not true?

david08-08 commented 3 months ago

I figured this out. The code you provide worked. Thank you Dwight.