dayton-outar / shillings

An attempt at an immersive stock-trading platform

Financial Report: Data Entry Enhancements #4

Open dayton-outar opened 1 year ago

dayton-outar commented 1 year ago

To make manual entry of statement items easier for users, the following is suggested:

dayton-outar commented 1 year ago

Implementing hot keys poses a challenge in that it clashes with the browser's hot keys. Maybe we should stick with point-and-click until a later date and focus on more substantial value-added features for now.

dayton-outar commented 11 months ago

To automate entry by parsing annual and quarterly reports, we will need an OCR solution such as Asprise OCR. Such a tool should be able to read the following image and output the data in a structure that can be injected into the system.

1834-balance-sheet-march-2021

dayton-outar commented 11 months ago

More reports that the automation can be tested against.

gk-balance-sheet-2010

dayton-outar commented 11 months ago

More.

genac-balance-sheet-2021

dayton-outar commented 11 months ago

A code snippet for using Asprise OCR:

using System;
using asprise_ocr_api; // namespace per Asprise's C# SDK samples; verify against your installed version

namespace OCRDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            // One-time initialization of the OCR engine
            AspriseOCR.SetUp();

            // Create an OCR instance and start the engine
            AspriseOCR ocr = new AspriseOCR();
            ocr.StartEngine("eng", AspriseOCR.SPEED_FASTEST);

            // Run OCR with table detection enabled; the -1 arguments select the
            // whole image (argument list follows the vendor's sample; check the
            // SDK docs for the exact signature in your version)
            string ocrResult = ocr.Recognize("path_to_your_image.jpg", -1, -1, -1, -1, -1,
                AspriseOCR.RECOGNIZE_TYPE_ALL, AspriseOCR.OUTPUT_FORMAT_PLAINTEXT,
                "prop.tableDetection=true|prop.tableExtractMethod=stream");

            // Output the OCR result
            Console.WriteLine(ocrResult);

            // Shut down the engine
            ocr.StopEngine();
        }
    }
}
dayton-outar commented 5 months ago

So, today I found some more options for invoice and tabular data extraction solutions that use OCR. They are as follows:

dayton-outar commented 5 months ago

The ChatPDF online application serves as an example of the feature needed to enrich the graphical display of the annual report. With prompt engineering, the line items in the financial statements can be classified and prepared for entry into the shillings database.

This issue will operate in tandem with issue #51 and thus shares an interest in an extraction process that can intelligently classify the information provided in annual reports, which come in many variations.

dayton-outar commented 4 months ago

Automating the parsing of audited financial statements to extract items from the balance sheet, profit/loss, and cash-flow statements involves several steps. Here's a structured approach:

1. Data Collection

2. Parsing the Documents

3. Data Extraction and Cleaning

4. Data Structuring

5. Automation Workflow

6. Data Storage and Retrieval

7. Reporting and Analysis

Example: End-to-End Workflow with Python

Here's a simple example combining some of the steps:

import pdfplumber
import pandas as pd

def parse_pdf(file_path):
    data = {"balance_sheet": [], "profit_loss": [], "cash_flow": []}
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            tables = page.extract_tables()
            for table in tables:
                # Process and classify table data into balance_sheet, profit_loss, or cash_flow
                pass
    return data

def main():
    file_path = "financial_statement.pdf"
    parsed_data = parse_pdf(file_path)

    # Example processing for balance_sheet
    balance_sheet_df = pd.DataFrame(parsed_data['balance_sheet'], columns=['Item', 'Value'])
    profit_loss_df = pd.DataFrame(parsed_data['profit_loss'], columns=['Item', 'Value'])
    cash_flow_df = pd.DataFrame(parsed_data['cash_flow'], columns=['Item', 'Value'])

    # Further processing, storage, and analysis
    print(balance_sheet_df)
    print(profit_loss_df)
    print(cash_flow_df)

if __name__ == "__main__":
    main()
dayton-outar commented 4 months ago

Handling different layouts of financial statements requires a flexible and adaptable approach in your automation. The strategy involves the following steps:

  1. Identify Common Patterns and Variations: Analyze the different formats and identify common patterns, such as key section headers, common terms, and typical data structures.

  2. Create Template-Based Parsing Rules: Develop a set of parsing rules or templates for each identified layout. Use these templates to guide the extraction process.

  3. Implement Layout Detection: Build a mechanism to detect which layout a particular document follows. This can be based on specific keywords, section headers, or structural cues.

  4. Modularize Parsing Logic: Structure your code in a modular way to handle different parsing rules based on the detected layout.

Here’s a step-by-step guide to building such an automation system:

Step-by-Step Guide

1. Analyze and Define Layouts

2. Template-Based Parsing Rules

3. Layout Detection Mechanism

4. Modular Parsing Logic

Example Implementation

a. Layout Detection

import re

def detect_layout(text):
    if "Balance Sheet" in text and "Profit and Loss" in text:
        return "layout_1"
    elif "Statement of Financial Position" in text and "Statement of Comprehensive Income" in text:
        return "layout_2"
    # Add more conditions for different layouts
    else:
        return "unknown"

b. Template-Based Parsing Functions

def parse_layout_1(text):
    balance_sheet = extract_section(text, "Balance Sheet", "Profit and Loss")
    profit_loss = extract_section(text, "Profit and Loss", "Cash Flow Statement")
    cash_flow = extract_section(text, "Cash Flow Statement", "End of Document")
    return balance_sheet, profit_loss, cash_flow

def parse_layout_2(text):
    balance_sheet = extract_section(text, "Statement of Financial Position", "Statement of Comprehensive Income")
    profit_loss = extract_section(text, "Statement of Comprehensive Income", "Cash Flow Statement")
    cash_flow = extract_section(text, "Cash Flow Statement", "End of Document")
    return balance_sheet, profit_loss, cash_flow

def extract_section(text, start_marker, end_marker):
    start_idx = text.find(start_marker)
    if start_idx == -1:
        return ""
    # Fall back to the end of the text when the end marker (e.g. the
    # "End of Document" placeholder) is not found
    end_idx = text.find(end_marker, start_idx)
    if end_idx == -1:
        end_idx = len(text)
    return text[start_idx:end_idx]

c. Modular Parsing Logic

import pandas as pd

def parse_financial_statements(text):
    layout = detect_layout(text)
    if layout == "layout_1":
        return parse_layout_1(text)
    elif layout == "layout_2":
        return parse_layout_2(text)
    else:
        raise ValueError("Unknown layout format")

def main():
    # Example OCR text extraction from images
    ocr_text = """..."""  # OCR output text here

    balance_sheet, profit_loss, cash_flow = parse_financial_statements(ocr_text)

    # Convert to DataFrame for further processing. Note: each section is raw
    # text at this point; split it into (Item, Value) rows before building
    # real DataFrames.
    balance_sheet_df = pd.DataFrame(balance_sheet, columns=['Item', 'Value'])
    profit_loss_df = pd.DataFrame(profit_loss, columns=['Item', 'Value'])
    cash_flow_df = pd.DataFrame(cash_flow, columns=['Item', 'Value'])

    print(balance_sheet_df)
    print(profit_loss_df)
    print(cash_flow_df)

if __name__ == "__main__":
    main()

Additional Considerations

  1. Preprocessing for OCR: Ensure images are preprocessed for better OCR accuracy (e.g., binarization, noise reduction); see the preprocessing sketch at the end of this comment.

  2. Regular Expressions for Flexibility: Use regular expressions to capture variations in section headers and item formats; see the sketch right after this list.

  3. Machine Learning for Layout Detection: Consider using machine learning models to classify document layouts based on features extracted from the text.

  4. Continuous Improvement: As new formats are encountered, update the layout detection and parsing logic.

  5. Error Handling: Implement robust error handling to manage unknown formats or parsing errors gracefully.

  6. Documentation and Testing: Document different layouts and thoroughly test the parsing functions with various samples to ensure accuracy.
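
As an illustration of point 2, here is a minimal sketch of flexible header matching; the header variants are assumptions drawn from the statements sampled earlier in this thread:

import re

# Header variants seen in the sampled statements (assumed; extend as needed)
SECTION_PATTERNS = {
    "balance_sheet": re.compile(r"balance\s+sheet|statement\s+of\s+financial\s+position", re.I),
    "profit_loss": re.compile(r"profit\s+and\s+loss|income\s+statement|statement\s+of\s+comprehensive\s+income", re.I),
    "cash_flow": re.compile(r"(statement\s+of\s+)?cash\s*flows?", re.I),
}

def classify_header(line):
    for section, pattern in SECTION_PATTERNS.items():
        if pattern.search(line):
            return section
    return None

print(classify_header("Consolidated Statement of Financial Position"))  # balance_sheet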

By following this structured approach, you can build a flexible automation system capable of handling various layouts of financial statements.
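
For consideration 1 above, a minimal preprocessing sketch using Pillow; the threshold value is an assumption to be tuned per document:

from PIL import Image, ImageFilter

def preprocess_for_ocr(path, threshold=160):
    # Grayscale, light denoise, then binarize; the threshold is a guess to tune
    img = Image.open(path).convert("L")
    img = img.filter(ImageFilter.MedianFilter(3))
    return img.point(lambda p: 255 if p > threshold else 0)

preprocess_for_ocr("scanned-page.png").save("scanned-page-clean.png")  # hypothetical files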

dayton-outar commented 4 months ago

So, balance sheets or statements of financial position (and income-expense and cash-flow statements) consist of a list of descriptions of each aggregated item and the totals associated with them.

An LLM will be required to read the extracted data and classify each line item into Sections and Assays. So, each line item, except computed lines like Total Assets and Total Liabilities, will have to fall into one of the sections shown below,

A line item can be associated with more than one Assay. For example, a line item that describes expenses, such as Direct and Operating Expenses, can be associated with both the Direct and Operating assays. This is achieved in the API using bit flags.
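
A minimal sketch of the bit-flag idea in Python (the real flag names and values live in the shillings API; these are assumptions for illustration):

from enum import Flag, auto

# Hypothetical subset of the Assay classes; the real names live in the shillings API
class Assay(Flag):
    Operating = auto()
    Direct = auto()
    Interest = auto()
    Tax = auto()

# "Direct and operating expenses" carries two flags in one integer value
expenses = Assay.Direct | Assay.Operating
print(Assay.Operating in expenses)  # True
print(expenses.value)               # combined bit mask (3 here)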

dayton-outar commented 1 month ago

So, major developments happened during my break and I was inspired by How to use ChatGPT to analyze PDFs for free to try out a prompt on ChatGPT pertaining to this particular enhancement.

I decided to use a quarterly report from GraceKennedy. This document had no images in it. See below a snapshot of one page of the 14-page document.

Screenshot from 2024-08-28 06-15-26

I uploaded the 14-page document and added the following prompt to associate with the document,

Give me a JSON string of the income statement that does not include aggregated items except earnings per stock unit (or eps or earnings per share) and categorize each item in the statement as an array item that consist of the following properties,

  1. The description of the line item
  2. The section within the statement that can be either: Income-Expense, Earnings per Stock
  3. Categorize the line item in all appropriate item classes from this list into an array property called analyte: Operating, Direct, Depreciation, Impairment, Interest, Other, Investment, Tax, Shareholders, NonControlling, Basic, Diluted, Cash, Inventories, Receivables, Payables, Current, Fixed, Loan, Intangible, Capital, Retained, Reserves, Dividends, Shares, Treasury, Wages, ExchangeRateChanges, Adjustments, Note.
  4. The amount associated, in the second column only, with that line item

The results were delightful.

This is what I got.

" Here is the JSON string representing the income statement with each item categorized as requested:

{
  "income_statement": [
    {
      "description": "Revenue from products and services",
      "section": "Income-Expense",
      "analyte": ["Operating", "Direct", "Current"],
      "amount": 92355966
    },
    {
      "description": "Interest revenue",
      "section": "Income-Expense",
      "analyte": ["Investment", "Interest", "Other"],
      "amount": 3420263
    },
    {
      "description": "Direct and operating expenses",
      "section": "Income-Expense",
      "analyte": ["Operating", "Direct", "Wages", "Cash"],
      "amount": 89499744
    },
    {
      "description": "Net impairment losses on financial assets",
      "section": "Income-Expense",
      "analyte": ["Impairment", "Loan"],
      "amount": 392404
    },
    {
      "description": "Other income",
      "section": "Income-Expense",
      "analyte": ["Operating", "Other"],
      "amount": 2712404
    },
    {
      "description": "Interest income – non-financial services",
      "section": "Income-Expense",
      "analyte": ["Interest", "Other"],
      "amount": 433576
    },
    {
      "description": "Interest expense – non-financial services",
      "section": "Income-Expense",
      "analyte": ["Interest", "Loan", "Payables"],
      "amount": 858262
    },
    {
      "description": "Share of results of associates and joint ventures",
      "section": "Income-Expense",
      "analyte": ["Investment", "Shareholders"],
      "amount": 368294
    },
    {
      "description": "Taxation",
      "section": "Income-Expense",
      "analyte": ["Tax", "Cash"],
      "amount": 2305826
    },
    {
      "description": "Net Profit for the period",
      "section": "Income-Expense",
      "analyte": ["Operating", "Retained", "Shareholders"],
      "amount": 6234267
    },
    {
      "description": "Profit attributable to Owners of GraceKennedy Limited",
      "section": "Income-Expense",
      "analyte": ["Operating", "Retained", "Shareholders"],
      "amount": 5677684
    },
    {
      "description": "Profit attributable to Non-controlling interests",
      "section": "Income-Expense",
      "analyte": ["NonControlling", "Operating"],
      "amount": 556583
    },
    {
      "description": "Earnings per Stock Unit - Basic",
      "section": "Earnings per Stock",
      "analyte": ["Basic", "Shares", "Note"],
      "amount": 5.73
    },
    {
      "description": "Earnings per Stock Unit - Diluted",
      "section": "Earnings per Stock",
      "analyte": ["Diluted", "Shares", "Note"],
      "amount": 5.69
    }
  ]
}

This JSON structure includes each line item from the income statement, categorized by its section, analytes, and associated amounts. "

This saves a lot of time and opens up possibilities for much more efficient data entry.

The big question is now: how big a file can GPT handle?

dayton-outar commented 1 month ago

After realizing that GPT was recognizing the "first" column as first within its category and not first overall, I corrected my prompt as follows,

Excellent. I just realized that I want this for the first column of amounts

"Here is the JSON string with the amounts from the first column:

{
  "income_statement": [
    {
      "description": "Revenue from products and services",
      "section": "Income-Expense",
      "analyte": ["Operating", "Direct", "Current"],
      "amount": 31275213
    },
    {
      "description": "Interest revenue",
      "section": "Income-Expense",
      "analyte": ["Investment", "Interest", "Other"],
      "amount": 1149675
    },
    {
      "description": "Direct and operating expenses",
      "section": "Income-Expense",
      "analyte": ["Operating", "Direct", "Wages", "Cash"],
      "amount": 30248373
    },
    {
      "description": "Net impairment losses on financial assets",
      "section": "Income-Expense",
      "analyte": ["Impairment", "Loan"],
      "amount": 162481
    },
    {
      "description": "Other income",
      "section": "Income-Expense",
      "analyte": ["Operating", "Other"],
      "amount": 1185127
    },
    {
      "description": "Interest income – non-financial services",
      "section": "Income-Expense",
      "analyte": ["Interest", "Other"],
      "amount": 157484
    },
    {
      "description": "Interest expense – non-financial services",
      "section": "Income-Expense",
      "analyte": ["Interest", "Loan", "Payables"],
      "amount": 300416
    },
    {
      "description": "Share of results of associates and joint ventures",
      "section": "Income-Expense",
      "analyte": ["Investment", "Shareholders"],
      "amount": 38715
    },
    {
      "description": "Taxation",
      "section": "Income-Expense",
      "analyte": ["Tax", "Cash"],
      "amount": 835635
    },
    {
      "description": "Net Profit for the period",
      "section": "Income-Expense",
      "analyte": ["Operating", "Retained", "Shareholders"],
      "amount": 2259309
    },
    {
      "description": "Profit attributable to Owners of GraceKennedy Limited",
      "section": "Income-Expense",
      "analyte": ["Operating", "Retained", "Shareholders"],
      "amount": 2090387
    },
    {
      "description": "Profit attributable to Non-controlling interests",
      "section": "Income-Expense",
      "analyte": ["NonControlling", "Operating"],
      "amount": 168922
    },
    {
      "description": "Earnings per Stock Unit - Basic",
      "section": "Earnings per Stock",
      "analyte": ["Basic", "Shares", "Note"],
      "amount": 2.11
    },
    {
      "description": "Earnings per Stock Unit - Diluted",
      "section": "Earnings per Stock",
      "analyte": ["Diluted", "Shares", "Note"],
      "amount": 2.09
    }
  ]
}

This JSON structure now includes the amounts from the first column of the income statement. "

dayton-outar commented 1 month ago

I find that uploading a 9.7 MB file takes less than a minute and reading it takes about 15 seconds.

However, when tested with a document that contains images of the report, GPT failed to read the report accurately and produced inaccurate results.

dayton-outar commented 1 month ago

So, it took GPT 2 minutes to upload the 2016 NCB Annual Report, which at 49 MB is a relatively big file. This file seemed to have some layer of protection over the audited financial statements, so GPT was able to read some but not all of the line items in the income statement. Some of the figures were accurate but some were inaccurate (you could see it reading the first part of a figure but not the entire figure).

This is not bad, but to save time all the information needs to be read accurately. So, it looks like whenever the textual information can be accessed in the PDF document, GPT will read it accurately. However, if the statement is in an image or protected, then GPT will deliver inaccurate and incomplete information.
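
One way to anticipate this is to pre-check which pages expose a text layer before uploading. A minimal sketch using pdfplumber, which appears earlier in this thread (the file name is hypothetical):

import pdfplumber

def pages_without_text(path):
    """Return the 1-based numbers of pages that yield no extractable text."""
    with pdfplumber.open(path) as pdf:
        return [i for i, page in enumerate(pdf.pages, start=1)
                if not (page.extract_text() or "").strip()]

print(pages_without_text("annual-report.pdf"))  # hypothetical file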

dayton-outar commented 1 month ago

So, I attempted to use GPT to read the 2020 NCB Annual Report, which was 14.5 MB.

This was more promising. The financial statements were not protected and were not placed in an image in the document. See below the results from the same GPT prompt used for income statements,


Based on the provided NCB Financial Group Limited Annual Report 2020, I have extracted the relevant data from the income statement and structured it in the requested JSON format:

{
  "IncomeStatement": [
    {
      "description": "Interest income",
      "section": "Income-Expense",
      "analyte": ["Operating", "Investment"],
      "amount": 74421878
    },
    {
      "description": "Interest expense",
      "section": "Income-Expense",
      "analyte": ["Operating", "Interest"],
      "amount": -21932169
    },
    {
      "description": "Net interest income",
      "section": "Income-Expense",
      "analyte": ["Operating", "Investment"],
      "amount": 52489709
    },
    {
      "description": "Fee and commission income",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": 25925325
    },
    {
      "description": "Fee and commission expense",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": -4555918
    },
    {
      "description": "Net fee and commission income",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": 21369407
    },
    {
      "description": "Gain on foreign currency and investment activities",
      "section": "Income-Expense",
      "analyte": ["Operating", "Investment"],
      "amount": 8793286
    },
    {
      "description": "Credit impairment losses",
      "section": "Income-Expense",
      "analyte": ["Operating", "Impairment"],
      "amount": -10284994
    },
    {
      "description": "Dividend income",
      "section": "Income-Expense",
      "analyte": ["Operating", "Investment"],
      "amount": 1901300
    },
    {
      "description": "Other operating income",
      "section": "Income-Expense",
      "analyte": ["Operating", "Other"],
      "amount": 2102190
    },
    {
      "description": "Net result from banking and investment activities",
      "section": "Income-Expense",
      "analyte": ["Operating", "Investment"],
      "amount": 76370898
    },
    {
      "description": "Premium income",
      "section": "Income-Expense",
      "analyte": ["Operating", "Investment"],
      "amount": 135202001
    },
    {
      "description": "Insurance premium ceded to reinsurers",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": -42003606
    },
    {
      "description": "Reinsurance commission income",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": 8470579
    },
    {
      "description": "Net underwriting income",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": 101668974
    },
    {
      "description": "Gross policyholders’ and annuitants’ benefits and reserves",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": -63223937
    },
    {
      "description": "Reinsurance on policyholders’ and annuitants’ benefits and reserves",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": 7016737
    },
    {
      "description": "Commission and other selling expenses",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": -13005783
    },
    {
      "description": "Net result from insurance activities",
      "section": "Income-Expense",
      "analyte": ["Operating", "Investment"],
      "amount": 32455991
    },
    {
      "description": "Net operating income",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": 108826889
    },
    {
      "description": "Staff costs",
      "section": "Income-Expense",
      "analyte": ["Operating", "Wages"],
      "amount": 40526668
    },
    {
      "description": "Depreciation and amortisation",
      "section": "Income-Expense",
      "analyte": ["Operating", "Depreciation"],
      "amount": 8529471
    },
    {
      "description": "Finance cost",
      "section": "Income-Expense",
      "analyte": ["Operating", "Interest"],
      "amount": 1411727
    },
    {
      "description": "Other operating expenses",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": 31097938
    },
    {
      "description": "Operating Profit",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": 27261085
    },
    {
      "description": "Taxation",
      "section": "Income-Expense",
      "analyte": ["Tax"],
      "amount": -690064
    },
    {
      "description": "NET PROFIT",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": 26883412
    },
    {
      "description": "Earnings per stock unit",
      "section": "Earnings per Stock",
      "analyte": ["Shares"],
      "amount": 8.01
    }
  ]
}

This JSON structure adheres to the format you requested, categorizing each line item based on its description, section, analyte classification, and the associated amount.

dayton-outar commented 1 month ago

If you want to avoid manual editing of text or tables and retain the format, you can follow this more automated approach:

1. Use an Advanced PDF Tool with OCR and Format Retention

Certain PDF tools are designed to extract content and retain the exact formatting automatically. These tools can convert the scanned images to text (via OCR) and preserve the layout (tables, images, fonts, etc.).

Here are a few options:

Adobe Acrobat Pro (Automated Process)

2. Batch OCR Automation with Scripting (if you want to code it)

If you'd prefer an automated, programmatic approach, you can use scripting to process the PDF and maintain its format:
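
A minimal sketch of this idea in Python, assuming ocrmypdf is installed and on the PATH (directory names are hypothetical):

import subprocess
from pathlib import Path

def batch_ocr(input_dir, output_dir):
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        # --skip-text leaves pages that already have a text layer untouched
        subprocess.run(["ocrmypdf", "--skip-text", str(pdf), str(out / pdf.name)],
                       check=True)

batch_ocr("reports/raw", "reports/ocr")  # hypothetical directories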

Would you like help setting up one of these tools or methods?

dayton-outar commented 1 month ago

Here's a Node.js script that reads the CSV file, processes each row, and generates SQL INSERT commands for a table called [stocks].[dbo].[StockIndices]. You'll need the csv-parser package to read the CSV file.

Steps:

  1. Install the csv-parser package:

    npm install csv-parser
  2. Here’s the script:

const fs = require('fs');
const csv = require('csv-parser');

// Path to your CSV file
const filePath = 'path/to/your/file.csv';

const insertSQL = [];

fs.createReadStream(filePath)
  .pipe(csv())
  .on('data', (row) => {
    const { Date, MarketIndexNo, Value } = row;

    // Create SQL INSERT statement for each row
    const sql = `
      INSERT INTO [stocks].[dbo].[StockIndices] (Date, MarketIndexNo, Value)
      VALUES ('${Date}', ${MarketIndexNo}, ${Value});
    `;

    insertSQL.push(sql);
  })
  .on('end', () => {
    console.log('CSV file processed. SQL commands:');
    insertSQL.forEach(sql => console.log(sql));

    // Optionally write the SQL commands to a file
    fs.writeFileSync('insert_commands.sql', insertSQL.join('\n'));
  })
  .on('error', (err) => {
    console.error('Error reading CSV file:', err);
  });

How it works:

  • fs.createReadStream streams the CSV file and pipes it through csv-parser.
  • Each data event supplies one parsed row, from which an INSERT statement is built.
  • On end, all statements are printed and optionally written to insert_commands.sql.

Example output:

If your CSV file has the following rows:

Date,MarketIndexNo,Value
2024-09-07,1,100.5
2024-09-08,2,101.7

It will generate:

INSERT INTO [stocks].[dbo].[StockIndices] (Date, MarketIndexNo, Value)
VALUES ('2024-09-07', 1, 100.5);

INSERT INTO [stocks].[dbo].[StockIndices] (Date, MarketIndexNo, Value)
VALUES ('2024-09-08', 2, 101.7);

Notes:

  • Values are interpolated directly into the SQL, so escape any single quotes in text fields (or switch to parameterized inserts) before running the output against the database.
  • Date is emitted as a quoted string; make sure the CSV dates are in a format SQL Server accepts (e.g., YYYY-MM-DD).

dayton-outar commented 1 month ago

A Common Table Expression (CTE) SELECT query to get the ValueChange value from newly inserted market index records,

WITH cte AS (
    SELECT 
        i.[No],
        i.[Date],
        i.[MarketIndexNo],
        i.[Value],
        LAG(i.[Value]) OVER (PARTITION BY i.[MarketIndexNo] ORDER BY i.[Date]) AS PrevValue
    FROM [stocks].[dbo].[StockIndices] i
)
SELECT 
    cte.[Date],
    cte.[MarketIndexNo],
    cte.[Value],
    cte.[Value] - cte.[PrevValue] AS ValueChange,
    si.[MarketIndexNo],
    mi.[Name]
FROM [cte]
    INNER JOIN [stocks].[dbo].[StockIndices] si ON si.[No] = cte.[No]
    INNER JOIN [stocks].[dbo].[MarketIndices] mi ON mi.[No] = si.[MarketIndexNo]
WHERE cte.Date = si.Date
    AND cte.MarketIndexNo = si.MarketIndexNo
    AND si.[LogNo] = 176317;
dayton-outar commented 1 month ago

Here's the update query that was eventually used,

WITH cte AS (
    SELECT 
        i.[No],
        i.[Date],
        i.[MarketIndexNo],
        i.[Value],
        LAG(i.[Value]) OVER (PARTITION BY i.[MarketIndexNo] ORDER BY i.[Date]) AS PrevValue
    FROM [stocks].[dbo].[StockIndices] i
)
UPDATE si
SET [ValueChange] = ISNULL(cte.[Value] - cte.PrevValue, 0)
FROM [cte]
    INNER JOIN [stocks].[dbo].[StockIndices] si ON si.[No] = cte.[No]
    INNER JOIN [stocks].[dbo].[MarketIndices] mi ON mi.[No] = si.[MarketIndexNo]
WHERE cte.[Date] = si.[Date]
    AND cte.MarketIndexNo = si.MarketIndexNo
    AND si.[LogNo] = 176317;
dayton-outar commented 1 month ago

A query to review the data,

SELECT  i.[Value]
      , i.[ValueChange]
      , i.[Date]
      , l.[Logged]
      , i.[MarketIndexNo]
      , m.[Name]
  FROM [stocks].[dbo].[StockIndices] i
    INNER JOIN [stocks].[dbo].[Logs] l ON l.[No] = i.[LogNo]
    INNER JOIN [stocks].[dbo].[MarketIndices] m ON m.[No] = i.[MarketIndexNo]
WHERE i.[LogNo] = 176317
ORDER BY l.[Logged] DESC
dayton-outar commented 1 month ago

Dividend information is tricky in that many companies that were de-listed no longer have their corporate actions available on the Jamaica Stock Exchange.

dayton-outar commented 1 month ago

So ....

I printed a page of the scanned document of the 2016 NCB Annual Report, in particular page 153, which has the "Consolidated Income Statement". I then used ocrmypdf to convert the 2016 NCB Annual Report into a format that maintained the layout but was readable by the GPT (meaning I could extract the text).

NCB-Annual-Report-2016.pdf

The ocrmypdf command was as follows,

ocrmypdf --redo-ocr NCB-Annual-Report-2016.pdf NCB-Annual-2016-converted.pdf

Screenshot from 2024-09-08 06-53-33

NCB-Annual-2016-converted.pdf

The output was positive. In fact, the output from the GPT delivered accurate readings,

[
  {
    "description": "Interest income",
    "section": "Income-Expense",
    "analyte": ["Operating", "Interest"],
    "amount": 39156349
  },
  {
    "description": "Interest expense",
    "section": "Income-Expense",
    "analyte": ["Operating", "Interest"],
    "amount": -11032579
  },
  {
    "description": "Net interest income",
    "section": "Income-Expense",
    "analyte": ["Operating", "Interest"],
    "amount": 28123770
  },
  {
    "description": "Fee and commission income",
    "section": "Income-Expense",
    "analyte": ["Operating"],
    "amount": 13575872
  },
  {
    "description": "Fee and commission expense",
    "section": "Income-Expense",
    "analyte": ["Operating"],
    "amount": -2663405
  },
  {
    "description": "Net fee and commission income",
    "section": "Income-Expense",
    "analyte": ["Operating"],
    "amount": 10912467
  },
  {
    "description": "Gain on foreign currency and investment activities",
    "section": "Income-Expense",
    "analyte": ["Investment"],
    "amount": 4736122
  },
  {
    "description": "Premium income",
    "section": "Income-Expense",
    "analyte": ["Other"],
    "amount": 7991693
  },
  {
    "description": "Dividend income",
    "section": "Income-Expense",
    "analyte": ["Investment"],
    "amount": 149921
  },
  {
    "description": "Other operating income",
    "section": "Income-Expense",
    "analyte": ["Other"],
    "amount": 143736
  },
  {
    "description": "Staff costs",
    "section": "Income-Expense",
    "analyte": ["Wages"],
    "amount": 13809023
  },
  {
    "description": "Provision for credit losses",
    "section": "Income-Expense",
    "analyte": ["Other"],
    "amount": 612355
  },
  {
    "description": "Policyholders’ and annuitants’ benefits and reserves",
    "section": "Income-Expense",
    "analyte": ["Reserves"],
    "amount": 4292643
  },
  {
    "description": "Depreciation and amortisation",
    "section": "Income-Expense",
    "analyte": ["Depreciation"],
    "amount": 1899414
  },
  {
    "description": "Impairment losses on securities",
    "section": "Income-Expense",
    "analyte": ["Impairment"],
    "amount": 79765
  },
  {
    "description": "Other operating expenses",
    "section": "Income-Expense",
    "analyte": ["Operating", "Other"],
    "amount": 13348202
  },
  {
    "description": "Share of profit of associates",
    "section": "Income-Expense",
    "analyte": ["Other"],
    "amount": 832480
  },
  {
    "description": "Loss on dilution of investment in associate",
    "section": "Income-Expense",
    "analyte": ["Investment"],
    "amount": -50748
  },
  {
    "description": "Taxation",
    "section": "Income-Expense",
    "analyte": ["Tax"],
    "amount": -4479992
  },
  {
    "description": "Net profit attributable to stockholders of the Bank",
    "section": "Income-Expense",
    "analyte": ["Shareholders"],
    "amount": 15636446
  },
  {
    "description": "Net profit attributable to non-controlling interest",
    "section": "Income-Expense",
    "analyte": ["NonControlling"],
    "amount": -1187886
  },
  {
    "description": "Basic earnings per stock unit",
    "section": "Earnings per Stock",
    "analyte": ["Basic", "Shares"],
    "amount": 6.35
  },
  {
    "description": "Diluted earnings per stock unit",
    "section": "Earnings per Stock",
    "analyte": ["Diluted", "Shares"],
    "amount": 5.87
  }
]

Reference

  1. OCRmyPDF
  2. OCRmyPDF Documentation
dayton-outar commented 1 month ago

Yes, you can use pdfimages, a command-line tool available in Ubuntu, to check for pages that contain scanned images in a PDF. This tool is part of the poppler-utils package and is useful for extracting images from PDF files, but you can also use it to identify which pages contain raster (scanned) images.

Steps to Identify Pages with Scanned Images:

  1. Install pdfimages (if not already installed):

    sudo apt update
    sudo apt install poppler-utils
  2. List images in the PDF: Run pdfimages -list on your PDF file to list all the images and the pages that contain them. For example:

    pdfimages -list input.pdf

    This command lists all the images in the PDF, along with their respective page numbers. The output includes:

    • The page number where the image is located.
    • The resolution and size of the images.

Example Output:

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1654  2339  rgb     3   8  image  no         9  0   300   300  251K  4.1%
   2     1 image    1654  2339  rgb     3   8  image  no        14  0   300   300  245K  4.0%

How to Identify Scanned Pages:

A page that consists of a single full-page raster image (width and height matching the page at around 300 ppi, as in the output above) is almost certainly a scan; pages with no image entries, or only small images, contain native text or graphics.

Example Command for Checking Scanned Pages:

pdfimages -list input.pdf | grep -E "^ *[0-9]+ +[0-9]+ +image"

This will list only the rows where raster images are found, along with their page numbers (the pattern allows for the image-number column before matching the type column).

Additional Tools for PDF Analysis:

If you need more sophisticated analysis (e.g., identifying whether a page is purely scanned without text layers), you can combine this with pdftotext (also in poppler-utils) to check whether a given page yields any extractable text.

Conclusion:

pdfimages is well suited to identifying which pages in a PDF contain scanned images, since it lists every image along with its page number.

dayton-outar commented 1 month ago

You can capture the output from the pdfimages command and transform it into a JSON format using shell tools like awk or sed, along with a simple script. Here’s a method to do this in Ubuntu.

1. Extracting PDF Image Information and Converting to JSON

First, you can run the pdfimages command and parse its output, converting it into JSON format. Here’s a simple bash script to achieve that.

Step-by-Step Approach:

  1. Run the pdfimages -list command to get the output.

    pdfimages -list input.pdf
  2. Use Bash with Awk or Sed to format the output into JSON.

Here’s an example bash script that converts the pdfimages -list output into JSON:

#!/bin/bash

# Get the list of images from the PDF
pdfimages -list "$1" | awk '
BEGIN {
    # Initialize the JSON array
    print "["
    first = 1
}
NR > 2 {
    # Skip the header lines
    if (first == 0) {
        print ","
    }
    first = 0

    # Print each entry in JSON format
    printf "  {\n"
    printf "    \"page\": %d,\n", $1
    printf "    \"num\": %d,\n", $2
    printf "    \"type\": \"%s\",\n", $3
    printf "    \"width\": %d,\n", $4
    printf "    \"height\": %d,\n", $5
    printf "    \"color\": \"%s\",\n", $6
    printf "    \"comp\": %d,\n", $7
    printf "    \"bpc\": %d,\n", $8
    printf "    \"enc\": \"%s\",\n", $9
    printf "    \"interp\": \"%s\",\n", $10
    printf "    \"object_id\": \"%d %d\",\n", $11, $12
    printf "    \"x_ppi\": %d,\n", $13
    printf "    \"y_ppi\": %d,\n", $14
    printf "    \"size\": \"%s\",\n", $15
    printf "    \"ratio\": \"%s\"\n", $16
    printf "  }"
}
END {
    # End the JSON array
    print "\n]"
}'

3. Run the Script (assuming it is saved as, say, pdfimages_to_json.sh and made executable):

    ./pdfimages_to_json.sh input.pdf

This script will output the pdfimages results in JSON format, and the result will look like this:

Example Output JSON:

[
  {
    "page": 1,
    "num": 0,
    "type": "image",
    "width": 1654,
    "height": 2339,
    "color": "rgb",
    "comp": 3,
    "bpc": 8,
    "enc": "image",
    "interp": "no",
    "object_id": "9 0",
    "x_ppi": 300,
    "y_ppi": 300,
    "size": "251K",
    "ratio": "4.1%"
  },
  {
    "page": 2,
    "num": 1,
    "type": "image",
    "width": 1654,
    "height": 2339,
    "color": "rgb",
    "comp": 3,
    "bpc": 8,
    "enc": "image",
    "interp": "no",
    "object_id": "14 0",
    "x_ppi": 300,
    "y_ppi": 300,
    "size": "245K",
    "ratio": "4.0%"
  }
]

What the Script Does:

  • Skips the two header lines of the pdfimages -list output (NR > 2).
  • Maps the 16 whitespace-separated fields of each row to named JSON properties.
  • Prints the rows as a JSON array, inserting commas between entries.

You can now use this JSON for further processing or analysis! Let me know if you need any additional modifications to the script.

dayton-outar commented 1 month ago

To print specific pages from a PDF document, you can use several tools depending on your needs and the platform you're on:

  1. Adobe Acrobat Reader:

    • Open the PDF in Adobe Acrobat Reader.
    • Go to File > Print.
    • In the Print dialog, enter the page range you want to print in the Pages field (e.g., "5-10" for pages 5 to 10).
  2. PDF Software (e.g., Foxit Reader, PDF-XChange Editor):

    • Similar to Adobe Acrobat Reader, open the PDF in your chosen software.
    • Navigate to the print settings.
    • Specify the pages you want to print in the designated field.
  3. Online PDF Tools:

    • Websites like Smallpdf or PDF2Go allow you to select and print specific pages from a PDF.
  4. Command Line Tools:

    • pdftk: A command-line tool for PDF manipulation. Example command to extract pages 5 to 10 into a new PDF for printing:
      pdftk input.pdf cat 5-10 output output.pdf
    • qpdf: Another command-line tool for manipulating PDFs. Example command to do the same:
      qpdf input.pdf --pages . 5-10 -- output.pdf
  5. Preview (macOS):

    • Open the PDF in Preview.
    • Go to File > Print.
    • In the Print dialog, enter the range of pages you want to print in the Pages field.

These tools should cover most of your needs for printing specific pages from a PDF.

dayton-outar commented 1 month ago

So ...

ocrmypdf took 30 minutes to redo the OCR conversion of a 49.3 MB file containing 302 pages (i.e. the 2016 NCB Annual Report). It stressed my computer, but it accurately generated an overlay of text for all pages and produced a 90.1 MB file mimicking the input document. Excellent!

Now, the JSE lists about 200 companies. This means about 200 annual reports per year, not to mention 3 quarterly reports for each company per year, roughly 800 filings in all. It's worth noting that the document tested was the biggest file NCB has ever produced as an annual report, so for the tool to produce such an excellent result in 30 minutes is good. This will be a one-time operation that dumps information into the structured database.

So, now that we can extract information from the PDF documents, there still needs to be a more intelligent and optimal way to identify and extract only what is needed. Hence the need for tools that can read and split pages out of the original document. This is the reason for looking at qpdf.

dayton-outar commented 1 month ago

So ...

The following command took milliseconds to complete,

qpdf NCB-Annual-Report-2016.pdf --pages . 138-165 -- output.pdf

28 pages were extracted into the output document, output.pdf

dayton-outar commented 1 month ago

To determine which pages of a PDF contain scanned images using pdf2image, you can follow these general steps. Note that pdf2image is a Python library (a wrapper around poppler's pdftoppm), not a standalone command:

  1. Convert PDF Pages to Images: First, convert the pages of the PDF to image files using pdf2image. This will give you images of each page, which you can then analyze to check for scanned content.

    from pdf2image import convert_from_path
    pages = convert_from_path("input.pdf", dpi=300, output_folder="output_directory", fmt="png")

    This converts each page of input.pdf into an image and saves the files in the specified output_directory.

  2. Analyze Images for Scanned Content: To check if the images contain text, you would use Optical Character Recognition (OCR) tools like Tesseract to recognize any text in the images. Here's an example workflow using Tesseract:

    • Install Tesseract (if not already installed):

      sudo apt-get install tesseract-ocr
    • Run Tesseract on Each Image: You can loop through the images and use Tesseract to see if it detects any text. For example, you could use a script to process each image file:

      for img in output_directory/*.png; do
          text=$(tesseract "$img" stdout 2>/dev/null)
          if [ -z "$text" ]; then
              echo "No text found in $img"
          else
              echo "Text found in $img"
          fi
      done

      This script runs Tesseract on each image file and tests whether any text was recognized. (Tesseract exits successfully even when it finds nothing, so the recognized output itself is checked rather than the exit code.) If text is found, the page likely contains scanned text.

Note: There isn't a direct pdf2image command to locate scanned images; the process involves converting the PDF to images and then analyzing those images. If you need more advanced analysis, combining pdf2image with OCR tools like Tesseract will help you determine which pages contain scanned text.
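
A compact sketch of that combination, assuming the pdf2image and pytesseract packages (and the tesseract-ocr binary) are installed:

from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("input.pdf", dpi=300)
for i, page in enumerate(pages, start=1):
    text = pytesseract.image_to_string(page).strip()
    print(f"page {i}: {'text found' if text else 'no text'}")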

dayton-outar commented 1 month ago

Sigh ...

Very challenging documents to extract information from are going to be the 2016 and 2019 Proven Annual Reports. The challenge comes from a folding-page concept that fits two pages of information into one PDF page with a landscape print layout.
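
One possible pre-processing step is to split each landscape page into left and right halves before OCR. A sketch using the pypdf package (an assumption, as is the file name):

import copy
from pypdf import PdfReader, PdfWriter

reader = PdfReader("proven-annual-report-2016.pdf")  # hypothetical file name
writer = PdfWriter()
for page in reader.pages:
    mid = float(page.mediabox.left) + float(page.mediabox.width) / 2
    left = copy.deepcopy(page)
    left.mediabox.upper_right = (mid, float(page.mediabox.top))
    right = copy.deepcopy(page)
    right.mediabox.lower_left = (mid, float(page.mediabox.bottom))
    writer.add_page(left)   # left half becomes its own page
    writer.add_page(right)  # right half follows
with open("proven-split.pdf", "wb") as f:
    writer.write(f)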

dayton-outar commented 1 month ago

To kick off a C++ project that integrates external tools like ocrmypdf, pdfimages, qpdf, and a GPT API for analyzing financial statements, you can follow these key steps. The project will involve managing PDF manipulation and OCR processes while analyzing and transforming financial data into JSON payloads. You’ll use C++ as the core language and Autotools for build automation.

Step-by-Step Guide

1. Project Setup with Autotools

Autotools automates the build process for your C++ project. Here's how to set it up:

Directory structure:

   project-root/
   ├── src/
   ├── include/
   ├── configure.ac
   ├── Makefile.am
   ├── autogen.sh
   └── README.md

Initialize Autotools Files:

2. Integrating PDF Tools (ocrmypdf, pdfimages, qpdf)

These tools will be called from C++ using the system() function or popen() to capture output. Create a utility class to handle the calls to external tools.

Example: pdf_utility.cpp

   #include <iostream>
   #include <string>    // for std::string
   #include <cstdlib>   // for system()
   #include <cstdio>    // for popen(), if output needs to be captured

   void runOcrMyPdf(const std::string &inputPdf, const std::string &outputPdf) {
       std::string command = "ocrmypdf " + inputPdf + " " + outputPdf;
       system(command.c_str());
   }

   void runPdfImages(const std::string &inputPdf, const std::string &outputDir) {
       // pdfimages -list emits a plain-text table; convert it separately
       // (e.g. with the awk script earlier in this thread) if JSON is needed
       std::string command = "pdfimages -list " + inputPdf + " > " + outputDir + "/images.txt";
       system(command.c_str());
   }

   void runQpdf(const std::string &inputPdf, const std::string &outputPdf) {
       std::string command = "qpdf --decrypt " + inputPdf + " " + outputPdf;
       system(command.c_str());
   }

3. Using the GPT API in C++

You can integrate the GPT API using a library like cURL for making HTTP requests.

Install cURL:

   sudo apt-get install libcurl4-openssl-dev

Example: gpt.cpp

   #include <iostream>
   #include <string>
   #include <curl/curl.h>
   #include <nlohmann/json.hpp>

   std::string callGptApi(const std::string &apiKey, const std::string &prompt) {
       CURL *curl;
       CURLcode res;
       std::string response;

       curl = curl_easy_init();
       if(curl) {
           struct curl_slist *headers = NULL;
           headers = curl_slist_append(headers, "Content-Type: application/json");
           headers = curl_slist_append(headers, ("Authorization: Bearer " + apiKey).c_str());

           // Chat Completions payload; chat models such as gpt-4 use the
           // messages array rather than the older "prompt" field
           nlohmann::json messages = nlohmann::json::array({
               {{"role", "user"}, {"content", prompt}}
           });
           nlohmann::json jsonPayload = {
               {"model", "gpt-4"},
               {"messages", messages},
               {"max_tokens", 1000}
           };

           std::string data = jsonPayload.dump();
           curl_easy_setopt(curl, CURLOPT_URL, "https://api.openai.com/v1/chat/completions");
           curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
           curl_easy_setopt(curl, CURLOPT_POSTFIELDS, data.c_str());

           // Capture-less lambda; the leading + converts it to the plain
           // function pointer that libcurl expects
           curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION,
               +[](char *ptr, size_t size, size_t nmemb, void *userdata) -> size_t {
                   static_cast<std::string *>(userdata)->append(ptr, size * nmemb);
                   return size * nmemb;
               });
           curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

           res = curl_easy_perform(curl);
           if(res != CURLE_OK)
               std::cerr << "curl_easy_perform() failed: " << curl_easy_strerror(res) << std::endl;

           curl_slist_free_all(headers);
           curl_easy_cleanup(curl);
       }
       return response;
   }

4. Core Functionality: Reading PDFs and Analyzing Data

Tie the utilities together: split or decrypt the input with runQpdf, add a text layer with runOcrMyPdf, extract the statement text, send it to callGptApi with the classification prompt used earlier in this thread, and parse the returned JSON into the database.

5. Building and Running the Project

After setting up the project structure and Autotools configuration, build and run the project:

   ./autogen.sh
   ./configure
   make
   ./pdf_analyzer input.pdf

6. Error Handling and Testing

7. Next Steps

This should provide you with a solid foundation to kick off the project. Let me know if you'd like additional details on any step!