dayton-outar / shillings

An attempt at an immersive stock-trading platform

Financial Report: Data Entry Enhancements #4

Open dayton-outar opened 1 year ago

dayton-outar commented 1 year ago

To make manual entry of statement items easier for users, the following is suggested:

dayton-outar commented 1 year ago

Implementing hot keys poses a challenge in that it clashes with the browser's hot keys. Maybe we should stick with point-and-click until a later date and focus on more substantial value-added features for now.

dayton-outar commented 11 months ago

To automate entry by parsing annual and quarterly reports, we will need an OCR solution such as Asprise OCR. Such a tool should be able to read the following image and output the data in a structure that can be injected into the system.

1834-balance-sheet-march-2021

dayton-outar commented 11 months ago

More reports that the automation can be tested against.

gk-balance-sheet-2010

dayton-outar commented 11 months ago

More.

genac-balance-sheet-2021

dayton-outar commented 11 months ago

A code snippet for using Asprise OCR:

using System;
using asprise_ocr_api; // namespace per Asprise's C# SDK samples; verify against your installed version

namespace OCRDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            // One-time initialization of the OCR engine
            AspriseOCR.SetUp();

            // Create an OCR instance and start the engine
            AspriseOCR ocr = new AspriseOCR();
            ocr.StartEngine("eng", AspriseOCR.SPEED_FASTEST);

            // Run OCR with table detection enabled; the -1 arguments select the
            // whole image (argument list follows the vendor's sample; check the
            // SDK docs for the exact signature in your version)
            string ocrResult = ocr.Recognize("path_to_your_image.jpg", -1, -1, -1, -1, -1,
                AspriseOCR.RECOGNIZE_TYPE_ALL, AspriseOCR.OUTPUT_FORMAT_PLAINTEXT,
                "prop.tableDetection=true|prop.tableExtractMethod=stream");

            // Output the OCR result
            Console.WriteLine(ocrResult);

            // Shut down the engine
            ocr.StopEngine();
        }
    }
}
dayton-outar commented 5 months ago

So, today I found some more options for invoice and tabular data extraction solutions that use OCR. They are as follows:

dayton-outar commented 5 months ago

The ChatPDF online application serves as an example of the feature needed to enrich the graphical display of the annual report. With prompt engineering, the line items in the financial statements can be classified and prepared for entry into the shillings database.

This issue will operate in tandem with issue #51 and thus shares an interest in an extraction process that can intelligently classify the information provided in annual reports, which come in many variations.

dayton-outar commented 4 months ago

Automating the parsing of audited financial statements to extract items from the balance sheet, profit/loss, and cash-flow statements involves several steps. Here's a structured approach:

1. Data Collection

2. Parsing the Documents

3. Data Extraction and Cleaning

4. Data Structuring

5. Automation Workflow

6. Data Storage and Retrieval

7. Reporting and Analysis

Example: End-to-End Workflow with Python

Here's a simple example combining some of the steps:

import pdfplumber
import pandas as pd

def parse_pdf(file_path):
    data = {"balance_sheet": [], "profit_loss": [], "cash_flow": []}
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            tables = page.extract_tables()
            for table in tables:
                # Process and classify table data into balance_sheet, profit_loss, or cash_flow
                pass
    return data

def main():
    file_path = "financial_statement.pdf"
    parsed_data = parse_pdf(file_path)

    # Example processing for balance_sheet
    balance_sheet_df = pd.DataFrame(parsed_data['balance_sheet'], columns=['Item', 'Value'])
    profit_loss_df = pd.DataFrame(parsed_data['profit_loss'], columns=['Item', 'Value'])
    cash_flow_df = pd.DataFrame(parsed_data['cash_flow'], columns=['Item', 'Value'])

    # Further processing, storage, and analysis
    print(balance_sheet_df)
    print(profit_loss_df)
    print(cash_flow_df)

if __name__ == "__main__":
    main()
dayton-outar commented 4 months ago

Handling different layouts of financial statements requires a flexible and adaptable approach in your automation. The strategy involves the following steps:

  1. Identify Common Patterns and Variations: Analyze the different formats and identify common patterns, such as key section headers, common terms, and typical data structures.

  2. Create Template-Based Parsing Rules: Develop a set of parsing rules or templates for each identified layout. Use these templates to guide the extraction process.

  3. Implement Layout Detection: Build a mechanism to detect which layout a particular document follows. This can be based on specific keywords, section headers, or structural cues.

  4. Modularize Parsing Logic: Structure your code in a modular way to handle different parsing rules based on the detected layout.

Here’s a step-by-step guide to building such an automation system:

Step-by-Step Guide

1. Analyze and Define Layouts

2. Template-Based Parsing Rules

3. Layout Detection Mechanism

4. Modular Parsing Logic

Example Implementation

a. Layout Detection

import re

def detect_layout(text):
    if "Balance Sheet" in text and "Profit and Loss" in text:
        return "layout_1"
    elif "Statement of Financial Position" in text and "Statement of Comprehensive Income" in text:
        return "layout_2"
    # Add more conditions for different layouts
    else:
        return "unknown"

b. Template-Based Parsing Functions

def parse_layout_1(text):
    balance_sheet = extract_section(text, "Balance Sheet", "Profit and Loss")
    profit_loss = extract_section(text, "Profit and Loss", "Cash Flow Statement")
    cash_flow = extract_section(text, "Cash Flow Statement", "End of Document")
    return balance_sheet, profit_loss, cash_flow

def parse_layout_2(text):
    balance_sheet = extract_section(text, "Statement of Financial Position", "Statement of Comprehensive Income")
    profit_loss = extract_section(text, "Statement of Comprehensive Income", "Cash Flow Statement")
    cash_flow = extract_section(text, "Cash Flow Statement", "End of Document")
    return balance_sheet, profit_loss, cash_flow

def extract_section(text, start_marker, end_marker):
    start_idx = text.find(start_marker)
    if start_idx == -1:
        return ""
    # Fall back to the end of the text when the end marker (e.g. the
    # "End of Document" placeholder) is not found
    end_idx = text.find(end_marker, start_idx)
    if end_idx == -1:
        end_idx = len(text)
    return text[start_idx:end_idx]

c. Modular Parsing Logic

import pandas as pd

def parse_financial_statements(text):
    layout = detect_layout(text)
    if layout == "layout_1":
        return parse_layout_1(text)
    elif layout == "layout_2":
        return parse_layout_2(text)
    else:
        raise ValueError("Unknown layout format")

def main():
    # Example OCR text extraction from images
    ocr_text = """..."""  # OCR output text here

    balance_sheet, profit_loss, cash_flow = parse_financial_statements(ocr_text)

    # Convert to DataFrame for further processing. Note: each section is raw
    # text at this point; split it into (Item, Value) rows before building
    # real DataFrames.
    balance_sheet_df = pd.DataFrame(balance_sheet, columns=['Item', 'Value'])
    profit_loss_df = pd.DataFrame(profit_loss, columns=['Item', 'Value'])
    cash_flow_df = pd.DataFrame(cash_flow, columns=['Item', 'Value'])

    print(balance_sheet_df)
    print(profit_loss_df)
    print(cash_flow_df)

if __name__ == "__main__":
    main()

Additional Considerations

  1. Preprocessing for OCR: Ensure images are preprocessed for better OCR accuracy (e.g., binarization, noise reduction); see the preprocessing sketch at the end of this comment.

  2. Regular Expressions for Flexibility: Use regular expressions to capture variations in section headers and item formats; see the sketch right after this list.

  3. Machine Learning for Layout Detection: Consider using machine learning models to classify document layouts based on features extracted from the text.

  4. Continuous Improvement: As new formats are encountered, update the layout detection and parsing logic.

  5. Error Handling: Implement robust error handling to manage unknown formats or parsing errors gracefully.

  6. Documentation and Testing: Document different layouts and thoroughly test the parsing functions with various samples to ensure accuracy.
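
As an illustration of point 2, here is a minimal sketch of flexible header matching; the header variants are assumptions drawn from the statements sampled earlier in this thread:

import re

# Header variants seen in the sampled statements (assumed; extend as needed)
SECTION_PATTERNS = {
    "balance_sheet": re.compile(r"balance\s+sheet|statement\s+of\s+financial\s+position", re.I),
    "profit_loss": re.compile(r"profit\s+and\s+loss|income\s+statement|statement\s+of\s+comprehensive\s+income", re.I),
    "cash_flow": re.compile(r"(statement\s+of\s+)?cash\s*flows?", re.I),
}

def classify_header(line):
    for section, pattern in SECTION_PATTERNS.items():
        if pattern.search(line):
            return section
    return None

print(classify_header("Consolidated Statement of Financial Position"))  # balance_sheet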

By following this structured approach, you can build a flexible automation system capable of handling various layouts of financial statements.
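
For consideration 1 above, a minimal preprocessing sketch using Pillow; the threshold value is an assumption to be tuned per document:

from PIL import Image, ImageFilter

def preprocess_for_ocr(path, threshold=160):
    # Grayscale, light denoise, then binarize; the threshold is a guess to tune
    img = Image.open(path).convert("L")
    img = img.filter(ImageFilter.MedianFilter(3))
    return img.point(lambda p: 255 if p > threshold else 0)

preprocess_for_ocr("scanned-page.png").save("scanned-page-clean.png")  # hypothetical files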

dayton-outar commented 4 months ago

So, balance sheets or statements of financial position (and income-expense and cash-flow statements) consist of a list of descriptions of each aggregated item and the totals associated with them.

An LLM will be required to read the extracted data and classify each line item into Sections and Assays. So, each line item, except computed lines like Total Assets and Total Liabilities, will have to fall into one of the sections shown below,

A line item can be associated with more than one Assay. For example, a line item that describes expenses, such as Direct and Operating Expenses, can be associated with both the Direct and Operating assays. This is achieved in the API using bit flags.
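
A minimal sketch of the bit-flag idea in Python (the real flag names and values live in the shillings API; these are assumptions for illustration):

from enum import Flag, auto

# Hypothetical subset of the Assay classes; the real names live in the shillings API
class Assay(Flag):
    Operating = auto()
    Direct = auto()
    Interest = auto()
    Tax = auto()

# "Direct and operating expenses" carries two flags in one integer value
expenses = Assay.Direct | Assay.Operating
print(Assay.Operating in expenses)  # True
print(expenses.value)               # combined bit mask (3 here)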

dayton-outar commented 1 month ago

So, major developments happened during my break and I was inspired by How to use ChatGPT to analyze PDFs for free to try out a prompt on ChatGPT pertaining to this particular enhancement.

I decided to use a quarterly report from GraceKennedy. This document had no images in it. See below a snapshot of one page of the 14-page document.

Screenshot from 2024-08-28 06-15-26

I uploaded the 14-page document and added the following prompt to associate with the document,

Give me a JSON string of the income statement that does not include aggregated items except earnings per stock unit (or eps or earnings per share) and categorize each item in the statement as an array item that consist of the following properties,

  1. The description of the line item
  2. The section within the statement that can be either: Income-Expense, Earnings per Stock
  3. Categorize the line item in all appropriate item classes from this list into an array property called analyte: Operating, Direct, Depreciation, Impairment, Interest, Other, Investment, Tax, Shareholders, NonControlling, Basic, Diluted, Cash, Inventories, Receivables, Payables, Current, Fixed, Loan, Intangible, Capital, Retained, Reserves, Dividends, Shares, Treasury, Wages, ExchangeRateChanges, Adjustments, Note.
  4. The amount associated, in the second column only, with that line item

The results were delightful.

This is what I got.

" Here is the JSON string representing the income statement with each item categorized as requested:

{
  "income_statement": [
    {
      "description": "Revenue from products and services",
      "section": "Income-Expense",
      "analyte": ["Operating", "Direct", "Current"],
      "amount": 92355966
    },
    {
      "description": "Interest revenue",
      "section": "Income-Expense",
      "analyte": ["Investment", "Interest", "Other"],
      "amount": 3420263
    },
    {
      "description": "Direct and operating expenses",
      "section": "Income-Expense",
      "analyte": ["Operating", "Direct", "Wages", "Cash"],
      "amount": 89499744
    },
    {
      "description": "Net impairment losses on financial assets",
      "section": "Income-Expense",
      "analyte": ["Impairment", "Loan"],
      "amount": 392404
    },
    {
      "description": "Other income",
      "section": "Income-Expense",
      "analyte": ["Operating", "Other"],
      "amount": 2712404
    },
    {
      "description": "Interest income – non-financial services",
      "section": "Income-Expense",
      "analyte": ["Interest", "Other"],
      "amount": 433576
    },
    {
      "description": "Interest expense – non-financial services",
      "section": "Income-Expense",
      "analyte": ["Interest", "Loan", "Payables"],
      "amount": 858262
    },
    {
      "description": "Share of results of associates and joint ventures",
      "section": "Income-Expense",
      "analyte": ["Investment", "Shareholders"],
      "amount": 368294
    },
    {
      "description": "Taxation",
      "section": "Income-Expense",
      "analyte": ["Tax", "Cash"],
      "amount": 2305826
    },
    {
      "description": "Net Profit for the period",
      "section": "Income-Expense",
      "analyte": ["Operating", "Retained", "Shareholders"],
      "amount": 6234267
    },
    {
      "description": "Profit attributable to Owners of GraceKennedy Limited",
      "section": "Income-Expense",
      "analyte": ["Operating", "Retained", "Shareholders"],
      "amount": 5677684
    },
    {
      "description": "Profit attributable to Non-controlling interests",
      "section": "Income-Expense",
      "analyte": ["NonControlling", "Operating"],
      "amount": 556583
    },
    {
      "description": "Earnings per Stock Unit - Basic",
      "section": "Earnings per Stock",
      "analyte": ["Basic", "Shares", "Note"],
      "amount": 5.73
    },
    {
      "description": "Earnings per Stock Unit - Diluted",
      "section": "Earnings per Stock",
      "analyte": ["Diluted", "Shares", "Note"],
      "amount": 5.69
    }
  ]
}

This JSON structure includes each line item from the income statement, categorized by its section, analytes, and associated amounts. "

This saves a lot of time and opens up possibilities for much more efficient data entry.

The big question is now: how big a file can GPT handle?

dayton-outar commented 1 month ago

After realizing that GPT was recognizing the "first" column as first within its category and not first overall, I corrected my prompt as follows,

Excellent. I just realized that I want this for the first column of amounts

"Here is the JSON string with the amounts from the first column:

{
  "income_statement": [
    {
      "description": "Revenue from products and services",
      "section": "Income-Expense",
      "analyte": ["Operating", "Direct", "Current"],
      "amount": 31275213
    },
    {
      "description": "Interest revenue",
      "section": "Income-Expense",
      "analyte": ["Investment", "Interest", "Other"],
      "amount": 1149675
    },
    {
      "description": "Direct and operating expenses",
      "section": "Income-Expense",
      "analyte": ["Operating", "Direct", "Wages", "Cash"],
      "amount": 30248373
    },
    {
      "description": "Net impairment losses on financial assets",
      "section": "Income-Expense",
      "analyte": ["Impairment", "Loan"],
      "amount": 162481
    },
    {
      "description": "Other income",
      "section": "Income-Expense",
      "analyte": ["Operating", "Other"],
      "amount": 1185127
    },
    {
      "description": "Interest income – non-financial services",
      "section": "Income-Expense",
      "analyte": ["Interest", "Other"],
      "amount": 157484
    },
    {
      "description": "Interest expense – non-financial services",
      "section": "Income-Expense",
      "analyte": ["Interest", "Loan", "Payables"],
      "amount": 300416
    },
    {
      "description": "Share of results of associates and joint ventures",
      "section": "Income-Expense",
      "analyte": ["Investment", "Shareholders"],
      "amount": 38715
    },
    {
      "description": "Taxation",
      "section": "Income-Expense",
      "analyte": ["Tax", "Cash"],
      "amount": 835635
    },
    {
      "description": "Net Profit for the period",
      "section": "Income-Expense",
      "analyte": ["Operating", "Retained", "Shareholders"],
      "amount": 2259309
    },
    {
      "description": "Profit attributable to Owners of GraceKennedy Limited",
      "section": "Income-Expense",
      "analyte": ["Operating", "Retained", "Shareholders"],
      "amount": 2090387
    },
    {
      "description": "Profit attributable to Non-controlling interests",
      "section": "Income-Expense",
      "analyte": ["NonControlling", "Operating"],
      "amount": 168922
    },
    {
      "description": "Earnings per Stock Unit - Basic",
      "section": "Earnings per Stock",
      "analyte": ["Basic", "Shares", "Note"],
      "amount": 2.11
    },
    {
      "description": "Earnings per Stock Unit - Diluted",
      "section": "Earnings per Stock",
      "analyte": ["Diluted", "Shares", "Note"],
      "amount": 2.09
    }
  ]
}

This JSON structure now includes the amounts from the first column of the income statement. "

dayton-outar commented 1 month ago

I find that uploading a 9.7 MB file takes less than a minute and reading it takes about 15 seconds.

However, when tested with a document that contains images of the report, GPT failed to read the report accurately and produced inaccurate results.

dayton-outar commented 1 month ago

So, it took GPT 2 minutes to upload the 2016 NCB Annual Report, which at 49 MB is a relatively big file. This file seemed to have some layer of protection over the audited financial statements, so GPT was able to read some but not all of the line items in the income statement. Some of the figures were accurate but some were inaccurate (you could see it reading the first part of a figure but not the entire figure).

This is not bad, but to save time all the information needs to be read accurately. So, it looks like whenever the textual information can be accessed in the PDF document, GPT will read it accurately. However, if the statement is in an image or protected, then GPT will deliver inaccurate and incomplete information.
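
One way to anticipate this is to pre-check which pages expose a text layer before uploading. A minimal sketch using pdfplumber, which appears earlier in this thread (the file name is hypothetical):

import pdfplumber

def pages_without_text(path):
    """Return the 1-based numbers of pages that yield no extractable text."""
    with pdfplumber.open(path) as pdf:
        return [i for i, page in enumerate(pdf.pages, start=1)
                if not (page.extract_text() or "").strip()]

print(pages_without_text("annual-report.pdf"))  # hypothetical file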

dayton-outar commented 1 month ago

So, I attempted to use GPT to read the 2020 NCB Annual Report, which was 14.5 MB.

This was more promising. The financial statements were not protected and were not placed in an image in the document. See below the results from the same GPT prompt used for income statements,


Based on the provided NCB Financial Group Limited Annual Report 2020, I have extracted the relevant data from the income statement and structured it in the requested JSON format:

{
  "IncomeStatement": [
    {
      "description": "Interest income",
      "section": "Income-Expense",
      "analyte": ["Operating", "Investment"],
      "amount": 74421878
    },
    {
      "description": "Interest expense",
      "section": "Income-Expense",
      "analyte": ["Operating", "Interest"],
      "amount": -21932169
    },
    {
      "description": "Net interest income",
      "section": "Income-Expense",
      "analyte": ["Operating", "Investment"],
      "amount": 52489709
    },
    {
      "description": "Fee and commission income",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": 25925325
    },
    {
      "description": "Fee and commission expense",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": -4555918
    },
    {
      "description": "Net fee and commission income",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": 21369407
    },
    {
      "description": "Gain on foreign currency and investment activities",
      "section": "Income-Expense",
      "analyte": ["Operating", "Investment"],
      "amount": 8793286
    },
    {
      "description": "Credit impairment losses",
      "section": "Income-Expense",
      "analyte": ["Operating", "Impairment"],
      "amount": -10284994
    },
    {
      "description": "Dividend income",
      "section": "Income-Expense",
      "analyte": ["Operating", "Investment"],
      "amount": 1901300
    },
    {
      "description": "Other operating income",
      "section": "Income-Expense",
      "analyte": ["Operating", "Other"],
      "amount": 2102190
    },
    {
      "description": "Net result from banking and investment activities",
      "section": "Income-Expense",
      "analyte": ["Operating", "Investment"],
      "amount": 76370898
    },
    {
      "description": "Premium income",
      "section": "Income-Expense",
      "analyte": ["Operating", "Investment"],
      "amount": 135202001
    },
    {
      "description": "Insurance premium ceded to reinsurers",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": -42003606
    },
    {
      "description": "Reinsurance commission income",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": 8470579
    },
    {
      "description": "Net underwriting income",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": 101668974
    },
    {
      "description": "Gross policyholders’ and annuitants’ benefits and reserves",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": -63223937
    },
    {
      "description": "Reinsurance on policyholders’ and annuitants’ benefits and reserves",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": 7016737
    },
    {
      "description": "Commission and other selling expenses",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": -13005783
    },
    {
      "description": "Net result from insurance activities",
      "section": "Income-Expense",
      "analyte": ["Operating", "Investment"],
      "amount": 32455991
    },
    {
      "description": "Net operating income",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": 108826889
    },
    {
      "description": "Staff costs",
      "section": "Income-Expense",
      "analyte": ["Operating", "Wages"],
      "amount": 40526668
    },
    {
      "description": "Depreciation and amortisation",
      "section": "Income-Expense",
      "analyte": ["Operating", "Depreciation"],
      "amount": 8529471
    },
    {
      "description": "Finance cost",
      "section": "Income-Expense",
      "analyte": ["Operating", "Interest"],
      "amount": 1411727
    },
    {
      "description": "Other operating expenses",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": 31097938
    },
    {
      "description": "Operating Profit",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": 27261085
    },
    {
      "description": "Taxation",
      "section": "Income-Expense",
      "analyte": ["Tax"],
      "amount": -690064
    },
    {
      "description": "NET PROFIT",
      "section": "Income-Expense",
      "analyte": ["Operating"],
      "amount": 26883412
    },
    {
      "description": "Earnings per stock unit",
      "section": "Earnings per Stock",
      "analyte": ["Shares"],
      "amount": 8.01
    }
  ]
}

This JSON structure adheres to the format you requested, categorizing each line item based on its description, section, analyte classification, and the associated amount.

dayton-outar commented 1 month ago

If you want to avoid manual editing of text or tables and retain the format, you can follow this more automated approach:

1. Use an Advanced PDF Tool with OCR and Format Retention

Certain PDF tools are designed to extract content and retain the exact formatting automatically. These tools can convert the scanned images to text (via OCR) and preserve the layout (tables, images, fonts, etc.).

Here are a few options:

Adobe Acrobat Pro (Automated Process)

2. Batch OCR Automation with Scripting (if you want to code it)

If you'd prefer an automated, programmatic approach, you can use scripting to process the PDF and maintain its format:
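
A minimal sketch of this idea in Python, assuming ocrmypdf is installed and on the PATH (directory names are hypothetical):

import subprocess
from pathlib import Path

def batch_ocr(input_dir, output_dir):
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        # --skip-text leaves pages that already have a text layer untouched
        subprocess.run(["ocrmypdf", "--skip-text", str(pdf), str(out / pdf.name)],
                       check=True)

batch_ocr("reports/raw", "reports/ocr")  # hypothetical directories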

Would you like help setting up one of these tools or methods?

dayton-outar commented 1 month ago

Here's a Node.js script that reads the CSV file, processes each row, and generates SQL INSERT commands for a table called [stocks].[dbo].[StockIndices]. You'll need the csv-parser package to read the CSV file.

Steps:

  1. Install the csv-parser package:

    npm install csv-parser
  2. Here’s the script:

const fs = require('fs');
const csv = require('csv-parser');

// Path to your CSV file
const filePath = 'path/to/your/file.csv';

const insertSQL = [];

fs.createReadStream(filePath)
  .pipe(csv())
  .on('data', (row) => {
    const { Date, MarketIndexNo, Value } = row;

    // Create SQL INSERT statement for each row
    const sql = `
      INSERT INTO [stocks].[dbo].[StockIndices] (Date, MarketIndexNo, Value)
      VALUES ('${Date}', ${MarketIndexNo}, ${Value});
    `;

    insertSQL.push(sql);
  })
  .on('end', () => {
    console.log('CSV file processed. SQL commands:');
    insertSQL.forEach(sql => console.log(sql));

    // Optionally write the SQL commands to a file
    fs.writeFileSync('insert_commands.sql', insertSQL.join('\n'));
  })
  .on('error', (err) => {
    console.error('Error reading CSV file:', err);
  });

How it works:

  • fs.createReadStream streams the CSV file and pipes it through csv-parser.
  • Each data event supplies one parsed row, from which an INSERT statement is built.
  • On end, all statements are printed and optionally written to insert_commands.sql.

Example output:

If your CSV file has the following rows:

Date,MarketIndexNo,Value
2024-09-07,1,100.5
2024-09-08,2,101.7

It will generate:

INSERT INTO [stocks].[dbo].[StockIndices] (Date, MarketIndexNo, Value)
VALUES ('2024-09-07', 1, 100.5);

INSERT INTO [stocks].[dbo].[StockIndices] (Date, MarketIndexNo, Value)
VALUES ('2024-09-08', 2, 101.7);

Notes:

  • Values are interpolated directly into the SQL, so escape any single quotes in text fields (or switch to parameterized inserts) before running the output against the database.
  • Date is emitted as a quoted string; make sure the CSV dates are in a format SQL Server accepts (e.g., YYYY-MM-DD).

dayton-outar commented 1 month ago

A Common Table Expression (CTE) SELECT query to get the ValueChange value from newly inserted market index records,

WITH cte AS (
    SELECT 
        i.[No],
        i.[Date],
        i.[MarketIndexNo],
        i.[Value],
        LAG(i.[Value]) OVER (PARTITION BY i.[MarketIndexNo] ORDER BY i.[Date]) AS PrevValue
    FROM [stocks].[dbo].[StockIndices] i
)
SELECT 
    cte.[Date],
    cte.[MarketIndexNo],
    cte.[Value],
    cte.[Value] - cte.[PrevValue] AS ValueChange,
    si.[MarketIndexNo],
    mi.[Name]
FROM [cte]
    INNER JOIN [stocks].[dbo].[StockIndices] si ON si.[No] = cte.[No]
    INNER JOIN [stocks].[dbo].[MarketIndices] mi ON mi.[No] = si.[MarketIndexNo]
WHERE cte.Date = si.Date
    AND cte.MarketIndexNo = si.MarketIndexNo
    AND si.[LogNo] = 176317;
dayton-outar commented 1 month ago

Here's the update query that was eventually used,

WITH cte AS (
    SELECT 
        i.[No],
        i.[Date],
        i.[MarketIndexNo],
        i.[Value],
        LAG(i.[Value]) OVER (PARTITION BY i.[MarketIndexNo] ORDER BY i.[Date]) AS PrevValue
    FROM [stocks].[dbo].[StockIndices] i
)
UPDATE si
SET [ValueChange] = ISNULL(cte.[Value] - cte.PrevValue, 0)
FROM [cte]
    INNER JOIN [stocks].[dbo].[StockIndices] si ON si.[No] = cte.[No]
    INNER JOIN [stocks].[dbo].[MarketIndices] mi ON mi.[No] = si.[MarketIndexNo]
WHERE cte.[Date] = si.[Date]
    AND cte.MarketIndexNo = si.MarketIndexNo
    AND si.[LogNo] = 176317;
dayton-outar commented 1 month ago

A query to review the data,

SELECT  i.[Value]
      , i.[ValueChange]
      , i.[Date]
      , l.[Logged]
      , i.[MarketIndexNo]
      , m.[Name]
  FROM [stocks].[dbo].[StockIndices] i
    INNER JOIN [stocks].[dbo].[Logs] l ON l.[No] = i.[LogNo]
    INNER JOIN [stocks].[dbo].[MarketIndices] m ON m.[No] = i.[MarketIndexNo]
WHERE i.[LogNo] = 176317
ORDER BY l.[Logged] DESC
dayton-outar commented 1 month ago

Dividend information is tricky in that many companies that were de-listed no longer have their corporate actions available on the Jamaica Stock Exchange.

dayton-outar commented 1 month ago

So ....

I printed a page of the scanned document of the 2016 NCB Annual Report, in particular page 153, which has the "Consolidated Income Statement". I then used ocrmypdf to convert the 2016 NCB Annual Report into a format that maintained the layout but was readable by the GPT (meaning I could extract the text).

NCB-Annual-Report-2016.pdf

The ocrmypdf command was as follows,

ocrmypdf --redo-ocr NCB-Annual-Report-2016.pdf NCB-Annual-2016-converted.pdf

Screenshot from 2024-09-08 06-53-33

NCB-Annual-2016-converted.pdf

The output was positive. In fact, the output from the GPT delivered accurate readings,

[
  {
    "description": "Interest income",
    "section": "Income-Expense",
    "analyte": ["Operating", "Interest"],
    "amount": 39156349
  },
  {
    "description": "Interest expense",
    "section": "Income-Expense",
    "analyte": ["Operating", "Interest"],
    "amount": -11032579
  },
  {
    "description": "Net interest income",
    "section": "Income-Expense",
    "analyte": ["Operating", "Interest"],
    "amount": 28123770
  },
  {
    "description": "Fee and commission income",
    "section": "Income-Expense",
    "analyte": ["Operating"],
    "amount": 13575872
  },
  {
    "description": "Fee and commission expense",
    "section": "Income-Expense",
    "analyte": ["Operating"],
    "amount": -2663405
  },
  {
    "description": "Net fee and commission income",
    "section": "Income-Expense",
    "analyte": ["Operating"],
    "amount": 10912467
  },
  {
    "description": "Gain on foreign currency and investment activities",
    "section": "Income-Expense",
    "analyte": ["Investment"],
    "amount": 4736122
  },
  {
    "description": "Premium income",
    "section": "Income-Expense",
    "analyte": ["Other"],
    "amount": 7991693
  },
  {
    "description": "Dividend income",
    "section": "Income-Expense",
    "analyte": ["Investment"],
    "amount": 149921
  },
  {
    "description": "Other operating income",
    "section": "Income-Expense",
    "analyte": ["Other"],
    "amount": 143736
  },
  {
    "description": "Staff costs",
    "section": "Income-Expense",
    "analyte": ["Wages"],
    "amount": 13809023
  },
  {
    "description": "Provision for credit losses",
    "section": "Income-Expense",
    "analyte": ["Other"],
    "amount": 612355
  },
  {
    "description": "Policyholders’ and annuitants’ benefits and reserves",
    "section": "Income-Expense",
    "analyte": ["Reserves"],
    "amount": 4292643
  },
  {
    "description": "Depreciation and amortisation",
    "section": "Income-Expense",
    "analyte": ["Depreciation"],
    "amount": 1899414
  },
  {
    "description": "Impairment losses on securities",
    "section": "Income-Expense",
    "analyte": ["Impairment"],
    "amount": 79765
  },
  {
    "description": "Other operating expenses",
    "section": "Income-Expense",
    "analyte": ["Operating", "Other"],
    "amount": 13348202
  },
  {
    "description": "Share of profit of associates",
    "section": "Income-Expense",
    "analyte": ["Other"],
    "amount": 832480
  },
  {
    "description": "Loss on dilution of investment in associate",
    "section": "Income-Expense",
    "analyte": ["Investment"],
    "amount": -50748
  },
  {
    "description": "Taxation",
    "section": "Income-Expense",
    "analyte": ["Tax"],
    "amount": -4479992
  },
  {
    "description": "Net profit attributable to stockholders of the Bank",
    "section": "Income-Expense",
    "analyte": ["Shareholders"],
    "amount": 15636446
  },
  {
    "description": "Net profit attributable to non-controlling interest",
    "section": "Income-Expense",
    "analyte": ["NonControlling"],
    "amount": -1187886
  },
  {
    "description": "Basic earnings per stock unit",
    "section": "Earnings per Stock",
    "analyte": ["Basic", "Shares"],
    "amount": 6.35
  },
  {
    "description": "Diluted earnings per stock unit",
    "section": "Earnings per Stock",
    "analyte": ["Diluted", "Shares"],
    "amount": 5.87
  }
]

Reference

  1. OCRmyPDF
  2. OCRmyPDF Documentation
dayton-outar commented 1 month ago

Yes, you can use pdfimages, a command-line tool available in Ubuntu, to check for pages that contain scanned images in a PDF. This tool is part of the poppler-utils package and is useful for extracting images from PDF files, but you can also use it to identify which pages contain raster (scanned) images.

Steps to Identify Pages with Scanned Images:

  1. Install pdfimages (if not already installed):

    sudo apt update
    sudo apt install poppler-utils
  2. List images in the PDF: Run pdfimages -list on your PDF file to list all the images and the pages that contain them. For example:

    pdfimages -list input.pdf

    This command lists all the images in the PDF, along with their respective page numbers. The output includes:

    • The page number where the image is located.
    • The resolution and size of the images.

Example Output:

page   num  type   width height color comp bpc  enc interp  object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
   1     0 image    1654  2339  rgb     3   8  image  no         9  0   300   300  251K  4.1%
   2     1 image    1654  2339  rgb     3   8  image  no        14  0   300   300  245K  4.0%

How to Identify Scanned Pages:

A page that consists of a single full-page raster image (width and height matching the page at around 300 ppi, as in the output above) is almost certainly a scan; pages with no image entries, or only small images, contain native text or graphics.

Example Command for Checking Scanned Pages:

pdfimages -list input.pdf | grep -E "^ *[0-9]+ +[0-9]+ +image"

This will list only the rows where raster images are found, along with their page numbers (the pattern allows for the image-number column before matching the type column).

Additional Tools for PDF Analysis:

If you need more sophisticated analysis (e.g., identifying whether a page is purely scanned without text layers), you can combine this with pdftotext (also in poppler-utils) to check whether a given page yields any extractable text.

Conclusion:

pdfimages is well suited to identifying which pages in a PDF contain scanned images, since it lists every image along with its page number.

dayton-outar commented 1 month ago

You can capture the output from the pdfimages command and transform it into a JSON format using shell tools like awk or sed, along with a simple script. Here’s a method to do this in Ubuntu.

1. Extracting PDF Image Information and Converting to JSON

First, you can run the pdfimages command and parse its output, converting it into JSON format. Here’s a simple bash script to achieve that.

Step-by-Step Approach:

  1. Run the pdfimages -list command to get the output.

    pdfimages -list input.pdf
  2. Use Bash with Awk or Sed to format the output into JSON.

Here’s an example bash script that converts the pdfimages -list output into JSON:

#!/bin/bash

# Get the list of images from the PDF
pdfimages -list "$1" | awk '
BEGIN {
    # Initialize the JSON array
    print "["
    first = 1
}
NR > 2 {
    # Skip the header lines
    if (first == 0) {
        print ","
    }
    first = 0

    # Print each entry in JSON format
    printf "  {\n"
    printf "    \"page\": %d,\n", $1
    printf "    \"num\": %d,\n", $2
    printf "    \"type\": \"%s\",\n", $3
    printf "    \"width\": %d,\n", $4
    printf "    \"height\": %d,\n", $5
    printf "    \"color\": \"%s\",\n", $6
    printf "    \"comp\": %d,\n", $7
    printf "    \"bpc\": %d,\n", $8
    printf "    \"enc\": \"%s\",\n", $9
    printf "    \"interp\": \"%s\",\n", $10
    printf "    \"object_id\": \"%d %d\",\n", $11, $12
    printf "    \"x_ppi\": %d,\n", $13
    printf "    \"y_ppi\": %d,\n", $14
    printf "    \"size\": \"%s\",\n", $15
    printf "    \"ratio\": \"%s\"\n", $16
    printf "  }"
}
END {
    # End the JSON array
    print "\n]"
}'

3. Run the Script (assuming it is saved as, say, pdfimages_to_json.sh and made executable):

    ./pdfimages_to_json.sh input.pdf

This script will output the pdfimages results in JSON format, and the result will look like this:

Example Output JSON:

[
  {
    "page": 1,
    "num": 0,
    "type": "image",
    "width": 1654,
    "height": 2339,
    "color": "rgb",
    "comp": 3,
    "bpc": 8,
    "enc": "image",
    "interp": "no",
    "object_id": "9 0",
    "x_ppi": 300,
    "y_ppi": 300,
    "size": "251K",
    "ratio": "4.1%"
  },
  {
    "page": 2,
    "num": 1,
    "type": "image",
    "width": 1654,
    "height": 2339,
    "color": "rgb",
    "comp": 3,
    "bpc": 8,
    "enc": "image",
    "interp": "no",
    "object_id": "14 0",
    "x_ppi": 300,
    "y_ppi": 300,
    "size": "245K",
    "ratio": "4.0%"
  }
]

What the Script Does:

  • Skips the two header lines of the pdfimages -list output (NR > 2).
  • Maps the 16 whitespace-separated fields of each row to named JSON properties.
  • Prints the rows as a JSON array, inserting commas between entries.

You can now use this JSON for further processing or analysis! Let me know if you need any additional modifications to the script.

dayton-outar commented 1 month ago

To print specific pages from a PDF document, you can use several tools depending on your needs and the platform you're on:

  1. Adobe Acrobat Reader:

    • Open the PDF in Adobe Acrobat Reader.
    • Go to File > Print.
    • In the Print dialog, enter the page range you want to print in the Pages field (e.g., "5-10" for pages 5 to 10).
  2. PDF Software (e.g., Foxit Reader, PDF-XChange Editor):

    • Similar to Adobe Acrobat Reader, open the PDF in your chosen software.
    • Navigate to the print settings.
    • Specify the pages you want to print in the designated field.
  3. Online PDF Tools:

    • Websites like Smallpdf or PDF2Go allow you to select and print specific pages from a PDF.
  4. Command Line Tools:

    • pdftk: A command-line tool for PDF manipulation. Example command to extract pages 5 to 10 into a new PDF for printing:
      pdftk input.pdf cat 5-10 output output.pdf
    • qpdf: Another command-line tool for manipulating PDFs. Example command to do the same:
      qpdf input.pdf --pages . 5-10 -- output.pdf
  5. Preview (macOS):

    • Open the PDF in Preview.
    • Go to File > Print.
    • In the Print dialog, enter the range of pages you want to print in the Pages field.

These tools should cover most of your needs for printing specific pages from a PDF.

dayton-outar commented 1 month ago

So ...

ocrmypdf took 30 minutes to redo the OCR conversion of a 49.3 MB file containing 302 pages (i.e. the 2016 NCB Annual Report). It stressed my computer, but it accurately generated an overlay of text for all pages and produced a 90.1 MB file mimicking the input document. Excellent!

Now, the JSE lists about 200 companies. This means about 200 annual reports per year, not to mention 3 quarterly reports for each company per year, roughly 800 filings in all. It's worth noting that the document tested was the biggest file NCB has ever produced as an annual report, so for the tool to produce such an excellent result in 30 minutes is good. This will be a one-time operation that dumps information into the structured database.

So, now that we can extract information from the PDF documents, there still needs to be a more intelligent and optimal way to identify and extract only what is needed. Hence the need for tools that can read and split pages out of the original document. This is the reason for looking at qpdf.

dayton-outar commented 1 month ago

So ...

The following command took milliseconds to complete,

qpdf NCB-Annual-Report-2016.pdf --pages . 138-165 -- output.pdf

28 pages were extracted into the output document, output.pdf

dayton-outar commented 1 month ago

To determine which pages of a PDF contain scanned images using pdf2image, you can follow these general steps. Note that pdf2image is a Python library (a wrapper around poppler's pdftoppm), not a standalone command:

  1. Convert PDF Pages to Images: First, convert the pages of the PDF to image files using pdf2image. This will give you images of each page, which you can then analyze to check for scanned content.

    from pdf2image import convert_from_path
    pages = convert_from_path("input.pdf", dpi=300, output_folder="output_directory", fmt="png")

    This converts each page of input.pdf into an image and saves the files in the specified output_directory.

  2. Analyze Images for Scanned Content: To check if the images contain text, you would use Optical Character Recognition (OCR) tools like Tesseract to recognize any text in the images. Here's an example workflow using Tesseract:

    • Install Tesseract (if not already installed):

      sudo apt-get install tesseract-ocr
    • Run Tesseract on Each Image: You can loop through the images and use Tesseract to see if it detects any text. For example, you could use a script to process each image file:

      for img in output_directory/*.png; do
          text=$(tesseract "$img" stdout 2>/dev/null)
          if [ -z "$text" ]; then
              echo "No text found in $img"
          else
              echo "Text found in $img"
          fi
      done

      This script runs Tesseract on each image file and tests whether any text was recognized. (Tesseract exits successfully even when it finds nothing, so the recognized output itself is checked rather than the exit code.) If text is found, the page likely contains scanned text.

Note: There isn't a direct pdf2image command to locate scanned images; the process involves converting the PDF to images and then analyzing those images. If you need more advanced analysis, combining pdf2image with OCR tools like Tesseract will help you determine which pages contain scanned text.
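
A compact sketch of that combination, assuming the pdf2image and pytesseract packages (and the tesseract-ocr binary) are installed:

from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("input.pdf", dpi=300)
for i, page in enumerate(pages, start=1):
    text = pytesseract.image_to_string(page).strip()
    print(f"page {i}: {'text found' if text else 'no text'}")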

dayton-outar commented 1 month ago

Sigh ...

Very challenging documents to extract information from are going to be the 2016 and 2019 Proven Annual Reports. The challenge comes from a folding-page concept that fits two pages of information into one PDF page with a landscape print layout.
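
One possible pre-processing step is to split each landscape page into left and right halves before OCR. A sketch using the pypdf package (an assumption, as is the file name):

import copy
from pypdf import PdfReader, PdfWriter

reader = PdfReader("proven-annual-report-2016.pdf")  # hypothetical file name
writer = PdfWriter()
for page in reader.pages:
    mid = float(page.mediabox.left) + float(page.mediabox.width) / 2
    left = copy.deepcopy(page)
    left.mediabox.upper_right = (mid, float(page.mediabox.top))
    right = copy.deepcopy(page)
    right.mediabox.lower_left = (mid, float(page.mediabox.bottom))
    writer.add_page(left)   # left half becomes its own page
    writer.add_page(right)  # right half follows
with open("proven-split.pdf", "wb") as f:
    writer.write(f)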

dayton-outar commented 1 month ago

To kick off a C++ project that integrates external tools like ocrmypdf, pdfimages, qpdf, and a GPT API for analyzing financial statements, you can follow these key steps. The project will involve managing PDF manipulation and OCR processes while analyzing and transforming financial data into JSON payloads. You’ll use C++ as the core language and Autotools for build automation.

Step-by-Step Guide

1. Project Setup with Autotools

Autotools automates the build process for your C++ project. Here's how to set it up:

Directory structure:

   project-root/
   ├── src/
   ├── include/
   ├── configure.ac
   ├── Makefile.am
   ├── autogen.sh
   └── README.md

Initialize Autotools Files:

2. Integrating PDF Tools (ocrmypdf, pdfimages, qpdf)

These tools will be called from C++ using the system() function or popen() to capture output. Create a utility class to handle the calls to external tools.

Example: pdf_utility.cpp

   #include <iostream>
   #include <string>    // for std::string
   #include <cstdlib>   // for system()
   #include <cstdio>    // for popen(), if output needs to be captured

   void runOcrMyPdf(const std::string &inputPdf, const std::string &outputPdf) {
       std::string command = "ocrmypdf " + inputPdf + " " + outputPdf;
       system(command.c_str());
   }

   void runPdfImages(const std::string &inputPdf, const std::string &outputDir) {
       // pdfimages -list emits a plain-text table; convert it separately
       // (e.g. with the awk script earlier in this thread) if JSON is needed
       std::string command = "pdfimages -list " + inputPdf + " > " + outputDir + "/images.txt";
       system(command.c_str());
   }

   void runQpdf(const std::string &inputPdf, const std::string &outputPdf) {
       std::string command = "qpdf --decrypt " + inputPdf + " " + outputPdf;
       system(command.c_str());
   }

3. Using the GPT API in C++

You can integrate the GPT API using a library like cURL for making HTTP requests.

Install cURL:

   sudo apt-get install libcurl4-openssl-dev

Example: gpt.cpp

   #include <iostream>
   #include <string>
   #include <curl/curl.h>
   #include <nlohmann/json.hpp>

   std::string callGptApi(const std::string &apiKey, const std::string &prompt) {
       CURL *curl;
       CURLcode res;
       std::string response;

       curl = curl_easy_init();
       if(curl) {
           struct curl_slist *headers = NULL;
           headers = curl_slist_append(headers, "Content-Type: application/json");
           headers = curl_slist_append(headers, ("Authorization: Bearer " + apiKey).c_str());

           // Chat Completions payload; chat models such as gpt-4 use the
           // messages array rather than the older "prompt" field
           nlohmann::json messages = nlohmann::json::array({
               {{"role", "user"}, {"content", prompt}}
           });
           nlohmann::json jsonPayload = {
               {"model", "gpt-4"},
               {"messages", messages},
               {"max_tokens", 1000}
           };

           std::string data = jsonPayload.dump();
           curl_easy_setopt(curl, CURLOPT_URL, "https://api.openai.com/v1/chat/completions");
           curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
           curl_easy_setopt(curl, CURLOPT_POSTFIELDS, data.c_str());

           // Capture-less lambda; the leading + converts it to the plain
           // function pointer that libcurl expects
           curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION,
               +[](char *ptr, size_t size, size_t nmemb, void *userdata) -> size_t {
                   static_cast<std::string *>(userdata)->append(ptr, size * nmemb);
                   return size * nmemb;
               });
           curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);

           res = curl_easy_perform(curl);
           if(res != CURLE_OK)
               std::cerr << "curl_easy_perform() failed: " << curl_easy_strerror(res) << std::endl;

           curl_slist_free_all(headers);
           curl_easy_cleanup(curl);
       }
       return response;
   }

4. Core Functionality: Reading PDFs and Analyzing Data

Tie the utilities together: split or decrypt the input with runQpdf, add a text layer with runOcrMyPdf, extract the statement text, send it to callGptApi with the classification prompt used earlier in this thread, and parse the returned JSON into the database.

5. Building and Running the Project

After setting up the project structure and Autotools configuration, build and run the project:

   ./autogen.sh
   ./configure
   make
   ./pdf_analyzer input.pdf

6. Error Handling and Testing

7. Next Steps

This should provide you with a solid foundation to kick off the project. Let me know if you'd like additional details on any step!