dayton-outar opened 1 year ago
Implementing hot keys poses a challenge in that it clashes with the browser's hot keys. Maybe we should stick with point-and-click until a later date and focus on more substantial value-added features for now.
To automate entry by parsing annual and quarterly reports, we will need an OCR solution such as Asprise OCR. Such a tool should be able to read the following image and output the data in a data structure that can be injected into the system.
More reports that the automation can be tested on.
A code snippet for using Asprise OCR:
using System;
using AspriseOCR;

namespace OCRDemo
{
    class Program
    {
        static void Main(string[] args)
        {
            // Initialize the OCR engine
            Ocr.Setup();

            // Create an OCR instance
            Ocr ocr = new Ocr();

            // Start the OCR process with table detection enabled
            string ocrResult = ocr.Recognize("path_to_your_image.jpg", -1, -1, -1, -1, -1,
                Ocr.RECOGNIZE_TYPE_ALL, Ocr.OUTPUT_FORMAT_PLAINTEXT,
                "prop.tableDetection=true|prop.tableExtractMethod=stream", "");

            // Output the OCR result
            Console.WriteLine(ocrResult);

            // Clean up
            ocr.Dispose();
        }
    }
}
So, today I have found some more options for invoice or tabular data extraction solutions that use OCR. They are as follows:
The ChatPDF online application serves as an example of the feature needed to embellish the graphical display of the annual report. So, with prompt engineering the line items in the financial statements can be classified and prepared for entry into the shillings database.
This issue will indeed operate in tandem with issue #51 and thus shares interest in an extraction process that can intelligently classify the information provided in annual reports that come in many variations.
Automating the parsing of audited financial statements to extrapolate items from the balance sheet, profit/loss, and cashflow statements involves several steps. Here's a structured approach to achieving this:
PDF Parsing: use pdfplumber or Tabula to extract tables. Example with pdfplumber:
import pdfplumber

with pdfplumber.open("financial_statement.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            pass  # Process table data
Excel Parsing:
Example with Pandas:
import pandas as pd
balance_sheet = pd.read_excel("financial_statement.xlsx", sheet_name="Balance Sheet")
profit_loss = pd.read_excel("financial_statement.xlsx", sheet_name="Profit and Loss")
cash_flow = pd.read_excel("financial_statement.xlsx", sheet_name="Cash Flow")
HTML Parsing:
Example with BeautifulSoup:
from bs4 import BeautifulSoup

with open("financial_statement.html") as file:
    soup = BeautifulSoup(file, "html.parser")
    tables = soup.find_all("table")
    for table in tables:
        pass  # Process table data
Map Data: Map extracted data to the defined models.
class BalanceSheetItem:
    def __init__(self, item, value):
        self.item = item
        self.value = value

balance_sheet_items = []
for index, row in balance_sheet.iterrows():
    balance_sheet_items.append(BalanceSheetItem(row['Item'], row['Value']))
Here's a simple example combining some of the steps:
import pdfplumber
import pandas as pd

def parse_pdf(file_path):
    data = {"balance_sheet": [], "profit_loss": [], "cash_flow": []}
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            tables = page.extract_tables()
            for table in tables:
                # Process and classify table data into balance_sheet, profit_loss, or cash_flow
                pass
    return data

def main():
    file_path = "financial_statement.pdf"
    parsed_data = parse_pdf(file_path)

    # Example processing for balance_sheet
    balance_sheet_df = pd.DataFrame(parsed_data['balance_sheet'], columns=['Item', 'Value'])
    profit_loss_df = pd.DataFrame(parsed_data['profit_loss'], columns=['Item', 'Value'])
    cash_flow_df = pd.DataFrame(parsed_data['cash_flow'], columns=['Item', 'Value'])

    # Further processing, storage, and analysis
    print(balance_sheet_df)
    print(profit_loss_df)
    print(cash_flow_df)

if __name__ == "__main__":
    main()
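The "process and classify" placeholder can be filled with a simple keyword heuristic as a first pass. A minimal sketch; the keyword lists are illustrative assumptions, not a tested taxonomy:

```python
SECTION_KEYWORDS = {
    "balance_sheet": ["total assets", "total liabilities", "share capital"],
    "profit_loss": ["revenue", "gross profit", "operating expenses"],
    "cash_flow": ["operating activities", "investing activities", "financing activities"],
}

def classify_table(table):
    """Guess which statement a raw extracted table belongs to by keyword hits."""
    text = " ".join(str(cell).lower() for row in table for cell in row if cell)
    scores = {name: sum(kw in text for kw in kws)
              for name, kws in SECTION_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] else None
```

Anything that matches no keyword returns None and can be routed to manual review.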
Handling different layouts of financial statements requires a flexible and adaptable approach in your automation. The strategy involves the following steps:
Identify Common Patterns and Variations: Analyze the different formats and identify common patterns, such as key section headers, common terms, and typical data structures.
Create Template-Based Parsing Rules: Develop a set of parsing rules or templates for each identified layout. Use these templates to guide the extraction process.
Implement Layout Detection: Build a mechanism to detect which layout a particular document follows. This can be based on specific keywords, section headers, or structural cues.
Modularize Parsing Logic: Structure your code in a modular way to handle different parsing rules based on the detected layout.
Here’s a step-by-step guide to building such an automation system:
import re
import pandas as pd

def detect_layout(text):
    if "Balance Sheet" in text and "Profit and Loss" in text:
        return "layout_1"
    elif "Statement of Financial Position" in text and "Statement of Comprehensive Income" in text:
        return "layout_2"
    # Add more conditions for different layouts
    else:
        return "unknown"

def parse_layout_1(text):
    balance_sheet = extract_section(text, "Balance Sheet", "Profit and Loss")
    profit_loss = extract_section(text, "Profit and Loss", "Cash Flow Statement")
    cash_flow = extract_section(text, "Cash Flow Statement", "End of Document")
    return balance_sheet, profit_loss, cash_flow

def parse_layout_2(text):
    balance_sheet = extract_section(text, "Statement of Financial Position", "Statement of Comprehensive Income")
    profit_loss = extract_section(text, "Statement of Comprehensive Income", "Cash Flow Statement")
    cash_flow = extract_section(text, "Cash Flow Statement", "End of Document")
    return balance_sheet, profit_loss, cash_flow

def extract_section(text, start_marker, end_marker):
    start_idx = text.find(start_marker)
    end_idx = text.find(end_marker, start_idx)
    if start_idx == -1 or end_idx == -1:
        return ""
    return text[start_idx:end_idx]

def parse_financial_statements(text):
    layout = detect_layout(text)
    if layout == "layout_1":
        return parse_layout_1(text)
    elif layout == "layout_2":
        return parse_layout_2(text)
    else:
        raise ValueError("Unknown layout format")

def main():
    # Example OCR text extraction from images
    ocr_text = """..."""  # OCR output text here
    balance_sheet, profit_loss, cash_flow = parse_financial_statements(ocr_text)

    # Convert to DataFrame for further processing
    balance_sheet_df = pd.DataFrame(balance_sheet, columns=['Item', 'Value'])
    profit_loss_df = pd.DataFrame(profit_loss, columns=['Item', 'Value'])
    cash_flow_df = pd.DataFrame(cash_flow, columns=['Item', 'Value'])

    print(balance_sheet_df)
    print(profit_loss_df)
    print(cash_flow_df)

if __name__ == "__main__":
    main()
Preprocessing for OCR: Ensure images are preprocessed for better OCR accuracy (e.g., binarization, noise reduction).
Regular Expressions for Flexibility: Use regular expressions to capture variations in section headers and item formats.
Machine Learning for Layout Detection: Consider using machine learning models to classify document layouts based on features extracted from the text.
Continuous Improvement: As new formats are encountered, update the layout detection and parsing logic.
Error Handling: Implement robust error handling to manage unknown formats or parsing errors gracefully.
Documentation and Testing: Document different layouts and thoroughly test the parsing functions with various samples to ensure accuracy.
By following this structured approach, you can build a flexible automation system capable of handling various layouts of financial statements.
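As a concrete instance of the "regular expressions for flexibility" tip above, the varying section headers can each be matched by one tolerant pattern per statement. A sketch; the patterns below cover only the header variants mentioned so far:

```python
import re

SECTION_PATTERNS = {
    "balance_sheet": re.compile(
        r"balance\s+sheet|statement\s+of\s+financial\s+position", re.I),
    "profit_loss": re.compile(
        r"profit\s+and\s+loss|statement\s+of\s+comprehensive\s+income|income\s+statement",
        re.I),
    "cash_flow": re.compile(
        r"cash\s*flow\s+statement|statement\s+of\s+cash\s*flows", re.I),
}

def find_sections(text):
    """Return {section: start offset} for every header the patterns match."""
    return {name: m.start()
            for name, p in SECTION_PATTERNS.items()
            if (m := p.search(text))}
```

The offsets can then feed extract_section-style slicing without hard-coding one layout's exact header strings.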
So, balance sheets (or statements of financial position), along with income-expense and cashflow statements, consist of a list of descriptions of each aggregated item and the associated totals.
An LLM will be required to read the extracted data and classify each line item by Section and Analyte. So each line item, except computed lines like Total Assets and Total Liabilities, will have to fall in one of the sections shown below.
A line item can be associated with more than one Analyte. For example, a line item that describes expenses such as Direct and Operating Expenses can be associated with both the Direct and Operating analytes. This is achieved in the API using bit flags.
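The bit-flag association can be sketched in Python with an IntFlag. The flag names below mirror a few entries from the analyte list used in the prompts; the values are illustrative, not the actual API definitions:

```python
from enum import IntFlag, auto

class Analyte(IntFlag):
    # auto() assigns powers of two, so flags combine without colliding
    OPERATING = auto()
    DIRECT = auto()
    DEPRECIATION = auto()
    INTEREST = auto()
    TAX = auto()

# A line item such as "Direct and operating expenses" carries both flags:
item_flags = Analyte.OPERATING | Analyte.DIRECT
```

Storing the combined value as a single integer keeps the many-to-many association in one column.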
So, major developments happened during my break, and I was inspired by "How to use ChatGPT to analyze PDFs for free" to try out a prompt on ChatGPT pertaining to this particular enhancement.
I decided to use a quarterly report from GraceKennedy. This document had no images in it. See below a snapshot of one page of the 14-page document.
I uploaded the 14-page document and added the following prompt:
Give me a JSON string of the income statement that does not include aggregated items except earnings per stock unit (or eps or earnings per share) and categorize each item in the statement as an array item that consist of the following properties,
- The description of the line item
- The section within the statement that can be either: Income-Expense, Earnings per Stock
- Categorize the line item in all appropriate item classes from this list into an array property called analyte: Operating, Direct, Depreciation, Impairment, Interest, Other, Investment, Tax, Shareholders, NonControlling, Basic, Diluted, Cash, Inventories, Receivables, Payables, Current, Fixed, Loan, Intangible, Capital, Retained, Reserves, Dividends, Shares, Treasury, Wages, ExchangeRateChanges, Adjustments, Note.
- The amount associated, in the second column only, with that line item
The results were delightful.
This is what I got.
" Here is the JSON string representing the income statement with each item categorized as requested:
{
"income_statement": [
{
"description": "Revenue from products and services",
"section": "Income-Expense",
"analyte": ["Operating", "Direct", "Current"],
"amount": 92355966
},
{
"description": "Interest revenue",
"section": "Income-Expense",
"analyte": ["Investment", "Interest", "Other"],
"amount": 3420263
},
{
"description": "Direct and operating expenses",
"section": "Income-Expense",
"analyte": ["Operating", "Direct", "Wages", "Cash"],
"amount": 89499744
},
{
"description": "Net impairment losses on financial assets",
"section": "Income-Expense",
"analyte": ["Impairment", "Loan"],
"amount": 392404
},
{
"description": "Other income",
"section": "Income-Expense",
"analyte": ["Operating", "Other"],
"amount": 2712404
},
{
"description": "Interest income – non-financial services",
"section": "Income-Expense",
"analyte": ["Interest", "Other"],
"amount": 433576
},
{
"description": "Interest expense – non-financial services",
"section": "Income-Expense",
"analyte": ["Interest", "Loan", "Payables"],
"amount": 858262
},
{
"description": "Share of results of associates and joint ventures",
"section": "Income-Expense",
"analyte": ["Investment", "Shareholders"],
"amount": 368294
},
{
"description": "Taxation",
"section": "Income-Expense",
"analyte": ["Tax", "Cash"],
"amount": 2305826
},
{
"description": "Net Profit for the period",
"section": "Income-Expense",
"analyte": ["Operating", "Retained", "Shareholders"],
"amount": 6234267
},
{
"description": "Profit attributable to Owners of GraceKennedy Limited",
"section": "Income-Expense",
"analyte": ["Operating", "Retained", "Shareholders"],
"amount": 5677684
},
{
"description": "Profit attributable to Non-controlling interests",
"section": "Income-Expense",
"analyte": ["NonControlling", "Operating"],
"amount": 556583
},
{
"description": "Earnings per Stock Unit - Basic",
"section": "Earnings per Stock",
"analyte": ["Basic", "Shares", "Note"],
"amount": 5.73
},
{
"description": "Earnings per Stock Unit - Diluted",
"section": "Earnings per Stock",
"analyte": ["Diluted", "Shares", "Note"],
"amount": 5.69
}
]
}
This JSON structure includes each line item from the income statement, categorized by its section, analytes, and associated amounts. "
This saves a lot of time and opens up possibilities for efficient use of data entry time.
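Before a GPT response like the one above is entered into the database, a light validation pass helps catch malformed output. A minimal sketch, assuming the JSON shape shown above:

```python
import json

REQUIRED = {"description", "section", "analyte", "amount"}

def load_income_statement(raw):
    """Parse a GPT income-statement response and sanity-check each line item."""
    items = json.loads(raw)["income_statement"]
    for item in items:
        missing = REQUIRED - item.keys()
        if missing:
            raise ValueError(f"line item missing fields: {sorted(missing)}")
    return items
```

A rejected response can simply be re-prompted rather than hand-corrected.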
The big question is now: how big a file can GPT handle?
After realizing that GPT was recognizing the "first" column as first in its category and not first overall, I corrected my prompt as follows:
Excellent. I just realized that I want this for the first column of amounts
"Here is the JSON string with the amounts from the first column:
{
"income_statement": [
{
"description": "Revenue from products and services",
"section": "Income-Expense",
"analyte": ["Operating", "Direct", "Current"],
"amount": 31275213
},
{
"description": "Interest revenue",
"section": "Income-Expense",
"analyte": ["Investment", "Interest", "Other"],
"amount": 1149675
},
{
"description": "Direct and operating expenses",
"section": "Income-Expense",
"analyte": ["Operating", "Direct", "Wages", "Cash"],
"amount": 30248373
},
{
"description": "Net impairment losses on financial assets",
"section": "Income-Expense",
"analyte": ["Impairment", "Loan"],
"amount": 162481
},
{
"description": "Other income",
"section": "Income-Expense",
"analyte": ["Operating", "Other"],
"amount": 1185127
},
{
"description": "Interest income – non-financial services",
"section": "Income-Expense",
"analyte": ["Interest", "Other"],
"amount": 157484
},
{
"description": "Interest expense – non-financial services",
"section": "Income-Expense",
"analyte": ["Interest", "Loan", "Payables"],
"amount": 300416
},
{
"description": "Share of results of associates and joint ventures",
"section": "Income-Expense",
"analyte": ["Investment", "Shareholders"],
"amount": 38715
},
{
"description": "Taxation",
"section": "Income-Expense",
"analyte": ["Tax", "Cash"],
"amount": 835635
},
{
"description": "Net Profit for the period",
"section": "Income-Expense",
"analyte": ["Operating", "Retained", "Shareholders"],
"amount": 2259309
},
{
"description": "Profit attributable to Owners of GraceKennedy Limited",
"section": "Income-Expense",
"analyte": ["Operating", "Retained", "Shareholders"],
"amount": 2090387
},
{
"description": "Profit attributable to Non-controlling interests",
"section": "Income-Expense",
"analyte": ["NonControlling", "Operating"],
"amount": 168922
},
{
"description": "Earnings per Stock Unit - Basic",
"section": "Earnings per Stock",
"analyte": ["Basic", "Shares", "Note"],
"amount": 2.11
},
{
"description": "Earnings per Stock Unit - Diluted",
"section": "Earnings per Stock",
"analyte": ["Diluted", "Shares", "Note"],
"amount": 2.09
}
]
}
This JSON structure now includes the amounts from the first column of the income statement. "
I find that uploading a 9.7 MB file takes less than a minute and reading it takes about 15 seconds.
However, when testing GPT with a document that contains images of the report, it failed to read the report accurately and produced inaccurate results.
GPT took 2 minutes to upload the 2016 NCB Annual Report, which at 49 MB is a relatively big file. This file seemed to have some layer of protection over the audited financial statements, so GPT was able to read some, but not all, of the line items in the income statement. Some of the figures were accurate and some were inaccurate (you could see it reading the first part of a figure but not the entire figure).
This is not bad, but to save time all the information needs to be read accurately. So, it looks like whenever the textual information can be accessed in the PDF document, GPT will read it accurately. However, if the statement is in an image or protected, then GPT will deliver inaccurate and incomplete information.
Next, I attempted to use GPT to read the 2020 NCB Annual Report, which was 14.5 MB.
This was more promising. The financial statements were not protected and were not placed in an image in the document. See below the results from the same GPT prompt used for income statements:
Based on the provided NCB Financial Group Limited Annual Report 2020, I have extracted the relevant data from the income statement and structured it in the requested JSON format:
{
"IncomeStatement": [
{
"description": "Interest income",
"section": "Income-Expense",
"analyte": ["Operating", "Investment"],
"amount": 74421878
},
{
"description": "Interest expense",
"section": "Income-Expense",
"analyte": ["Operating", "Interest"],
"amount": -21932169
},
{
"description": "Net interest income",
"section": "Income-Expense",
"analyte": ["Operating", "Investment"],
"amount": 52489709
},
{
"description": "Fee and commission income",
"section": "Income-Expense",
"analyte": ["Operating"],
"amount": 25925325
},
{
"description": "Fee and commission expense",
"section": "Income-Expense",
"analyte": ["Operating"],
"amount": -4555918
},
{
"description": "Net fee and commission income",
"section": "Income-Expense",
"analyte": ["Operating"],
"amount": 21369407
},
{
"description": "Gain on foreign currency and investment activities",
"section": "Income-Expense",
"analyte": ["Operating", "Investment"],
"amount": 8793286
},
{
"description": "Credit impairment losses",
"section": "Income-Expense",
"analyte": ["Operating", "Impairment"],
"amount": -10284994
},
{
"description": "Dividend income",
"section": "Income-Expense",
"analyte": ["Operating", "Investment"],
"amount": 1901300
},
{
"description": "Other operating income",
"section": "Income-Expense",
"analyte": ["Operating", "Other"],
"amount": 2102190
},
{
"description": "Net result from banking and investment activities",
"section": "Income-Expense",
"analyte": ["Operating", "Investment"],
"amount": 76370898
},
{
"description": "Premium income",
"section": "Income-Expense",
"analyte": ["Operating", "Investment"],
"amount": 135202001
},
{
"description": "Insurance premium ceded to reinsurers",
"section": "Income-Expense",
"analyte": ["Operating"],
"amount": -42003606
},
{
"description": "Reinsurance commission income",
"section": "Income-Expense",
"analyte": ["Operating"],
"amount": 8470579
},
{
"description": "Net underwriting income",
"section": "Income-Expense",
"analyte": ["Operating"],
"amount": 101668974
},
{
"description": "Gross policyholders’ and annuitants’ benefits and reserves",
"section": "Income-Expense",
"analyte": ["Operating"],
"amount": -63223937
},
{
"description": "Reinsurance on policyholders’ and annuitants’ benefits and reserves",
"section": "Income-Expense",
"analyte": ["Operating"],
"amount": 7016737
},
{
"description": "Commission and other selling expenses",
"section": "Income-Expense",
"analyte": ["Operating"],
"amount": -13005783
},
{
"description": "Net result from insurance activities",
"section": "Income-Expense",
"analyte": ["Operating", "Investment"],
"amount": 32455991
},
{
"description": "Net operating income",
"section": "Income-Expense",
"analyte": ["Operating"],
"amount": 108826889
},
{
"description": "Staff costs",
"section": "Income-Expense",
"analyte": ["Operating", "Wages"],
"amount": 40526668
},
{
"description": "Depreciation and amortisation",
"section": "Income-Expense",
"analyte": ["Operating", "Depreciation"],
"amount": 8529471
},
{
"description": "Finance cost",
"section": "Income-Expense",
"analyte": ["Operating", "Interest"],
"amount": 1411727
},
{
"description": "Other operating expenses",
"section": "Income-Expense",
"analyte": ["Operating"],
"amount": 31097938
},
{
"description": "Operating Profit",
"section": "Income-Expense",
"analyte": ["Operating"],
"amount": 27261085
},
{
"description": "Taxation",
"section": "Income-Expense",
"analyte": ["Tax"],
"amount": -690064
},
{
"description": "NET PROFIT",
"section": "Income-Expense",
"analyte": ["Operating"],
"amount": 26883412
},
{
"description": "Earnings per stock unit",
"section": "Earnings per Stock",
"analyte": ["Shares"],
"amount": 8.01
}
]
}
This JSON structure adheres to the format you requested, categorizing each line item based on its description, section, analyte classification, and the associated amount.
If you want to avoid manual editing of text or tables and retain the format, you can follow this more automated approach:
Certain PDF tools are designed to extract content and retain the exact formatting automatically. These tools can convert the scanned images to text (via OCR) and preserve the layout (tables, images, fonts, etc.).
Here are a few options:
In Adobe Acrobat, open Tools > Enhance Scans, choose Recognize Text, and select the option to recognize text in the entire document. You can then export the document as a new PDF while preserving the formatting.
ABBYY FineReader is known for its high-quality OCR and ability to keep the layout intact: run Document > OCR and, once the OCR completes, save the document as a new PDF. The software will automatically recreate the tables, fonts, and text boxes in the correct format.
You can also try online OCR platforms.
If you'd prefer an automated, programmatic approach, you can use scripting to process the PDF and maintain its format:
Use the pdfplumber or pdf2image libraries to extract images and run OCR.
Use pdfplumber or PyMuPDF to extract the layout data (positions, tables).
Rebuild the PDF with the reportlab or pdfkit libraries, adding text and images at the correct positions.
Here's a Node.js script that reads the CSV file, processes each row, and generates SQL INSERT commands for a table called [stocks].[dbo].[StockIndices]. You'll need the csv-parser package to read the CSV file.
Install the csv-parser package:
npm install csv-parser
Here's the script:
const fs = require('fs');
const csv = require('csv-parser');

// Path to your CSV file
const filePath = 'path/to/your/file.csv';

const insertSQL = [];

fs.createReadStream(filePath)
  .pipe(csv())
  .on('data', (row) => {
    const { Date, MarketIndexNo, Value } = row;
    // Create SQL INSERT statement for each row
    const sql = `
INSERT INTO [stocks].[dbo].[StockIndices] (Date, MarketIndexNo, Value)
VALUES ('${Date}', ${MarketIndexNo}, ${Value});
`;
    insertSQL.push(sql);
  })
  .on('end', () => {
    console.log('CSV file processed. SQL commands:');
    insertSQL.forEach(sql => console.log(sql));
    // Optionally write the SQL commands to a file
    fs.writeFileSync('insert_commands.sql', insertSQL.join('\n'));
  })
  .on('error', (err) => {
    console.error('Error reading CSV file:', err);
  });
The script reads the CSV with csv-parser, builds an INSERT statement for each row by interpolating the Date, MarketIndexNo, and Value columns, collects the statements in the insertSQL array, and optionally writes them to insert_commands.sql.
If your CSV file has the following rows:
Date,MarketIndexNo,Value
2024-09-07,1,100.5
2024-09-08,2,101.7
It will generate:
INSERT INTO [stocks].[dbo].[StockIndices] (Date, MarketIndexNo, Value)
VALUES ('2024-09-07', 1, 100.5);
INSERT INTO [stocks].[dbo].[StockIndices] (Date, MarketIndexNo, Value)
VALUES ('2024-09-08', 2, 101.7);
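One caveat: the script splices CSV values straight into the SQL text, so a stray quote in the data would break (or worse, alter) the statement. A parameterized variant sidesteps this; sketched here in Python with the ?-placeholder style accepted by SQL Server drivers such as pyodbc:

```python
import csv
import io

SQL = ("INSERT INTO [stocks].[dbo].[StockIndices] "
       "(Date, MarketIndexNo, Value) VALUES (?, ?, ?)")

def rows_to_params(csv_text):
    """Turn CSV rows into parameter tuples for a driver-level executemany."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["Date"], int(row["MarketIndexNo"]), float(row["Value"]))
            for row in reader]
```

The tuples can then be passed to cursor.executemany(SQL, rows_to_params(text)), letting the driver handle quoting and type conversion.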
If the Date field isn't in the correct format for SQL Server, you may need to adjust it.
A Common Table Expression SELECT query to get the ValueChange value from new market indices records:
WITH cte AS (
SELECT
i.[No],
i.[Date],
i.[MarketIndexNo],
i.[Value],
LAG(i.[Value]) OVER (PARTITION BY i.[MarketIndexNo] ORDER BY i.[Date]) AS PrevValue
FROM [stocks].[dbo].[StockIndices] i
)
SELECT
cte.[Date],
cte.[MarketIndexNo],
cte.[Value],
cte.[Value] - cte.[PrevValue] AS ValueChange,
si.[MarketIndexNo],
mi.[Name]
FROM [cte]
INNER JOIN [stocks].[dbo].[StockIndices] si ON si.[No] = cte.[No]
INNER JOIN [stocks].[dbo].[MarketIndices] mi ON mi.[No] = si.[MarketIndexNo]
WHERE cte.Date = si.Date
AND cte.MarketIndexNo = si.MarketIndexNo
AND si.[LogNo] = 176317;
Here's the update query that was eventually used,
WITH cte AS (
SELECT
i.[No],
i.[Date],
i.[MarketIndexNo],
i.[Value],
LAG(i.[Value]) OVER (PARTITION BY i.[MarketIndexNo] ORDER BY i.[Date]) AS PrevValue
FROM [stocks].[dbo].[StockIndices] i
)
UPDATE si
SET [ValueChange] = ISNULL(cte.[Value] - cte.PrevValue, 0)
FROM [cte]
INNER JOIN [stocks].[dbo].[StockIndices] si ON si.[No] = cte.[No]
INNER JOIN [stocks].[dbo].[MarketIndices] mi ON mi.[No] = si.[MarketIndexNo]
WHERE cte.[Date] = si.[Date]
AND cte.MarketIndexNo = si.MarketIndexNo
AND si.[LogNo] = 176317;
A query to review the data,
SELECT i.[Value]
, i.[ValueChange]
, i.[Date]
, l.[Logged]
, i.[MarketIndexNo]
, m.[Name]
FROM [stocks].[dbo].[StockIndices] i
INNER JOIN [stocks].[dbo].[Logs] l ON l.[No] = i.[LogNo]
INNER JOIN [stocks].[dbo].[MarketIndices] m ON m.[No] = i.[MarketIndexNo]
WHERE i.[LogNo] = 176317
ORDER BY l.[Logged] DESC
Dividend information is tricky in that many companies that were de-listed no longer have their corporate actions available on the Jamaica Stock Exchange.
I printed a page of the scanned document of the 2016 NCB Annual Report, in particular page 153, which has the "Consolidated Income Statement". I then used ocrmypdf to convert the 2016 NCB Annual Report into a format that maintained the layout but was readable by the GPT (meaning I could extract the text).
The ocrmypdf command was as follows:
ocrmypdf --redo-ocr NCB-Annual-Report-2016.pdf NCB-Annual-2016-converted.pdf
The output was positive. In fact, the GPT delivered accurate readings:
[
{
"description": "Interest income",
"section": "Income-Expense",
"analyte": ["Operating", "Interest"],
"amount": 39156349
},
{
"description": "Interest expense",
"section": "Income-Expense",
"analyte": ["Operating", "Interest"],
"amount": -11032579
},
{
"description": "Net interest income",
"section": "Income-Expense",
"analyte": ["Operating", "Interest"],
"amount": 28123770
},
{
"description": "Fee and commission income",
"section": "Income-Expense",
"analyte": ["Operating"],
"amount": 13575872
},
{
"description": "Fee and commission expense",
"section": "Income-Expense",
"analyte": ["Operating"],
"amount": -2663405
},
{
"description": "Net fee and commission income",
"section": "Income-Expense",
"analyte": ["Operating"],
"amount": 10912467
},
{
"description": "Gain on foreign currency and investment activities",
"section": "Income-Expense",
"analyte": ["Investment"],
"amount": 4736122
},
{
"description": "Premium income",
"section": "Income-Expense",
"analyte": ["Other"],
"amount": 7991693
},
{
"description": "Dividend income",
"section": "Income-Expense",
"analyte": ["Investment"],
"amount": 149921
},
{
"description": "Other operating income",
"section": "Income-Expense",
"analyte": ["Other"],
"amount": 143736
},
{
"description": "Staff costs",
"section": "Income-Expense",
"analyte": ["Wages"],
"amount": 13809023
},
{
"description": "Provision for credit losses",
"section": "Income-Expense",
"analyte": ["Other"],
"amount": 612355
},
{
"description": "Policyholders’ and annuitants’ benefits and reserves",
"section": "Income-Expense",
"analyte": ["Reserves"],
"amount": 4292643
},
{
"description": "Depreciation and amortisation",
"section": "Income-Expense",
"analyte": ["Depreciation"],
"amount": 1899414
},
{
"description": "Impairment losses on securities",
"section": "Income-Expense",
"analyte": ["Impairment"],
"amount": 79765
},
{
"description": "Other operating expenses",
"section": "Income-Expense",
"analyte": ["Operating", "Other"],
"amount": 13348202
},
{
"description": "Share of profit of associates",
"section": "Income-Expense",
"analyte": ["Other"],
"amount": 832480
},
{
"description": "Loss on dilution of investment in associate",
"section": "Income-Expense",
"analyte": ["Investment"],
"amount": -50748
},
{
"description": "Taxation",
"section": "Income-Expense",
"analyte": ["Tax"],
"amount": -4479992
},
{
"description": "Net profit attributable to stockholders of the Bank",
"section": "Income-Expense",
"analyte": ["Shareholders"],
"amount": 15636446
},
{
"description": "Net profit attributable to non-controlling interest",
"section": "Income-Expense",
"analyte": ["NonControlling"],
"amount": -1187886
},
{
"description": "Basic earnings per stock unit",
"section": "Earnings per Stock",
"analyte": ["Basic", "Shares"],
"amount": 6.35
},
{
"description": "Diluted earnings per stock unit",
"section": "Earnings per Stock",
"analyte": ["Diluted", "Shares"],
"amount": 5.87
}
]
Yes, you can use pdfimages, a command-line tool available in Ubuntu, to check for pages that contain scanned images in a PDF. This tool is part of the poppler-utils package and is useful for extracting images from PDF files, but you can also use it to identify which pages contain raster (scanned) images.
Install pdfimages (if not already installed):
sudo apt update
sudo apt install poppler-utils
Extract images from the PDF: run pdfimages on your PDF file to extract all the images and see which pages contain them. For example:
pdfimages -list input.pdf
This command lists all the images in the PDF, along with their respective page numbers. The output includes:
page num type width height color comp bpc enc interp object ID x-ppi y-ppi size ratio
--------------------------------------------------------------------------------------------
1 0 image 1654 2339 rgb 3 8 image no 9 0 300 300 251K 4.1%
2 1 image 1654 2339 rgb 3 8 image no 14 0 300 300 245K 4.0%
In the type column, image means a raster image. To find the pages with scanned images, filter the pdfimages output:
pdfimages -list input.pdf | grep -E "^ *[0-9]+ +image"
This will list only the pages where raster images are found.
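The same filter can be done in Python, which makes the page list easier to feed into later steps. A small sketch, assuming the -list output format shown above:

```python
def pages_with_images(listing):
    """Parse `pdfimages -list` output; return sorted pages holding raster images."""
    pages = set()
    for line in listing.splitlines()[2:]:  # first two lines are header and rule
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "image":
            pages.add(int(fields[0]))
    return sorted(pages)
```

The resulting page numbers can drive a targeted OCR pass instead of re-OCRing the whole document.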
If you need more sophisticated analysis (e.g., identifying whether a page is purely scanned without text layers), you can combine this with pdffonts (also part of poppler-utils), which can tell you if a page has text content (i.e., fonts), meaning it's not purely scanned. Run:
pdffonts input.pdf
In short, pdfimages is the best tool for identifying which pages in a PDF contain scanned images by listing all images and their associated pages.
You can capture the output from the pdfimages command and transform it into JSON format using shell tools like awk or sed, along with a simple script. Here's a method to do this in Ubuntu.
First, run the pdfimages -list command to get the output:
pdfimages -list input.pdf
Then use Bash with awk or sed to format the output into JSON. Here's an example bash script that converts the pdfimages -list output into JSON:
#!/bin/bash
# Get the list of images from the PDF
pdfimages -list "$1" | awk '
BEGIN {
# Initialize the JSON array
print "["
first = 1
}
NR > 2 {
# Skip the header lines
if (first == 0) {
print ","
}
first = 0
# Print each entry in JSON format
printf " {\n"
printf " \"page\": %d,\n", $1
printf " \"num\": %d,\n", $2
printf " \"type\": \"%s\",\n", $3
printf " \"width\": %d,\n", $4
printf " \"height\": %d,\n", $5
printf " \"color\": \"%s\",\n", $6
printf " \"comp\": %d,\n", $7
printf " \"bpc\": %d,\n", $8
printf " \"enc\": \"%s\",\n", $9
printf " \"interp\": \"%s\",\n", $10
printf " \"object_id\": \"%d %d\",\n", $11, $12
printf " \"x_ppi\": %d,\n", $13
printf " \"y_ppi\": %d,\n", $14
printf " \"size\": \"%s\",\n", $15
printf " \"ratio\": \"%s\"\n", $16
printf " }"
}
END {
# End the JSON array
print "\n]"
}'
Save the script as pdfimages_to_json.sh, make it executable, and run it:
chmod +x pdfimages_to_json.sh
./pdfimages_to_json.sh input.pdf > output.json
This script will output the pdfimages results in JSON format, and the result will look like this:
[
{
"page": 1,
"num": 0,
"type": "image",
"width": 1654,
"height": 2339,
"color": "rgb",
"comp": 3,
"bpc": 8,
"enc": "image",
"interp": "no",
"object_id": "9 0",
"x_ppi": 300,
"y_ppi": 300,
"size": "251K",
"ratio": "4.1%"
},
{
"page": 2,
"num": 1,
"type": "image",
"width": 1654,
"height": 2339,
"color": "rgb",
"comp": 3,
"bpc": 8,
"enc": "image",
"interp": "no",
"object_id": "14 0",
"x_ppi": 300,
"y_ppi": 300,
"size": "245K",
"ratio": "4.0%"
}
]
The script runs pdfimages -list to get the image information from the PDF, skips the header lines, and emits each image record as a JSON object. You can now use this JSON for further processing or analysis.
To print specific pages from a PDF document, you can use several tools depending on your needs and the platform you're on: Adobe Acrobat Reader, other PDF software (e.g., Foxit Reader, PDF-XChange Editor), online PDF tools, Preview on macOS, or command line tools such as:
pdftk input.pdf cat 5-10 output output.pdf
qpdf input.pdf --pages . 5-10 -- output.pdf
These tools should cover most of your needs for printing specific pages from a PDF.
ocrmypdf took 30 minutes to redo the OCR conversion of a 49.3 MB file containing 302 pages (i.e. the 2016 NCB Annual Report). It stressed my computer, but it accurately generated an overlay of text for all pages and produced a 90.1 MB file mimicking the input document. Excellent!
Now, the JSE lists about 200 companies. This means about 200 annual reports per year, plus 3 quarterly reports per company per year (roughly 600 more documents). It's worth noting that the document tested was the biggest annual report NCB has ever produced, so such an excellent result in 30 minutes is good. This will be a one-time operation that dumps information into the structured database.
So, now that we can extract information from the PDF documents, there still needs to be a more intelligent and optimal way to identify and extract only what is needed. Hence we need tools to read and split pages out of the original document, which is the reason for looking at qpdf.
The following command took milliseconds to complete,
qpdf NCB-Annual-Report-2016.pdf --pages . 138-165 -- output.pdf
28 pages were extracted into the output document, output.pdf.
To determine which pages of a PDF contain scanned images using pdf2image, you can follow these general steps:
Convert PDF Pages to Images:
First, convert the pages of the PDF to image files. Note that pdf2image is a Python library that wraps poppler's pdftoppm; on the command line, pdftoppm does the same job directly. This will give you images of each page, which you can then analyze to check for scanned content.
pdftoppm -png input.pdf output_directory/page
This command will convert each page of input.pdf into a PNG image (page-1.png, page-2.png, and so on) in output_directory.
Analyze Images for Scanned Content: To check if the images contain text, you would use Optical Character Recognition (OCR) tools like Tesseract to recognize any text in the images. Here's an example workflow using Tesseract:
Install Tesseract (if not already installed):
sudo apt-get install tesseract-ocr
Run Tesseract on Each Image: You can loop through the images and use Tesseract to see if it detects any text. For example, you could use a script to process each image file:
for img in output_directory/*.png; do
  # Tesseract exits 0 even when it finds nothing, so test the
  # recognized text itself rather than the exit status
  text=$(tesseract "$img" stdout 2>/dev/null | tr -d '[:space:]')
  if [ -z "$text" ]; then
    echo "No text found in $img"
  else
    echo "Text found in $img"
  fi
done
This script runs Tesseract on each image file and checks whether any text was recognized. If text is found, it indicates the image likely contains scanned text.
Note: pdf2image alone cannot locate scanned images; the process involves converting the PDF to images and then analyzing those images. If you need more advanced analysis, combining page-to-image conversion with OCR tools like Tesseract will help you determine which pages contain scanned text.
A very challenging document to extract information from is going to be the 2016 and 2019 Proven Annual Reports. The challenge comes from a folding page concept that fits two pages of information into one PDF page with a landscape print layout.
To kick off a C++ project that integrates external tools like ocrmypdf, pdfimages, qpdf, and a GPT API for analyzing financial statements, you can follow these key steps. The project will involve managing PDF manipulation and OCR processes while analyzing and transforming financial data into JSON payloads. You'll use C++ as the core language and Autotools for build automation.
Autotools automates the build process for your C++ project. Here's how to set it up:
Directory structure:
project-root/
├── src/
├── include/
├── configure.ac
├── Makefile.am
├── autogen.sh
└── README.md
Initialize Autotools Files:
configure.ac: Contains macros and checks for your project setup.
Makefile.am: Defines how to build the project.
Create configure.ac:
AC_INIT([Financial PDF Analyzer], [0.1], [your-email@example.com])
AM_INIT_AUTOMAKE
AC_PROG_CXX
AC_PROG_RANLIB
AC_PROG_LIBTOOL
AC_CONFIG_FILES([Makefile src/Makefile])
AC_OUTPUT
Create Makefile.am:
SUBDIRS = src
Create src/Makefile.am:
bin_PROGRAMS = pdf_analyzer
pdf_analyzer_SOURCES = main.cpp pdf_utility.cpp ocr.cpp gpt.cpp
Autotools Setup:
Create autogen.sh:
#!/bin/sh
autoreconf --install
./configure
make
Run autogen.sh to set up the project.
Calling the External Tools (ocrmypdf, pdfimages, qpdf)
These tools will be called from C++ using the system() function or popen() to capture output. Create a utility class to handle the calls to external tools.
Example: pdf_utility.cpp
#include <iostream>
#include <string>
#include <cstdlib> // for system()
#include <cstdio>  // for popen()

// Quote paths before handing them to the shell so file names
// containing spaces survive intact.
static std::string quoted(const std::string &s) {
    return "\"" + s + "\"";
}

void runOcrMyPdf(const std::string &inputPdf, const std::string &outputPdf) {
    std::string command = "ocrmypdf " + quoted(inputPdf) + " " + quoted(outputPdf);
    system(command.c_str());
}

void runPdfImages(const std::string &inputPdf, const std::string &outputDir) {
    // pdfimages -list prints a plain-text table; convert it to JSON
    // separately (e.g. with the awk script above) if needed
    std::string command = "pdfimages -list " + quoted(inputPdf) + " > " + outputDir + "/images.txt";
    system(command.c_str());
}

void runQpdf(const std::string &inputPdf, const std::string &outputPdf) {
    std::string command = "qpdf --decrypt " + quoted(inputPdf) + " " + quoted(outputPdf);
    system(command.c_str());
}
You can integrate the GPT API using a library like cURL for making HTTP requests.
Install cURL:
sudo apt-get install libcurl4-openssl-dev
Example: gpt.cpp
#include <iostream>
#include <string>
#include <curl/curl.h>
#include <nlohmann/json.hpp>

std::string callGptApi(const std::string &apiKey, const std::string &prompt) {
    CURL *curl;
    CURLcode res;
    std::string response;
    curl = curl_easy_init();
    if (curl) {
        struct curl_slist *headers = NULL;
        headers = curl_slist_append(headers, "Content-Type: application/json");
        headers = curl_slist_append(headers, ("Authorization: Bearer " + apiKey).c_str());
        // gpt-4 is a chat model, so use the chat completions endpoint
        // with a messages array rather than the legacy prompt field
        nlohmann::json jsonPayload = {
            {"model", "gpt-4"},
            {"messages", nlohmann::json::array({
                {{"role", "user"}, {"content", prompt}}
            })},
            {"max_tokens", 1000}
        };
        std::string data = jsonPayload.dump();
        curl_easy_setopt(curl, CURLOPT_URL, "https://api.openai.com/v1/chat/completions");
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, data.c_str());
        // Captureless lambda (converted to a function pointer) appends
        // each chunk of the HTTP response to the string
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, +[](char *ptr, size_t size, size_t nmemb, void *userdata) -> size_t {
            static_cast<std::string *>(userdata)->append(ptr, size * nmemb);
            return size * nmemb;
        });
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &response);
        res = curl_easy_perform(curl);
        if (res != CURLE_OK)
            std::cerr << "curl_easy_perform() failed: " << curl_easy_strerror(res) << std::endl;
        curl_slist_free_all(headers);
        curl_easy_cleanup(curl);
    }
    return response;
}
Use ocrmypdf to extract text from scanned PDFs.
pdfimages will help you detect and extract images from PDFs.
qpdf will be useful for handling encrypted or secured PDFs.
Once you have text data (from OCR or extracted content), use the GPT API to analyze and convert it into structured data like JSON.
Example: main.cpp
#include "pdf_utility.h"
#include "gpt.h"
#include <iostream>
#include <string>
int main(int argc, char *argv[]) {
if (argc < 2) {
std::cerr << "Usage: pdf_analyzer <input_pdf>" << std::endl;
return 1;
}
std::string inputPdf = argv[1];
std::string outputPdf = "output.pdf";
// Run PDF tools
runOcrMyPdf(inputPdf, outputPdf);
runPdfImages(inputPdf, "outputDir");
runQpdf(inputPdf, "output_decrypted.pdf");
// Call GPT API for analysis
std::string apiKey = "your-gpt-api-key";
std::string analysis = callGptApi(apiKey, "Analyze this financial statement");
std::cout << "Analysis: " << analysis << std::endl;
return 0;
}
After setting up the project structure and Autotools configuration, build and run the project:
./autogen.sh
./configure
make
./pdf_analyzer input.pdf
Test the integration between the external tools (ocrmypdf, pdfimages, qpdf) and your application by providing sample PDFs.
This should provide you with a solid foundation to kick off the project. Let me know if you'd like additional details on any step!
To make manual entry of statement items easier for users, the following is suggested,