BobLd / tabula-sharp

Extract tables from PDF files (port of tabula-java)
MIT License

tabula-sharp

tabula-sharp is a library for extracting tables from PDF files. It is a port of tabula-java.

Windows Linux Mac OS

NuGet packages are available on the releases page and on www.nuget.org.

Differences with tabula-java

Usage

Stream mode - BasicExtractionAlgorithm

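using System.Collections.Generic;
using Tabula;
using Tabula.Detectors;
using Tabula.Extractors;
using UglyToad.PdfPig;

using (PdfDocument document = PdfDocument.Open("doc.pdf", new ParsingOptions() { ClipPaths = true }))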
{
    ObjectExtractor oe = new ObjectExtractor(document);
    PageArea page = oe.Extract(1);

    // detect candidate table zones
    SimpleNurminenDetectionAlgorithm detector = new SimpleNurminenDetectionAlgorithm();
    var regions = detector.Detect(page);

    IExtractionAlgorithm ea = new BasicExtractionAlgorithm();
    List<Table> tables = ea.Extract(page.GetArea(regions[0].BoundingBox)); // take first candidate area
    var table = tables[0];
    var rows = table.Rows;
}

Lattice mode - SpreadsheetExtractionAlgorithm

using System.Collections.Generic;
using Tabula;
using Tabula.Extractors;
using UglyToad.PdfPig;

using (PdfDocument document = PdfDocument.Open("doc.pdf", new ParsingOptions() { ClipPaths = true }))
{
    ObjectExtractor oe = new ObjectExtractor(document);
    PageArea page = oe.Extract(1);

    IExtractionAlgorithm ea = new SpreadsheetExtractionAlgorithm();
    List<Table> tables = ea.Extract(page);
    var table = tables[0];
    var rows = table.Rows;
}
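
Both snippets above stop once `rows` is obtained. As a rough sketch of one possible next step, the helper below turns rows of cell text into a CSV string. The names `TableCsv`, `ToCsv` and `Escape` are hypothetical and not part of tabula-sharp. The helper works on plain strings, so it assumes the text of each cell has already been read out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of tabula-sharp: formats rows of cell text as CSV.
public static class TableCsv
{
    // Quote a field when it contains a comma, a double quote, or a line break;
    // embedded double quotes are doubled (RFC 4180 style).
    public static string Escape(string field)
    {
        field = field ?? string.Empty;
        bool needsQuotes = field.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes ? "\"" + field.Replace("\"", "\"\"") + "\"" : field;
    }

    // Join the cells of each row with commas, and the rows with "\n".
    public static string ToCsv(IEnumerable<IEnumerable<string>> rows)
    {
        return string.Join("\n",
            rows.Select(row => string.Join(",", row.Select(cell => Escape(cell)))));
    }
}
```

Inside either `using` block, this could be called as `TableCsv.ToCsv(table.Rows.Select(r => r.Select(c => c.GetText())))`, assuming each cell exposes a `GetText()` method. Check that against the version of the library in use.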

Results

Stream mode - BasicExtractionAlgorithm

(example image)

Lattice mode - SpreadsheetExtractionAlgorithm

(example image)