BobLd / tabula-sharp

Extract tables from PDF files (port of tabula-java)
MIT License

Sometimes SpreadsheetExtractionAlgorithm ignores last row. #25

Open esencu opened 1 year ago

esencu commented 1 year ago

This code in the Tabula.PageArea.GetArea method adds a horizontal ruling to the PageArea instance with its endpoints ordered from right to left: https://github.com/BobLd/tabula-sharp/blob/fe6e6e59be7f44130102737e10000abe1b15b3dd/Tabula/PageArea.cs#L161-L163 . Because of this, Tabula.Ruling.SortObjectComparer sorts the objects in the wrong order. As a result, the list of intersections returned by Tabula.Ruling.FindIntersections is incorrect, and the result of Tabula.Extractors.SpreadsheetExtractionAlgorithm.FindCells does not contain some cells that it should.

Proposed fix: change the Tabula.PageArea.GetArea method so that it adds the ruling with its endpoints ordered from left to right:

rv.AddRuling(new Ruling(
    new PdfPoint(rv.Left, rv.Bottom),
    new PdfPoint(rv.Right, rv.Bottom)));
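
To illustrate why the endpoint order can matter, here is a minimal, self-contained sketch of a left-to-right sweep over one horizontal segment and a few vertical lines. It is not the tabula-sharp implementation: `HSegment`, `SweepDemo`, and `CountCrossings` are hypothetical names invented for this example, and the sweep is heavily simplified. The assumption is that the sweep treats a segment's start point as its "open" event and its end point as its "close" event, so a segment built from right to left is closed before it is opened.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a horizontal ruling (NOT the tabula-sharp Ruling type).
public readonly record struct HSegment(double StartX, double EndX, double Y)
{
    // Returns the same segment with its endpoints ordered left to right.
    public HSegment Normalized() =>
        StartX <= EndX ? this : new HSegment(EndX, StartX, Y);
}

public static class SweepDemo
{
    // Counts the vertical lines (given by their x positions) that a simplified
    // left-to-right sweep reports as crossing the horizontal segment.
    // Event kinds: 0 = segment opens, 1 = vertical line, 2 = segment closes.
    public static int CountCrossings(HSegment h, IEnumerable<double> verticalXs)
    {
        var events = new List<(double X, int Kind)>
        {
            (h.StartX, 0), // assumed: the start point is treated as the "left" end
            (h.EndX, 2)    // assumed: the end point is treated as the "right" end
        };
        events.AddRange(verticalXs.Select(x => (x, 1)));

        // Sort by x; at equal x, open before vertical before close.
        events.Sort((a, b) =>
            a.X != b.X ? a.X.CompareTo(b.X) : a.Kind.CompareTo(b.Kind));

        bool active = false;
        int count = 0;
        foreach (var (_, kind) in events)
        {
            if (kind == 0) active = true;
            else if (kind == 2) active = false;
            else if (active) count++; // vertical line met while segment is open
        }
        return count;
    }
}
```

With vertical lines at x = 0, 50 and 100, `CountCrossings(new HSegment(100, 0, 50), ...)` returns 1, while the same segment after `Normalized()` returns 3. This is the same kind of loss described above: intersections go missing when a horizontal ruling is stored from right to left.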