BobLd / tabula-sharp

Extract tables from PDF files (port of tabula-java)
MIT License

Merged columns when extracting tables #16

Closed emrebiber closed 3 years ago

emrebiber commented 3 years ago

Hi @BobLd ,

Thanks for such an awesome library. In my PDF file, two columns are merged for some reason when I try to extract the table (I attached a couple of images). Could you help me determine what might cause this? I can also send the PDF file via email, since it has sensitive information. Thanks in advance.

pdf dotnet

Originally posted by @emrebiber in https://github.com/BobLd/tabula-sharp/issues/13#issuecomment-768074036

BobLd commented 3 years ago

Hi @emrebiber, thanks for raising an issue.

I think the issue could come from two sources:

Possible quick win: can you try using camelot-sharp instead of tabula-sharp? It has the same goal but works differently. The NuGet package is available, but only as a pre-release.

emrebiber commented 3 years ago

Hi again,

I did use SimpleNurminenDetectionAlgorithm and BasicExtractionAlgorithm. Here is the code snippet:

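// Assumed namespaces: Tabula, Tabula.Detectors, Tabula.Extractors, UglyToad.PdfPig
using (var document = PdfDocument.Open(model.File.InputStream, new ParsingOptions() { ClipPaths = true }))
{
    // Extract the first page (page numbers are 1-based).
    ObjectExtractor oe = new ObjectExtractor(document);
    PageArea pageArea = oe.Extract(1);

    // Detect candidate table regions on the page.
    var detector = new SimpleNurminenDetectionAlgorithm();
    var regions = detector.Detect(pageArea);

    // Run the basic extraction on the first detected region only.
    IExtractionAlgorithm ea = new BasicExtractionAlgorithm();
    var tables = ea.Extract(pageArea.GetArea(regions[0].BoundingBox));
    var table = tables[0];
    var rows = table.Rows;
}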

I also attached the PDF file I'm testing with: Test.pdf

I can certainly try camelot-sharp, but do you have any instructions on how to use it?

BobLd commented 3 years ago

Looking at the PDF, my first guess is that the two columns are merged because of this word, which spans both columns: [image] It is going to be difficult for the algorithm to know how to split that. In which column would you put this word?
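
To make the point concrete, here is a minimal sketch of gap-based column detection. It is a simplified illustration only, not tabula-sharp's actual implementation: `ColumnSketch` and `FindColumns` are hypothetical names, and each word is reduced to an assumed horizontal extent (left, right). Words whose extents overlap are treated as belonging to the same column, so a single word that bridges the gap between two columns fuses them.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of tabula-sharp.
public static class ColumnSketch
{
    // Merges overlapping horizontal word extents into column extents.
    // Assumption: a column is a maximal run of horizontally overlapping words.
    public static List<(double Left, double Right)> FindColumns(
        IEnumerable<(double Left, double Right)> words)
    {
        var columns = new List<(double Left, double Right)>();

        foreach (var word in words.OrderBy(w => w.Left))
        {
            if (columns.Count > 0 && word.Left <= columns[columns.Count - 1].Right)
            {
                // The word overlaps the current column: widen that column.
                var last = columns[columns.Count - 1];
                columns[columns.Count - 1] = (last.Left, Math.Max(last.Right, word.Right));
            }
            else
            {
                // There is a gap before this word: start a new column.
                columns.Add(word);
            }
        }

        return columns;
    }
}
```

With the extents (10, 40), (12, 38), (60, 90) and (62, 88), this yields two columns. Adding one word with extent (35, 65) closes the gap, and the same call returns a single column, which is the kind of merge described above.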

Concerning Camelot, I took this from my tests:

Lattice lattice = new Lattice(new OpenCvImageProcesser(), new BasicSystemDrawingProcessor());
// In your case, it is better to use the Stream BaseParser (rather than the Lattice parser)
var tables = lattice.ExtractTables(doc.GetPage(1), layout_kwargs: null);

emrebiber commented 3 years ago

You are right, that's the word merging those columns together. Thanks for your help. This library is so useful.