emrebiber closed this issue 3 years ago
Hi @emrebiber, thanks for raising an issue.
I think the issue could come from two sources:
Possible quick win: Can you try using camelot-sharp instead of tabula-sharp? It has the same goal but works differently. A NuGet package is available, but only as a pre-release.
Hi again,
DLAViewer: I ran it and attached the resulting XML here. I hope this helps. Test.txt
tabula-sharp
I used SimpleNurminenDetectionAlgorithm and BasicExtractionAlgorithm. Here is the code snippet:
using (var document = PdfDocument.Open(model.File.InputStream, new ParsingOptions() { ClipPaths = true }))
{
    // Load the first page (page numbers are 1-based)
    ObjectExtractor oe = new ObjectExtractor(document);
    PageArea pageArea = oe.Extract(1);

    // Detect candidate table regions on the page
    var detector = new SimpleNurminenDetectionAlgorithm();
    var regions = detector.Detect(pageArea);

    // Extract the table from the first candidate region
    IExtractionAlgorithm ea = new BasicExtractionAlgorithm();
    var tables = ea.Extract(pageArea.GetArea(regions[0].BoundingBox));
    var table = tables[0];
    var rows = table.Rows;
}
I also attached the PDF file that I'm testing with. Test.pdf
I can certainly try camelot-sharp, but do you have any instructions on how to use it?
Looking at the PDF, my first guess is that the two columns are merged because of a word that spans both of them. It is going to be difficult for the algorithm to know how to split that. In which column would you put this word?
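To illustrate why a single spanning word is enough to merge two columns, here is a minimal, self-contained sketch. It is not tabula-sharp's actual code: `ColumnSketch` and `MergeBands` are hypothetical names, and the approach (merging overlapping horizontal word extents into column bands) is only a simplified assumption about how a whitespace-based extractor might infer columns.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, for illustration only (not part of tabula-sharp).
public static class ColumnSketch
{
    // Merges the horizontal extents (Left, Right) of words into column bands.
    // Any extents that overlap horizontally end up in the same band.
    public static List<(double Left, double Right)> MergeBands(
        IEnumerable<(double Left, double Right)> words)
    {
        var bands = new List<(double Left, double Right)>();
        foreach (var w in words.OrderBy(x => x.Left))
        {
            if (bands.Count > 0 && w.Left <= bands[bands.Count - 1].Right)
            {
                // Overlaps the current band: extend it to the right if needed.
                var last = bands[bands.Count - 1];
                bands[bands.Count - 1] = (last.Left, Math.Max(last.Right, w.Right));
            }
            else
            {
                // Clear whitespace gap before this word: start a new band.
                bands.Add(w);
            }
        }
        return bands;
    }
}
```

With the made-up extents (10, 40), (12, 38), (60, 90) and (62, 88), `MergeBands` returns two bands, i.e. two columns. Adding one word with extent (35, 65), which overlaps both bands, makes it return a single band, so everything is treated as one column.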
Concerning Camelot, I took this from my tests:
Lattice lattice = new Lattice(new OpenCvImageProcesser(), new BasicSystemDrawingProcessor());
// In your case, it is better to use the Stream BaseParser (rather than the Lattice parser)
var tables = lattice.ExtractTables(doc.GetPage(1), layout_kwargs: null);
You are right, that's the word merging those columns together. Thanks for your help. This library is so useful.
Hi @BobLd ,
Thanks for such an awesome library. In my PDF file, for some reason, two columns are merged (I attached a couple of images) when I try to extract the table. I was wondering whether you could help me determine what might cause that issue. I can also send the PDF file via email since it has sensitive information. Thanks in advance.
Originally posted by @emrebiber in https://github.com/BobLd/tabula-sharp/issues/13#issuecomment-768074036