atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Question: Why is position of first character of input row changed to last character of the same line in output table? #215

Closed amdhacks closed 5 years ago

amdhacks commented 5 years ago

Will share input file shortly. Unable to share from restricted network at this moment.

amdhacks commented 5 years ago

You can see in the snapshot below that the first character of rows 2-10 is cropped and appended as the last character in the output table. Snapshot: issue

The input file is: document-page3.pdf

Any idea why this is happening and how it can be fixed?

Thanks.

abhibisht89 commented 5 years ago

I am also having the same issue with the position of the first character.

vinayak-mehta commented 5 years ago

Hi @amdhacks and @abhibisht89! Thanks for the report. This is a known issue (more details in #170 and #213). You can expect a fix by the end of this week. Until then, you can try the workaround mentioned in #170: find the path where camelot is installed and pass detect_vertical=False in base.py.

amdhacks commented 5 years ago

My base.py file after the change is given below:

# -*- coding: utf-8 -*-

import os

from ..utils import get_page_layout, get_text_objects

class BaseParser(object):
    """Defines a base parser.
    """
    def _generate_layout(self, filename):
        self.filename = filename
        self.layout, self.dimensions = get_page_layout(
            self.filename,
            char_margin=self.char_margin,
            line_margin=self.line_margin,
            word_margin=self.word_margin)
        self.horizontal_text = get_text_objects(self.layout, ltype="lh")
        self.vertical_text = get_text_objects(self.layout, ltype="lv")
        self.pdf_width, self.pdf_height = self.dimensions
        self.rootname, __ = os.path.splitext(self.filename)
        self.detect_vertical = False

But I do not see any improvement. First character is still shown as last character of the same row in output.

Am I missing something?

vinayak-mehta commented 5 years ago

Sorry, I should've been more specific. You need to add detect_vertical=False as an argument to the get_page_layout call, rather than setting it as an attribute on self. You can check out its definition in utils.py.
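A minimal sketch of what that change could look like, written as a standalone function that mirrors the `_generate_layout` method pasted above. It assumes that `get_page_layout` in utils.py accepts a `detect_vertical` keyword, and that the workaround value from #170 is `False`, as in the earlier comment. The name `generate_layout` and the `parser` argument are hypothetical, used only so the snippet stands on its own.

```python
import os


def generate_layout(parser, filename):
    """Sketch of BaseParser._generate_layout with the workaround applied.

    `parser` is any object carrying char_margin, line_margin and
    word_margin, as the parser in the base.py paste does.
    """
    # Imported lazily so the sketch can be loaded without camelot installed.
    from camelot.utils import get_page_layout, get_text_objects

    parser.filename = filename
    parser.layout, parser.dimensions = get_page_layout(
        parser.filename,
        char_margin=parser.char_margin,
        line_margin=parser.line_margin,
        word_margin=parser.word_margin,
        # The workaround: pass the flag to the call itself. Setting
        # self.detect_vertical afterwards never reaches get_page_layout.
        detect_vertical=False,
    )
    parser.horizontal_text = get_text_objects(parser.layout, ltype="lh")
    parser.vertical_text = get_text_objects(parser.layout, ltype="lv")
    parser.pdf_width, parser.pdf_height = parser.dimensions
    parser.rootname, __ = os.path.splitext(parser.filename)
```

The only difference from the pasted base.py is the extra keyword argument inside the `get_page_layout(...)` call; the trailing `self.detect_vertical = False` assignment is dropped because nothing reads it.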

vinayak-mehta commented 5 years ago

@amdhacks This is fixed now. It will be more configurable after #170.

amdhacks commented 5 years ago

@vinayak-mehta, I have updated camelot to version 0.4.1, but the first character, which earlier showed up as the last one, now appears as a single character in the first column, as shown below.

Output now:
C lass A
N et Asset Value at 31 December 5,111,372
N umber of outstanding units at 31 December 49,136
N et Asset Value per unit at 31 December 104.03
C lass B
N et Asset Value at 31 December 49,144,825
N umber of outstanding units at 31 December 471,555
N et Asset Value per unit at 31 December 104.22

Please suggest. Thanks.

vinayak-mehta commented 5 years ago

@amdhacks Please install the latest version i.e. v0.5.0. The table isn't being detected correctly for this case. You'll need to specify a table area.

$ camelot --format csv --output output.csv stream -T 70,690,550,170 input.pdf
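For anyone calling camelot from Python rather than the CLI, the same extraction might look roughly like the sketch below. It assumes `camelot.read_pdf` with `flavor="stream"` and a `table_areas` list of `"x1,y1,x2,y2"` strings (top-left and bottom-right corners in PDF coordinates). The keyword has been spelled differently across releases (older versions such as v0.5.0 may use `table_area`), so check the installed version. The helper names `area_string` and `read_stream_table` are hypothetical.

```python
def area_string(area):
    """Format (x1, y1, x2, y2) as the comma-separated string camelot expects."""
    return "{},{},{},{}".format(*area)


def read_stream_table(pdf_path, area, csv_path):
    """Extract the table inside `area` with the stream flavor and export it as CSV."""
    import camelot  # imported lazily so the helper above works without camelot

    tables = camelot.read_pdf(
        pdf_path,
        flavor="stream",
        # Assumption: recent releases name this keyword `table_areas`.
        table_areas=[area_string(area)],
    )
    tables.export(csv_path, f="csv")
    return tables


if __name__ == "__main__":
    # Same area as the CLI call: top-left (70, 690), bottom-right (550, 170).
    read_stream_table("input.pdf", (70, 690, 550, 170), "output.csv")
```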

sjm20066 commented 4 years ago

@vinayak-mehta I'm having the exact same issue as @amdhacks, and I have version 0.7.3. The output shows the same problem:

C lass A
N et Asset Value at 31 December 5,111,372
N umber of outstanding units at 31 December 49,136
N et Asset Value per unit at 31 December 104.03
C lass B
N et Asset Value at 31 December 49,144,825
N umber of outstanding units at 31 December 471,555
N et Asset Value per unit at 31 December 104.22

Also, the table boundaries in the input PDF are as clearly defined as they can be. Is there any solution for this?
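No fix is given in this thread for the split first letter. As a stopgap, the extracted rows could be cleaned up after the fact. The sketch below is not part of camelot and not a maintainer-endorsed fix. It assumes each row is a list of cell strings and that the stray letter lands in a cell of its own at the start of the row, which the quoted output suggests but does not confirm. The function name `merge_stray_first_char` is hypothetical.

```python
def merge_stray_first_char(rows):
    """Glue a lone leading character back onto the cell that follows it.

    A row such as ["C", "lass A"] becomes ["Class A"]. Rows whose first
    cell is not exactly one character are returned unchanged.
    """
    fixed = []
    for row in rows:
        cells = list(row)
        if len(cells) >= 2 and len(cells[0].strip()) == 1 and cells[1].strip():
            # Join the stray letter with the next cell and drop the extra column.
            cells = [cells[0].strip() + cells[1].lstrip()] + cells[2:]
        fixed.append(cells)
    return fixed


rows = [
    ["C", "lass A", ""],
    ["N", "et Asset Value at 31 December", "5,111,372"],
]
cleaned = merge_stray_first_char(rows)
```

This would also merge a legitimate single-character first cell (for example a one-letter label column), so it is only safe when the table is known to have the defect described above.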