BobLd / tabula-sharp

Extract tables from PDF files (port of tabula-java)
MIT License

Extract table issue - don't get an underscore #17

Closed Crypto49 closed 3 years ago

Crypto49 commented 3 years ago

Hi @BobLd,

Many thanks also from me for the helpful library!

I used the SpreadsheetExtractionAlgorithm to try to extract tables from a PDF file (sample attached).

The names from the table are intended to represent variable names for a computer program.

Unfortunately, the result is shown without the underscore:

I get "Content" instead of "C_ontent"

[Image: Example]

In SpreadsheetExtractionAlgorithm.cs, c.SetTextElements(TextElement.MergeWords(page.GetText(c.BoundingBox))); is called, which executes this.textElements = textElements; in public void SetTextElements(List<T> textElements){} in RectangularTextContainer.cs.

I found that the order of the elements differs before and after the assignment this.textElements = textElements;:

List of values before assignment:

[Image: assigning_1]

List of values after assignment:

[Image: assigning_2]

I think the trailing part is cut off later.

Unfortunately I haven't found a solution. I would be very grateful for any help.

Example PDF.pdf

BobLd commented 3 years ago

Hi @Crypto49, thanks for your feedback. I gave your document a try and it appears the problem comes from the GetText() function: https://github.com/BobLd/tabula-sharp/blob/343a387e6052825c7832a1ad6709efaa62f5e52d/Tabula/Cell.cs#L64-L84

It uses the ILL_DEFINED_ORDER() sorting scheme, which is apparently far from perfect (hence the name 😄, it's taken from the original Java library).

Short term, I would recommend you implement your own GetText() function. On my side, I will need to find a way to give more flexibility there. One negative aspect I can already spot is that the sorting is done in place, which is why the order of the characters is changed (if you don't use GetText(), the order is correct):

Utils.Sort(this.textElements, new ILL_DEFINED_ORDER());
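
A minimal sketch of one way to avoid that side effect: sort a copy and leave the stored list alone. It uses a plain List<string> as a stand-in for textElements and an ordinal string comparer as a stand-in for ILL_DEFINED_ORDER; the SortedCopy helper is hypothetical and not part of the library.

```csharp
using System;
using System.Collections.Generic;

public static class SortSideEffectDemo
{
    // Hypothetical helper: returns a sorted copy and leaves 'source' untouched,
    // so other code that reads 'source' still sees the original order.
    public static List<T> SortedCopy<T>(List<T> source, IComparer<T> comparer)
    {
        var copy = new List<T>(source); // shallow copy of the list
        copy.Sort(comparer);            // List<T>.Sort mutates only the copy
        return copy;
    }
}
```

Calling source.Sort(comparer) directly would reorder source itself, which is the behaviour described above.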

Crypto49 commented 3 years ago

Many thanks for finding the cause of the problem.

During the analysis I noticed that the vertical text with an underscore is recognized correctly when useLineReturns is false (sample attached). In my function I simply left the line-return handling out. I use the Left property of each element in textElements to reorder the horizontal text.

Granted, I'm not a C# expert, but I have overridden the function as follows:

        /// <summary>
        /// Gets the cell's text.
        /// </summary>
        /// <param name="useLineReturns">Ignored in this override.</param>
        public override string GetText(bool useLineReturns)
        {
            if (this.textElements.Count == 0)
            {
                return "";
            }

            // Treat a cell that is wider than it is tall as containing horizontal text.
            bool textAlignmentHorizontal = this.BoundingBox.Width > this.BoundingBox.Height;

            if (textAlignmentHorizontal)
            {
                // Reorder horizontal text from left to right (sorts the list in place).
                this.textElements.Sort((a, b) => a.Left.CompareTo(b.Left));
            }

            StringBuilder sb = new StringBuilder();
            foreach (TextChunk tc in this.textElements)
            {
                sb.Append(tc.GetText());
            }

            return sb.ToString().Trim();
        }

I hope my solution doesn't affect other parts of the program, but for me it works.
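
One possible variant that does not reorder the cell's own list, sketched under assumptions: Enumerable.OrderBy returns a new sequence and is a stable sort, whereas List<T>.Sort sorts in place and is documented as unstable. The Chunk class and JoinText method below are hypothetical stand-ins so the sketch does not depend on the library's TextChunk type.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical stand-in for TextChunk: only the text and the left edge.
public sealed class Chunk
{
    public Chunk(string text, double left) { Text = text; Left = left; }
    public string Text { get; }
    public double Left { get; }
}

public static class CellTextSketch
{
    // Concatenates the chunk texts. For horizontal text the chunks are read
    // left to right; the input list itself is never modified.
    public static string JoinText(IReadOnlyList<Chunk> chunks, bool horizontal)
    {
        IEnumerable<Chunk> ordered = chunks;
        if (horizontal)
        {
            // OrderBy is stable: chunks with equal Left keep their input order.
            ordered = chunks.OrderBy(c => c.Left);
        }

        var sb = new StringBuilder();
        foreach (Chunk c in ordered)
        {
            sb.Append(c.Text);
        }
        return sb.ToString().Trim();
    }
}
```

Inside an actual GetText override, the same idea would mean iterating over this.textElements.OrderBy(e => e.Left) instead of calling Sort on the list.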

Thanks again for your help!

Example 2.pdf