Closed Crypto49 closed 3 years ago
Hi @Crypto49, thanks for your feedback. I gave your document a try, and it appears the problem comes from the GetText() function:
https://github.com/BobLd/tabula-sharp/blob/343a387e6052825c7832a1ad6709efaa62f5e52d/Tabula/Cell.cs#L64-L84
It uses the ILL_DEFINED_ORDER() sorting scheme, which is apparently far from perfect (hence the name 😄; it's taken from the original Java library).
Short term, I would recommend you implement your own GetText() function.
On my side, I will need to find a way to give more flexibility with that. One negative aspect I can already spot is that the sorting is done in place, which is why the order of the characters is changed (if you don't use GetText(), the order is correct):
Utils.Sort(this.textElements, new ILL_DEFINED_ORDER());
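To illustrate the in-place concern, here is a minimal sketch of sorting a copy instead of the stored list. It is not tabula-sharp code: InPlaceSortSketch and SortedCopy are hypothetical names, and the comparer is whatever ordering the caller supplies.

```csharp
using System.Collections.Generic;

// Hypothetical helper, not part of tabula-sharp.
public static class InPlaceSortSketch
{
    // Returns a sorted copy and leaves the source collection in its original order.
    public static List<T> SortedCopy<T>(IEnumerable<T> source, IComparer<T> comparer)
    {
        var copy = new List<T>(source); // copies the references only
        copy.Sort(comparer);            // List<T>.Sort reorders the copy, not the source
        return copy;
    }
}
```

With this approach, code that reads the original list afterwards still sees the elements in the order in which they were stored.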
Many thanks for finding the cause of the problem.
During the analysis I noticed that vertical text with an underscore is recognized correctly when useLineReturns is false (sample attached).
In my function, I just left it out.
I'm using the textElements[].Left property to reorder the horizontal text.
Granted, I'm not a C# expert, but I have overridden the function as follows:
/// <summary>
/// Gets the cell's text.
/// </summary>
/// <param name="useLineReturns">Not used by this override.</param>
public override string GetText(bool useLineReturns)
{
    if (this.textElements.Count == 0)
    {
        return "";
    }

    StringBuilder sb = new StringBuilder();

    // Treat the cell as horizontal text when it is wider than it is tall.
    bool textAlignmentHorizontal = this.BoundingBox.Width > this.BoundingBox.Height;
    if (textAlignmentHorizontal)
    {
        // Reorder the chunks left to right. Note: this sorts textElements in place.
        this.textElements.Sort((a, b) => a.Left.CompareTo(b.Left));
    }

    foreach (TextChunk tc in this.textElements)
    {
        sb.Append(tc.GetText());
    }

    return sb.ToString().Trim();
}
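Note that the override above still sorts textElements in place, which is the side effect mentioned earlier in the thread. A non-mutating variant could look roughly like the sketch below. ChunkStub and CellTextSketch are hypothetical stand-ins rather than tabula-sharp types, and the width and height are passed in directly instead of being read from BoundingBox.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a text chunk; not the tabula-sharp TextChunk type.
public sealed class ChunkStub
{
    public ChunkStub(string text, double left) { Text = text; Left = left; }
    public string Text { get; }
    public double Left { get; }
}

public static class CellTextSketch
{
    // Joins the chunk texts, ordering them left to right only for "horizontal" cells.
    public static string JoinText(IReadOnlyList<ChunkStub> chunks, double width, double height)
    {
        if (chunks.Count == 0)
        {
            return string.Empty;
        }

        // Same heuristic as the override: wider than tall means horizontal text.
        bool horizontal = width > height;

        // OrderBy is a stable sort and returns a new sequence; chunks itself is not reordered.
        IEnumerable<ChunkStub> ordered = horizontal
            ? chunks.OrderBy(c => c.Left)
            : chunks.AsEnumerable();

        return string.Concat(ordered.Select(c => c.Text)).Trim();
    }
}
```

Because the stored list is never reordered, other code that reads the same elements later is not affected by the ordering used for the text output.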
I hope my solution doesn't affect other parts of the program, but for me it works.
Thanks again for your help!
Hi @BobLd,
Many thanks also from me for the helpful library!
I used the SpreadsheetExtractionAlgorithm to try to extract tables from a PDF file (sample attached).
The names from the table are intended to represent variable names for a computer program.
Unfortunately, the result is shown without the underscore:
I get "Content" instead of "C_ontent".
After
c.SetTextElements(TextElement.MergeWords(page.GetText(c.BoundingBox)));
was called in SpreadsheetExtractionAlgorithm.cs,
this.textElements = textElements;
is executed in
public void SetTextElements(List<T> textElements){}
in RectangularTextContainer.cs.
I found out that the order is exchanged when assigning values in
this.textElements = textElements;
List of values before assignment:
List of values after assignment:
Later the back part will be cut off, I think.
Unfortunately I haven't found a solution. I would be very grateful for any help.
Example PDF.pdf