atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

Update stream.py #316

Open CartierPierre opened 5 years ago

CartierPierre commented 5 years ago

Make stream row grouping recursive, not only 2 by 2.

vinayak-mehta commented 5 years ago

@CartierPierre The change you made is failing a test. Can you look into it? And is there an issue associated with this PR?

CartierPierre commented 5 years ago

I just changed a tab ... I'm not a GitHub expert yet, so I don't know how to make the test pass.

vinayak-mehta commented 5 years ago

It's easy :) Most of the tests in here check whether the actual table extracted from a PDF is the same as the expected table. Your change fails the test_stream_columns test because the output after the change doesn't match the expected output.

The fix for this would be to check if the actual output that you're getting from your change, on your local machine, is in fact better/more correct than the current expected output. And then change the expected output in the file above, if that's the case.
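
As a rough illustration of that workflow, here is a minimal sketch of comparing an actual table against an expected one and finding the first row where they differ. Everything in it is hypothetical: `first_mismatch` and the sample tables are made up for illustration and are not part of camelot's test suite, which compares tables extracted from real PDFs.

```python
def first_mismatch(actual, expected):
    """Return the index of the first differing row, or None if the tables match."""
    for i in range(max(len(actual), len(expected))):
        # A missing row on either side counts as a difference.
        a = actual[i] if i < len(actual) else None
        e = expected[i] if i < len(expected) else None
        if a != e:
            return i
    return None


# Hypothetical expected output stored with the tests.
expected = [["Name", "Qty"], ["Apples", "3"], ["Pears", "5"]]

# Hypothetical outputs before and after a change to the row grouping.
actual_unchanged = [["Name", "Qty"], ["Apples", "3"], ["Pears", "5"]]
actual_changed = [["Name", "Qty"], ["Apples\nPears", "3\n5"]]

print(first_mismatch(actual_unchanged, expected))
print(first_mismatch(actual_changed, expected))
```

If the row reported as different looks more correct in the new output than in the stored one, the expected data is what needs updating; otherwise the change itself needs another look.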

vinayak-mehta commented 5 years ago

I was able to find the failing test by clicking on the Details link on the Travis check.

vinayak-mehta commented 5 years ago

Also, what do you mean by making stream row grouping recursive and not 2 by 2? Is there an issue associated with this?

CartierPierre commented 5 years ago

Before the fix, if the file looks like:

Line1
Line2
Line3

Line4

The output was:

Line1\nLine2
Line3
Line4

But the expected output was:

Line1\nLine2\nLine3
Line4
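
The difference described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code in stream.py: it assumes rows are grouped by comparing y-coordinates within a tolerance, and the function names, coordinates, and tolerance value are all hypothetical. `group_anchored` compares each line with the first line of the current group, while `group_chained` compares each line with the line just before it, so closely spaced lines keep joining the same group.

```python
def group_anchored(lines, tol=2.0):
    """Group lines whose y is within tol of the FIRST line of the current group."""
    groups = []
    anchor_y = None
    for text, y in lines:
        if anchor_y is None or abs(anchor_y - y) > tol:
            groups.append([text])
            anchor_y = y  # only updated when a new group starts
        else:
            groups[-1].append(text)
    return ["\n".join(g) for g in groups]


def group_chained(lines, tol=2.0):
    """Group lines whose y is within tol of the PREVIOUS line."""
    groups = []
    prev_y = None
    for text, y in lines:
        if prev_y is None or abs(prev_y - y) > tol:
            groups.append([text])
        else:
            groups[-1].append(text)
        prev_y = y  # updated on every line, so the group can keep growing
    return ["\n".join(g) for g in groups]


# Hypothetical (text, y) pairs: three tightly spaced lines, then a gap.
lines = [("Line1", 100.0), ("Line2", 98.5), ("Line3", 97.0), ("Line4", 80.0)]

print(group_anchored(lines))  # ['Line1\nLine2', 'Line3', 'Line4']
print(group_chained(lines))   # ['Line1\nLine2\nLine3', 'Line4']
```

With these made-up coordinates, Line3 is within the tolerance of Line2 but not of Line1, which is why only the chained version keeps all three lines together.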

vinayak-mehta commented 5 years ago

Cool, I'll look into it and see if the test needs fixing.