djbrown / hbscorez

HbScorez is a web application that processes handball game reports from various handball associations, districts, and leagues. It analyzes player scores and displays statistics and rankings.
https://hbscorez.de
MIT License

Evaluate PDF parsing libraries #127

Open djbrown opened 1 year ago

djbrown commented 1 year ago

Currently using https://github.com/chezou/tabula-py/tree/master

Problems:

Alternatives:

filgit commented 2 months ago

I have tried both camelot and tabula-py with the game reports. Both worked well in my cases. Can you explain the "problem" cases? Are they related to correct_data.py?

Why is needing Java an issue, and what are the corner cases you mention?

djbrown commented 1 month ago

@filgit the "problem" with Java is that it's an otherwise unnecessary technology/dependency in the system, though it doesn't have a big impact on the code base: https://github.com/djbrown/hbscorez/blob/00ee8a8b6b81a89798ae90f88e636e667219e2e7/README.md?plain=1#L51 https://github.com/djbrown/hbscorez/blob/00ee8a8b6b81a89798ae90f88e636e667219e2e7/Dockerfile#L8

As for the "corner cases", I don't remember exactly what they were, but I have a long list of "erroneous reports", e.g. ones where some players' names overflow the column/cell, like here. I guess my hope was that another library would handle them better (currently overflows are clipped off).

Performance might be another reason, but it would have to be measured first to really count as an argument.
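To make the performance point measurable, a minimal timing sketch could look like the following. The helper names `parse_with_tabula`, `parse_with_camelot`, and `benchmark` are hypothetical and not part of hbscorez; calling `tabula.read_pdf` and `camelot.read_pdf` with `pages="all"` is an assumption about how the reports would be read, not how the project currently does it.

```python
import time
from statistics import median
from typing import Callable, Dict, List


def parse_with_tabula(path: str) -> list:
    # Imported lazily so the harness itself runs without tabula-py (and Java) installed.
    import tabula
    return tabula.read_pdf(path, pages="all")


def parse_with_camelot(path: str) -> list:
    # Imported lazily for the same reason; each camelot table exposes a DataFrame as .df.
    import camelot
    return [table.df for table in camelot.read_pdf(path, pages="all")]


def benchmark(
    parsers: Dict[str, Callable[[str], object]],
    paths: List[str],
    repeats: int = 3,
) -> Dict[str, float]:
    """Return the median wall-clock seconds per parse call for each parser."""
    results = {}
    for name, parse in parsers.items():
        timings = []
        for path in paths:
            for _ in range(repeats):
                start = time.perf_counter()
                parse(path)
                timings.append(time.perf_counter() - start)
        results[name] = median(timings)
    return results
```

With a few locally saved report PDFs, `benchmark({"tabula-py": parse_with_tabula, "camelot": parse_with_camelot}, paths)` would give one median per library. The median is used rather than the mean so that a single slow first call (e.g. JVM startup) does not dominate the result.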

But this issue didn't really have priority, otherwise it wouldn't be celebrating its birthday soon 😅

filgit commented 1 month ago

Okay, I see. With camelot you will have ghostscript as an additional dependency. And the error-prone report you linked is a great example of where the concept reaches its limits. I wonder how the reports are created, btw.