atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

bidirectional language #352

Closed zamankul74 closed 4 years ago

zamankul74 commented 5 years ago

When I run the example code on a PDF file that contains a bidirectional language (such as Arabic or Hebrew), camelot extracts the tables but reverses the string contents in the CSV. For example, a row that should read "help me",,,"I am in trouble" comes out as "em pleh",,,"elbuort ni ma I".

Real example of the CSV result:

    "םיאנת","","","","","ףיעס"
    "תישיא המאתהב תופורת","","","","","חוטיב

I made some small changes to the provided code example. Here is the changed code:

    # -*- coding: utf-8 -*-
    import camelot
    tables = camelot.read_pdf('my.pdf', pages='all', encoding='utf-8', multiple_tables=True)
    tables[0].parsing_report
    {'accuracy': 99.02, 'whitespace': 12.24, 'order': 1, 'page': 1}
    tables[0].to_csv('my.csv')

When I use the regular tabula tool, the strings are NOT reversed, so this appears to be a camelot-only issue. What is the solution? Or do I need to change some parameters to make it work as expected?

vinayak-mehta commented 4 years ago

@zamankul74 Can you please upload the PDF in which you're facing this issue?
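
Until the PDF is available, one possible workaround (not confirmed by the maintainers in this thread) is to post-process the extracted cells. The sketch below assumes the cells come out in visual order and that reversing a cell is enough to restore logical order. That assumption only holds for cells that are purely right-to-left; cells mixing Hebrew or Arabic with digits or Latin text would need a full bidirectional algorithm. The helper names `has_rtl`, `to_logical_order`, and `fix_row` are hypothetical and not part of camelot.

```python
import unicodedata

# Unicode bidi classes for right-to-left letters: "R" (e.g. Hebrew), "AL" (Arabic).
RTL_CLASSES = {"R", "AL"}


def has_rtl(text):
    """Return True if the text contains at least one right-to-left character."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


def to_logical_order(cell):
    """Naive fix (assumption): reverse the whole cell if it contains RTL text.

    Cells without RTL characters are returned unchanged.
    """
    return cell[::-1] if has_rtl(cell) else cell


def fix_row(row):
    """Apply the naive fix to every cell of one table row."""
    return [to_logical_order(cell) for cell in row]


# First cell of the real example above, written with escapes to avoid
# display ambiguity, followed by an empty cell and a left-to-right cell.
row = ["\u05dd\u05d9\u05d0\u05e0\u05ea", "", "help me"]
fixed = fix_row(row)
print(fixed)
```

Assuming the extracted table is exposed as a pandas DataFrame through `tables[0].df`, the same `to_logical_order` function could be applied to each cell before calling `to_csv`.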