atlanhq / camelot

Camelot: PDF Table Extraction for Humans
https://camelot-py.readthedocs.io

\n between every letter? #318

Closed QwertyCoolMT closed 5 years ago

QwertyCoolMT commented 5 years ago

Hey there, one of the PDFs I'm trying to read is getting a newline between every letter within a given cell. Code used to create this output:
camelot.read_pdf('ITEMS.pdf',pages='1',text_strip='\n', flavor='stream', table_areas=['20,530,600,150'],columns=['30,330,380,410,470,530'])

CartierPierre commented 5 years ago

text_strip='\n' ?

QwertyCoolMT commented 5 years ago

> text_strip='\n' ?

code used to create this output: camelot.read_pdf('ITEMS.pdf',pages='1',text_strip='\n', flavor='stream', table_areas=['20,530,600,150'],columns=['30,330,380,410,470,530'])

CartierPierre commented 5 years ago

Yes, so it's normal that your data is split by '\n'. If the question is why every letter is split, maybe you should try playing with the col_tol parameter?

QwertyCoolMT commented 5 years ago

Columns are decided by the columns list in this one. I will try playing with it anyway.

Also wondering whether it could have to do with pdfminer’s sentence/word detection

QwertyCoolMT commented 5 years ago

Ended up re-opening the file and stripping out the \n's myself, as I was unable to find a solution within the library that worked.
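For anyone landing here, the kind of post-processing described above might look roughly like the sketch below. It is not the code that was actually used in this thread. It assumes the extracted table is already available as a list of rows of strings, and both the helper name `strip_newlines` and the sample cell values are made up for illustration.

```python
def strip_newlines(rows):
    """Return a copy of rows with every newline removed from each cell."""
    return [[cell.replace("\n", "") for cell in row] for row in rows]


# Hypothetical cells that mimic the "newline between every letter" symptom.
rows = [
    ["I\nT\nE\nM", "Q\nT\nY"],
    ["W\ni\nd\ng\ne\nt", "1\n2"],
]

cleaned = strip_newlines(rows)
for row in cleaned:
    print(row)
```

If the table is held as a pandas DataFrame instead (for example the `df` attribute of a Camelot table), something like `df.replace('\n', '', regex=True)` should have the same effect. Note that this also removes newlines that were genuine line breaks inside a cell.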

vinayak-mehta commented 5 years ago

> Also wondering whether it could have to do with pdfminer’s sentence/word detection

Yes.

The strip_text kwarg will only strip characters from the start and end of a string.
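The difference between stripping only at the ends and removing a character everywhere can be seen with plain Python strings. This is only an illustration using the built-in `str.strip` and `str.replace`, not Camelot's own implementation, and the sample cell value is invented.

```python
# Hypothetical cell text with a newline between every letter.
cell = "\nI\nT\nE\nM\n"

# Stripping only touches the start and end of the string,
# so the newlines between the letters survive.
ends_only = cell.strip("\n")

# Replacing removes the character wherever it occurs.
everywhere = cell.replace("\n", "")

print(repr(ends_only))
print(repr(everywhere))
```

This would explain why a strip-style option has no visible effect on the cells in the original report: the unwanted newlines sit between the letters, not at the ends.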