SuleyNL / Extractable

Table extraction library
MIT License

ParseError: Fails to extract tables #237

Open littlebuddha16 opened 2 days ago

littlebuddha16 commented 2 days ago

The library parses the PDF, detects the tables, and recognizes their structure, but then exits with the following error:

{
    "name": "ParseError",
    "message": "not well-formed (invalid token): line 1, column 46 (<string>)",
    "stack": "Traceback (most recent call last):

  File ~/Documents/GitHub/pdf_two_table/extract_table/lib/python3.10/site-packages/IPython/core/interactiveshell.py:3577 in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  Cell In[3], line 7
    ex.extract(input_file=input_file, output_dir=output_dir, mode=ex.Mode.PRESENTATION)

  File ~/Documents/GitHub/pdf_two_table/extract_table/lib/python3.10/site-packages/extractable/Extractor.py:13 in extract
    return extract_using_TATR(input_file, output_dir, output_filetype, mode)

  File ~/Documents/GitHub/pdf_two_table/extract_table/lib/python3.10/site-packages/extractable/Extractor.py:34 in extract_using_TATR
    pipeline(data_object)

  File ~/Documents/GitHub/pdf_two_table/extract_table/lib/python3.10/site-packages/toolz/functoolz.py:489 in __call__
    ret = f(ret)

  File ~/Documents/GitHub/pdf_two_table/extract_table/lib/python3.10/site-packages/toolz/functoolz.py:489 in __call__
    ret = f(ret)

  File ~/Documents/GitHub/pdf_two_table/extract_table/lib/python3.10/site-packages/extractable/TextExtractor.py:99 in process
    table_xml = ET.fromstring(table.to_xml_with_coords())

  File ~/Documents/GitHub/pdf_two_table/extract_table/lib/python3.10/site-packages/extractable/Datatypes/Table.py:37 in to_xml_with_coords
    row_element = ET.fromstring(row.to_xml_with_coords())

  File ~/Documents/GitHub/pdf_two_table/extract_table/lib/python3.10/site-packages/extractable/Datatypes/Row.py:40 in to_xml_with_coords
    cell_element = ET.fromstring(cell.to_xml_with_coords())

  File /Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/xml/etree/ElementTree.py:1342 in XML
    parser.feed(text)

  File <string>
ParseError: not well-formed (invalid token): line 1, column 46
"
}
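For anyone hitting the same error: the traceback ends in `ET.fromstring(cell.to_xml_with_coords())`, and `not well-formed (invalid token)` is what the parser typically reports when text is placed into markup without escaping (a bare `&` or `<`) or contains a character XML 1.0 forbids, such as a form feed carried over from the PDF. The sketch below reproduces that failure and shows one possible workaround. It is not Extractable's actual code: `naive_cell_xml`, `safe_cell_xml`, `try_parse`, the `<cell>` markup, and the sample text are all hypothetical stand-ins, and the real cause in this PDF may differ.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Anything outside the XML 1.0 character range (e.g. form feeds from PDF text).
_INVALID_XML_CHARS = re.compile(
    "[^\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def naive_cell_xml(text, x1, y1, x2, y2):
    # Hypothetical stand-in for a cell serializer: raw text goes straight into markup.
    return f'<cell x1="{x1}" y1="{y1}" x2="{x2}" y2="{y2}">{text}</cell>'


def safe_cell_xml(text, x1, y1, x2, y2):
    # Drop characters XML 1.0 forbids, then escape &, < and >.
    cleaned = escape(_INVALID_XML_CHARS.sub("", text))
    return f'<cell x1="{x1}" y1="{y1}" x2="{x2}" y2="{y2}">{cleaned}</cell>'


def try_parse(xml_string):
    # Return the parsed element, or None after reporting where parsing failed.
    try:
        return ET.fromstring(xml_string)
    except ET.ParseError as err:
        line, column = err.position
        print(f"ParseError at line {line}, column {column}: {xml_string!r}")
        return None


cell_text = "R&D < 5%\x0c"  # made-up cell content with XML-hostile characters
broken = try_parse(naive_cell_xml(cell_text, 10, 20, 110, 40))
fixed = try_parse(safe_cell_xml(cell_text, 10, 20, 110, 40))
print(broken, fixed.text)
```

Printing the string that failed to parse, as `try_parse` does, should show which character sits near the reported column (46 in the report above) and therefore which cell text triggers the error.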