Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

CCT `measure-table-structure-accuracy-command` doesn't drop index #2962

Open mallorih opened 4 months ago

mallorih commented 4 months ago

Describe the bug

The CCT command `measure-table-structure-accuracy-command` doesn't drop the extra index when it finds no table to process (i.e. the documents have the wrong format).

To Reproduce

PYTHONPATH=. python unstructured/ingest/evaluate.py measure-table-structure-accuracy-command --output_dir ground_truth_text_as_html --source_dir predicted_text_as_html --output_dir output_metrics

Expected behavior

[Screenshot 2024-05-02 at 3 50 36 PM]

Screenshots

Error

  File "/Users/mallori/unstructured/unstructured/ingest/evaluate.py", line 276, in <module>
    main()
  File "/Users/mallori/opt/anaconda3/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/Users/mallori/opt/anaconda3/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/Users/mallori/opt/anaconda3/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/mallori/opt/anaconda3/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/mallori/opt/anaconda3/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/Users/mallori/unstructured/unstructured/ingest/evaluate.py", line 236, in measure_table_structure_accuracy_command
    return measure_table_structure_accuracy(
  File "/Users/mallori/unstructured/unstructured/metrics/evaluate.py", line 375, in measure_table_structure_accuracy
    agg_df.columns = agg_headers
  File "/Users/mallori/opt/anaconda3/lib/python3.9/site-packages/pandas/core/generic.py", line 5915, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 69, in pandas._libs.properties.AxisProperty.__set__
  File "/Users/mallori/opt/anaconda3/lib/python3.9/site-packages/pandas/core/generic.py", line 823, in _set_axis
    self._mgr.set_axis(axis, labels)
  File "/Users/mallori/opt/anaconda3/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 230, in set_axis
    self._validate_set_axis(axis, new_labels)
  File "/Users/mallori/opt/anaconda3/lib/python3.9/site-packages/pandas/core/internals/base.py", line 70, in _validate_set_axis
    raise ValueError(
ValueError: Length mismatch: Expected axis has 6 elements, new values have 5 elements
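
The mismatch above can be sketched in isolation. This is a minimal illustration, not the code in `unstructured/metrics/evaluate.py`: it assumes the no-table fallback builds rows that already carry the metric name and then calls `reset_index()`, which turns the default index into an extra column. The header names, metric names, and the `try_assign` helper are hypothetical.

```python
import pandas as pd

# Hypothetical header and metric names, chosen only for illustration.
agg_headers = ["metric", "average", "sample_sd", "population_sd", "count"]
metrics = ["total_tables", "table_level_acc"]

# Assumed fallback rows for the "no tables found" case:
# the metric name is already the first column of each row.
rows = [[m, None, None, None, 0] for m in metrics]

# reset_index() moves the default RangeIndex into a new "index" column,
# so the frame ends up with 6 columns instead of 5.
with_index = pd.DataFrame(rows).reset_index()

# reset_index(drop=True) discards the old index, leaving 5 columns.
without_index = pd.DataFrame(rows).reset_index(drop=True)


def try_assign(df, headers):
    """Assign column labels; return the error message if pandas rejects them."""
    try:
        df.columns = headers
        return None
    except ValueError as exc:
        return str(exc)


error = try_assign(with_index, agg_headers)
ok = try_assign(without_index, agg_headers)

print(with_index.shape[1], without_index.shape[1])
print(error)
```

If the fallback path really does work this way, dropping the index (or skipping `reset_index()` entirely in that branch) would keep the column count aligned with the five headers.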

Environment Info

Python version:  3.9.13
unstructured version:  0.13.3
unstructured-inference version:  0.7.23
pytesseract version:  0.3.10
Torch version:  2.1.0
Detectron2 version:  0.6
PaddleOCR is not installed
Libmagic version:  ==> libmagic: stable 5.45


mallorih commented 4 months ago

prediction_table_0_0.png.txt ground_truth_table_0_0.png.txt