Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/infer_table_structure on docker with M1 chip #2089

Open snova-amitk opened 7 months ago

snova-amitk commented 7 months ago

Describe the bug
The partition_pdf function fails with a segmentation fault when infer_table_structure=True.

To Reproduce
Follow the docker instructions here: https://unstructured-io.github.io/unstructured/installation/docker.html

from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(filename="example-docs/layout-parser-paper-with-Table.pdf", infer_table_structure=True)

Expected behavior
No segmentation fault.

Screenshots
Downloading yolox_l0.05.onnx: 100%| 217M/217M [00:14<00:00, 14.7MB/s]
Downloading (…)lve/main/config.json: 100%| 1.47k/1.47k [00:00<00:00, 2.07MB/s]
Downloading model.safetensors: 100%| 115M/115M [00:07<00:00, 14.9MB/s]
Downloading model.safetensors: 100%| 46.8M/46.8M [00:03<00:00, 15.1MB/s]
Some weights of the model checkpoint at microsoft/table-transformer-structure-recognition were not used when initializing TableTransformerForObjectDetection: ['model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']

Environment Info
Using docker instructions.

badGarnet commented 7 months ago

Hi @snova-amitk, thanks for reporting this bug; we are tracking it. In the meantime, it would be very helpful if you could provide details of your hardware setup (what kind of CPUs you have and which instruction sets they support).

snova-amitk commented 7 months ago

Model Name: MacBook Pro
Model Identifier: MacBookPro18,3
Chip: Apple M1 Pro
arm64 instruction set

badGarnet commented 7 months ago

@snova-amitk Unfortunately, we don't support Apple ARM chips with the Docker image at the moment. The combination of a different CPU architecture and OS makes the model binary incompatible with the CPU's instruction set. This shouldn't be a problem on an x86 CPU. We are tracking this problem, but there is no plan for immediate resolution.
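Until ARM is supported, one possible workaround is to check the reported CPU architecture before requesting table structure inference. The sketch below is only an illustration: the helper names `is_arm` and `table_structure_flag` and the set of machine strings are assumptions, not part of unstructured, and a container running under x86 emulation may report `x86_64` even on an Apple chip.

```python
import platform
from typing import Optional

# Machine strings commonly reported for 64-bit ARM CPUs.
# Assumed list for illustration; it is not exhaustive.
ARM_MACHINES = {"arm64", "aarch64"}


def is_arm(machine: Optional[str] = None) -> bool:
    """Return True if the given (or current) machine string names a 64-bit ARM CPU."""
    name = (machine or platform.machine()).lower()
    return name in ARM_MACHINES


def table_structure_flag(requested: bool, machine: Optional[str] = None) -> bool:
    """Hypothetical helper: turn the flag off on ARM, otherwise honor the request."""
    return requested and not is_arm(machine)


# Possible usage with the call from this issue (not executed here):
# elements = partition_pdf(
#     filename="example-docs/layout-parser-paper-with-Table.pdf",
#     infer_table_structure=table_structure_flag(True),
# )
```

This trades table structure output for a run that does not crash, so it only makes sense where the rest of the partitioning result is still useful.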

huvers commented 6 months ago

I'm seeing the same on ARM for Ubuntu with a pip install (no docker).

My hardware is an NVIDIA IGX Orin Devkit w/ A6000 dGPU.

CROmartin commented 3 months ago

On a Mac Pro with an M1 chip, I have a similar issue when infer_table_structure is set to True. I am running a Django REST API server, and if infer_table_structure is set to True, it kills my whole server (locally) without throwing any error, so I am not sure what exactly the problem is. My guess is an incompatibility with the M1 chip, and with the ARM architecture in general. I have installed everything listed at https://unstructured-io.github.io/unstructured/installation/full_installation.html, and I am not using Docker in this case.

partition_pdf works fine without infer_table_structure set to True.

@badGarnet Why isn't this incompatibility with the ARM architecture mentioned in the docs? It would save so much time not having to doubt everything!

@badGarnet Can you confirm that I can't use infer_table_structure on ARM architecture?

I can confirm that everything worked fine after switching to an x86 device.
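For anyone whose server dies silently like this, a rough way to see what happened is to run the partitioning call in a child Python process, so a native crash ends only the child and the parent can report it. This is a minimal sketch using only the standard library; `run_isolated` and `describe` are hypothetical helper names, and the negative return code convention applies to POSIX systems.

```python
import signal
import subprocess
import sys


def run_isolated(code: str, timeout: float = 600.0) -> subprocess.CompletedProcess:
    """Run Python source in a child interpreter so a native crash cannot kill the caller.

    The child is started with -X faulthandler, so a segmentation fault
    writes a Python traceback to its stderr before the process dies.
    """
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )


def describe(result: subprocess.CompletedProcess) -> str:
    """Summarize how the child process ended."""
    if result.returncode == 0:
        return "ok"
    if result.returncode < 0:
        # On POSIX, a negative return code means the child was killed by that signal.
        return f"killed by {signal.Signals(-result.returncode).name}"
    return f"exited with status {result.returncode}"


# Possible usage with the call from this issue (not executed here):
# code = (
#     "from unstructured.partition.pdf import partition_pdf\n"
#     "partition_pdf(filename='example-docs/layout-parser-paper-with-Table.pdf',"
#     " infer_table_structure=True)\n"
# )
# result = run_isolated(code)
# print(describe(result), result.stderr)
```

The child's stderr then shows where the crash occurred, which is more useful in a bug report than a server that simply disappears.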

a1ix2 commented 4 weeks ago

The same happens on a MacBook Air M2 (macOS Ventura 13.6.4). Everything was installed through pip in a separate environment. The Python kernel crashes after a while when infer_table_structure=True; otherwise everything works fine.