Azure / azure-sdk-for-js

This repository is for active development of the Azure SDK for JavaScript (NodeJS & Browser). For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/javascript/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-js.
MIT License

DocumentAI: Malformed tables in markdown outputs #29071

Open zthomas opened 5 months ago

zthomas commented 5 months ago

**Describe the bug**

With queryParameters: { outputContentFormat: "markdown" }, more complex tables are very prone to formatting errors in the rendered output. The result is broken markdown tables or, in some cases, misaligned columns.

I suggest rendering tables as HTML by default instead of markdown table syntax; markdown renderers still display HTML tables. Services like Unstructured default to rendering tables in HTML for accuracy.

**To Reproduce**

Steps to reproduce the behavior: CEC sample.pdf

**Expected behavior**

The tables should be rendered correctly in markdown. The API does output the table correctly, but the markdown rendering is badly formatted, which makes the markdown output close to useless for this scenario. More and more people are using DocumentAI for RAG ingestion, and having the output in proper markdown is very useful.
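As a possible client-side workaround until the markdown output improves, the structured table data can be rendered to HTML directly. The sketch below is not part of the SDK: `tableToHtml` and `escapeHtml` are hypothetical helpers, and the `Table` / `TableCell` interfaces are simplified assumptions modeled on the `rowIndex`, `columnIndex`, `rowSpan`, `columnSpan`, `kind`, and `content` fields of the table cells in the analyze result. It assumes each spanning cell appears once, at its top-left position.

```typescript
// Simplified shapes (assumed) mirroring the table fields of the analyze result.
interface TableCell {
  rowIndex: number;
  columnIndex: number;
  rowSpan?: number;
  columnSpan?: number;
  kind?: string; // e.g. "columnHeader" or "rowHeader"; absent for plain content
  content: string;
}

interface Table {
  rowCount: number;
  columnCount: number;
  cells: TableCell[];
}

// Escape characters that would otherwise be interpreted as HTML.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: render one table as HTML, preserving row and column spans.
function tableToHtml(table: Table): string {
  // Group cells by the row in which they start.
  const rows: TableCell[][] = Array.from({ length: table.rowCount }, () => []);
  for (const cell of table.cells) {
    const row = rows[cell.rowIndex];
    if (row) {
      row.push(cell);
    }
  }

  const htmlRows = rows.map((cells) => {
    const htmlCells = [...cells]
      .sort((a, b) => a.columnIndex - b.columnIndex)
      .map((cell) => {
        const isHeader = cell.kind === "columnHeader" || cell.kind === "rowHeader";
        const tag = isHeader ? "th" : "td";
        // Emit span attributes only when the cell actually spans more than one slot.
        const rowSpan = cell.rowSpan && cell.rowSpan > 1 ? ` rowspan="${cell.rowSpan}"` : "";
        const colSpan = cell.columnSpan && cell.columnSpan > 1 ? ` colspan="${cell.columnSpan}"` : "";
        return `<${tag}${rowSpan}${colSpan}>${escapeHtml(cell.content)}</${tag}>`;
      });
    return `<tr>${htmlCells.join("")}</tr>`;
  });

  return `<table>${htmlRows.join("")}</table>`;
}
```

Because HTML supports `colspan` and `rowspan`, merged cells survive the conversion, which markdown pipe tables cannot express. The resulting HTML string could then be substituted for the corresponding table in the markdown content before ingestion.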

**Screenshots**

Example of a broken table (Markdown doesn't support colspan):

Markdown render:

Screenshot 2024-03-26 at 11 31 14 AM

Table render in HTML:

Screenshot 2024-03-26 at 11 31 01 AM

Example where the markdown table causes misaligned columns:

Markdown render:

Screenshot 2024-03-26 at 11 30 04 AM

Table render in HTML:

Screenshot 2024-03-26 at 11 30 51 AM

github-actions[bot] commented 5 months ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @klaaslanghout.