Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/Wrong parsing of html, xml code blocks in markdown #3578

Open cgjosephlee opened 2 months ago

cgjosephlee commented 2 months ago

Describe the bug

HTML and XML code blocks in Markdown are not parsed properly.

Results:

HTML Example
```html
Hello, World!
This is a simple HTML example.
```
XML Example
xml <note> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>
```xml
```
```xml
```

To Reproduce

## HTML Example

```html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Sample HTML</title>
</head>
<body>
    <h1>Hello, World!</h1>
    <p>This is a simple HTML example.</p>
</body>
</html>
```

## XML Example

```xml
<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>
```

```xml

<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>

```

```xml
<?xml version='1.0' encoding='UTF-8'?>
<note>
    <to>Tove</to>
    <from>Jani</from>
    <heading>Reminder</heading>
    <body>Don't forget me this weekend!</body>
</note>
```

Expected behavior

The content of code blocks should be preserved exactly as written.

Environment Info

0.15.7

Additional context

Since Markdown is first converted to HTML, adding `extensions=['fenced_code']` to the Markdown parser call solves the issue. A better approach would be to make the extensions list a configurable parameter. The relevant call is here: https://github.com/Unstructured-IO/unstructured/blob/f440eb476cf75d6109e8a3719cadf893529dcef8/unstructured/partition/md.py#L109
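
To illustrate the suggestion, here is a minimal sketch using the Python-Markdown package directly (assumed to be installed as `markdown`). It is not the code in `md.py`: `md_to_html` and its `extensions` parameter are hypothetical names, used only to show what a configurable extensions list might look like and how `fenced_code` changes the output.

```python
import markdown  # third-party Python-Markdown package (assumed installed)

# Build the fence at runtime so this snippet can itself live inside a fenced block.
FENCE = "`" * 3

SAMPLE = "\n".join([
    "## XML Example",
    "",
    FENCE + "xml",
    "<note>",
    "    <to>Tove</to>",
    "</note>",
    FENCE,
])


def md_to_html(text, extensions=("fenced_code",)):
    """Hypothetical helper: convert Markdown to HTML with a caller-supplied extension list."""
    return markdown.markdown(text, extensions=list(extensions))


# No extensions: the fence markers are not recognised as a code block.
html_default = md_to_html(SAMPLE, extensions=())

# fenced_code enabled: the block becomes <pre><code> with its content HTML-escaped.
html_fenced = md_to_html(SAMPLE)

print(html_fenced)
```

With `fenced_code` enabled, the XML ends up escaped inside a `<pre><code>` element, so a later HTML parsing step sees it as text rather than as markup. Exposing the list as a parameter would also let callers opt in to other extensions without another code change.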

MthwRobinson commented 2 months ago

Hi @cgjosephlee - Thanks for the report and the detailed reproduction steps. We'll take a look as soon as we're able. cc @scanny