Describe the bug
Fenced HTML and XML code blocks in Markdown are not parsed properly.
Results:
HTML Example
```html
Hello, World!
This is a simple HTML example.
```
XML Example
xml <note> <to>Tove</to> <from>Jani</from> <heading>Reminder</heading> <body>Don't forget me this weekend!</body> </note>
```xml
```
```xml
```
HTML tags are not preserved.
The XML code is malformed. Blank lines inside a block may cause the code-block context to be lost.
A <?xml version='1.0' encoding='UTF-8'?> line crashes the parser with the following traceback:
Traceback (most recent call last):
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/test.py", line 14, in <module>
    elems = partition_html(
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/documents/elements.py", line 605, in wrapper
    elements = func(*args, **kwargs)
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 706, in wrapper
    elements = func(*args, **kwargs)
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 662, in wrapper
    elements = func(*args, **kwargs)
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/partition.py", line 103, in partition_html
    elements = list(
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/lang.py", line 475, in apply_lang_metadata
    elements = list(elements)
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/partition.py", line 222, in iter_elements
    yield from cls(opts)._iter_elements()
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/partition.py", line 229, in _iter_elements
    for e in self._main.iter_elements():
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/parser.py", line 361, in iter_elements
    yield from self._element_from_text_or_tail(block_item.tail or "", q)
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/parser.py", line 377, in _element_from_text_or_tail
    for node in self._iter_text_segments(text, q):
  File "/Users/joseph.lee/Documents/repos_ncb/test_rag/.venv/lib/python3.10/site-packages/unstructured/partition/html/parser.py", line 421, in _iter_text_segments
    while q and q[0].is_phrasing:
AttributeError: 'lxml.etree._ProcessingInstruction' object has no attribute 'is_phrasing'
To Reproduce
## HTML Example
```html
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Sample HTML</title>
</head>
<body>
<h1>Hello, World!</h1>
<p>This is a simple HTML example.</p>
</body>
</html>
```
## XML Example
```xml
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
```
```xml
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
```
```xml
<?xml version='1.0' encoding='UTF-8'?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
```
Expected behavior
The content of code blocks should be preserved as is.
Environment Info
0.15.7
Additional context
Since Markdown is first converted to HTML, adding `extensions=['fenced_code']` to the Markdown parser call solves the issue. A better option would be to make the extensions list a configurable parameter. https://github.com/Unstructured-IO/unstructured/blob/f440eb476cf75d6109e8a3719cadf893529dcef8/unstructured/partition/md.py#L109
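
The suggestion above can be illustrated with a minimal sketch that uses the Python-Markdown package directly. It is not the code in `partition/md.py`: `markdown_to_html` and `SAMPLE_MD` are hypothetical names introduced here, and the helper only shows what a configurable extensions list might look like.

```python
from typing import Optional, Sequence

import markdown  # Python-Markdown package (third-party)

# Build the fence programmatically so the sample does not contain a literal
# triple-backtick line.
FENCE = "`" * 3

# Hypothetical sample mirroring the XML block from "To Reproduce".
SAMPLE_MD = "\n".join(
    [
        "## XML Example",
        "",
        FENCE + "xml",
        "<?xml version='1.0' encoding='UTF-8'?>",
        "<note>",
        "<to>Tove</to>",
        "</note>",
        FENCE,
    ]
)


def markdown_to_html(text: str, extensions: Optional[Sequence[str]] = None) -> str:
    """Hypothetical helper: convert Markdown to HTML with a caller-supplied
    list of Python-Markdown extensions (none by default)."""
    return markdown.markdown(text, extensions=list(extensions or []))


if __name__ == "__main__":
    # With fenced_code enabled, the block becomes <pre><code> and its markup
    # is HTML-escaped, so the XML declaration is ordinary text in the output.
    print(markdown_to_html(SAMPLE_MD, extensions=["fenced_code"]))
```

With `fenced_code` enabled, the fenced content is escaped before it reaches the HTML partitioner, so the tags and the XML declaration arrive as plain text rather than as markup to be parsed.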