Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
9.22k stars, 764 forks

bug/Extensions .mdx and .markdown not supported #3670

Open butasebi opened 1 month ago

butasebi commented 1 month ago

Describe the bug
Files with the extensions .mdx and .markdown are detected as FileType.UNK when passed to unstructured.file_utils.filetype.detect_filetype.

To Reproduce

    from unstructured.file_utils.filetype import detect_filetype

    print(detect_filetype("file.mdx"))
    print(detect_filetype("file.markdown"))

Expected behavior
These extensions should either map to dedicated types, FileType.MDX and FileType.MARKDOWN respectively (the way XLS and XLSX are distinct types), or at least both be detected as FileType.MD.
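
Until the library handles this, a caller-side fallback along these lines might work. This is only a minimal sketch, not the library's fix: the helper name detect_with_markdown_fallback and the MARKDOWN_ALIASES set are hypothetical, and the detector plus the "unknown" and "markdown" values are passed in as arguments so the snippet does not assume anything about the library's internals.

```python
from pathlib import PurePath
from typing import Callable, TypeVar

T = TypeVar("T")

# Hypothetical alias set: extensions we would like treated the same as ".md".
MARKDOWN_ALIASES = frozenset({".mdx", ".markdown"})


def detect_with_markdown_fallback(
    filename: str,
    detect: Callable[[str], T],
    unknown: T,
    markdown: T,
) -> T:
    """Run `detect`; if it reports `unknown` for a Markdown alias, return `markdown`."""
    result = detect(filename)
    # Compare the suffix case-insensitively so "NOTES.MDX" is also covered.
    suffix = PurePath(filename).suffix.lower()
    if result == unknown and suffix in MARKDOWN_ALIASES:
        return markdown
    # Any other result (including a genuine unknown) is passed through untouched.
    return result
```

With the library this would presumably be called as detect_with_markdown_fallback("file.mdx", detect_filetype, FileType.UNK, FileType.MD), assuming FileType can be imported alongside detect_filetype in the installed version. The fallback only overrides an unknown result, so it should become a no-op once the extensions are supported upstream.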