Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0
7.37k stars 572 forks

feat/code-snippets-context #3271

Open asm0dey opened 1 week ago

asm0dey commented 1 week ago

Is your feature request related to a problem? Please describe.

In a way. I'm trying to build a RAG-based assistant for our documentation, which is code-heavy (we develop a JDK distribution). I want code snippets to appear only alongside the text that explains them; by themselves, they're useless.

Describe the solution you'd like

The ideal solution, I think, is for unstructured to recognize code snippets and offer settings that keep them in context. For example, a chunk containing code should always include at least one paragraph before and one paragraph after the code.
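To make the request concrete, here is a minimal sketch of the "one paragraph before, one after" rule in plain Python. It is not part of the unstructured API: the function name `group_with_context`, its `before`/`after` parameters, and the `("text" | "code", content)` block representation are all assumptions made for illustration, and the document is assumed to have been split into such blocks by an earlier step.

```python
from typing import List, Tuple

# Hypothetical block representation: (kind, content), kind is "text" or "code".
Block = Tuple[str, str]


def group_with_context(blocks: List[Block], before: int = 1, after: int = 1) -> List[str]:
    """Glue each code block to its neighbouring blocks so it never stands alone."""
    n = len(blocks)
    spans: List[List[int]] = []
    for i, (kind, _) in enumerate(blocks):
        if kind != "code":
            continue
        lo, hi = max(0, i - before), min(n - 1, i + after)
        if spans and lo <= spans[-1][1]:
            # Windows overlap (e.g. two snippets share a paragraph): merge them.
            spans[-1][1] = max(spans[-1][1], hi)
        else:
            spans.append([lo, hi])

    chunks: List[str] = []
    pos = 0
    for lo, hi in spans:
        # Text blocks outside any window stay as standalone chunks.
        chunks.extend(text for _, text in blocks[pos:lo])
        chunks.append("\n\n".join(text for _, text in blocks[lo:hi + 1]))
        pos = hi + 1
    chunks.extend(text for _, text in blocks[pos:])
    return chunks


example = [
    ("text", "Intro"),
    ("text", "Check the installed version:"),
    ("code", "java -version"),
    ("text", "The first line shows the vendor."),
    ("text", "Outro"),
]
chunks = group_with_context(example)
```

With the example input, the snippet ends up in a single chunk together with the paragraph before and the paragraph after it, while "Intro" and "Outro" remain separate chunks. A chunk built this way can exceed any character limit, which is the trade-off the request implies.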

Describe alternatives you've considered

I tried adjusting the max_characters parameter and several others, but I always end up with torn code blocks, stripped of context, somewhere in the output. Another alternative would probably be to split the document cleanly by titles, regardless of section size (a code block can obviously be large).
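The second alternative, splitting purely by titles with no size limit, can be sketched in a few lines of plain Python. This is an illustration only, not unstructured's own chunking: `split_by_headings` is a hypothetical helper, it handles only `#`-style headings, and its fence tracking is deliberately simplified (any line starting with three backticks or tildes toggles the "inside code" state).

```python
import re
from typing import List

HEADING = re.compile(r"^#{1,6}\s")        # "# Title" through "###### Title"
FENCE = re.compile(r"^\s*(`{3}|~{3})")    # simplified fence detection


def split_by_headings(markdown: str) -> List[str]:
    """Split Markdown into one section per heading, ignoring section size."""
    sections: List[str] = []
    current: List[str] = []
    in_fence = False
    for line in markdown.splitlines():
        if FENCE.match(line):
            in_fence = not in_fence
        elif not in_fence and HEADING.match(line) and current:
            # A real heading starts a new section; '#' lines inside a
            # code fence (shell comments, for instance) do not.
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]
```

Because code fences are tracked, a snippet is never torn in half and always stays with the prose of its own section. The cost is that sections with large code blocks yield large chunks.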

scanny commented 1 week ago

@asm0dey what is the source file format you're partitioning from? HTML? Markdown maybe?

I think the first prerequisite would be recognizing and distinguishing code blocks during partitioning, and that would depend on how they were identified in each particular document format.

asm0dey commented 1 week ago

Sorry, forgot to mention that it's markdown!
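For Markdown, the "recognize code blocks first" prerequisite mostly comes down to spotting fenced blocks. The following is a rough sketch of that step, assuming fenced (not indented) code blocks; `markdown_blocks` is a hypothetical helper, not something unstructured provides, and it follows the CommonMark fence rules only loosely.

```python
import re
from typing import List, Optional, Tuple

# Opening fence: up to 3 spaces, then 3+ backticks or tildes, then an optional info string.
FENCE_OPEN = re.compile(r"^ {0,3}(`{3,}|~{3,})(.*)$")


def markdown_blocks(markdown: str) -> List[Tuple[str, str]]:
    """Split Markdown into ("text", ...) paragraphs and ("code", ...) fenced blocks."""
    blocks: List[Tuple[str, str]] = []
    buf: List[str] = []
    fence: Optional[str] = None  # the opening fence while inside a code block

    def flush(kind: str) -> None:
        text = "\n".join(buf).strip("\n")
        if text:
            blocks.append((kind, text))
        buf.clear()

    for line in markdown.splitlines():
        if fence is None:
            match = FENCE_OPEN.match(line)
            if match:
                flush("text")              # close the paragraph before the fence
                fence = match.group(1)
                buf.append(line)
            elif not line.strip():
                flush("text")              # a blank line ends a paragraph
            else:
                buf.append(line)
        else:
            buf.append(line)               # keep blank lines inside code verbatim
            stripped = line.strip()
            # Closing fence: same character, at least as long, nothing else on the line.
            if stripped and set(stripped) == {fence[0]} and len(stripped) >= len(fence):
                flush("code")
                fence = None
    flush("code" if fence else "text")     # an unclosed fence runs to the end of the input
    return blocks
```

Each code block is returned with its fence lines intact, so the info string (such as a language name) stays available to later steps that decide how to keep the snippet with its surrounding text.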