Open asm0dey opened 1 week ago
@asm0dey what is the source file-format you're partitioning from? HTML? Markdown maybe?
I think the first prerequisite would be recognizing and distinguishing code blocks during partitioning, and that would depend on how they were identified in each particular document format.
Sorry, forgot to mention that it's markdown!
Is your feature request related to a problem? Please describe.
In a way. I'm trying to build a RAG-based assistant for our documentation. Our documentation is code-heavy (since we develop a JDK distribution). I really want code snippets to appear only together with the text that explains them; by themselves, they're useless.
Describe the solution you'd like
The perfect solution, I think, is for unstructured to recognize code snippets and have settings that keep them in context. For example, a chunk containing a code block should always include at least one paragraph before it and one paragraph after it.
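To make the request concrete, here is a minimal sketch of the behaviour I have in mind, written in plain Python rather than against unstructured itself. The names `split_blocks`, `chunks_with_context`, `before` and `after` are hypothetical and not part of any existing API. It assumes fenced code blocks only and treats any fence line as closing the open block, which is a simplification.

```python
FENCES = ("`" * 3, "~" * 3)  # Markdown fence markers, built this way to avoid nesting fences


def split_blocks(markdown: str) -> list[tuple[str, str]]:
    """Split Markdown into ("text" | "code", content) blocks (hypothetical helper)."""
    blocks, buf, in_code = [], [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCES):
            if in_code:
                # Closing fence: finish the code block.
                buf.append(line)
                blocks.append(("code", "\n".join(buf)))
                buf, in_code = [], False
            else:
                # Opening fence: flush any pending paragraph first.
                if buf:
                    blocks.append(("text", "\n".join(buf)))
                buf, in_code = [line], True
        elif in_code or line.strip():
            buf.append(line)
        elif buf:
            # A blank line outside code ends the current paragraph.
            blocks.append(("text", "\n".join(buf)))
            buf = []
    if buf:
        blocks.append(("code" if in_code else "text", "\n".join(buf)))
    return blocks


def _extend(blocks, i, step, wanted):
    """Walk from index i in direction `step` until `wanted` text blocks are covered."""
    j = i
    while wanted > 0 and 0 <= j + step < len(blocks):
        j += step
        if blocks[j][0] == "text":
            wanted -= 1
    return j


def chunks_with_context(markdown: str, before: int = 1, after: int = 1) -> list[str]:
    """Return one chunk per code block, padded with surrounding paragraphs."""
    blocks = split_blocks(markdown)
    chunks = []
    for i, (kind, _) in enumerate(blocks):
        if kind != "code":
            continue
        start = _extend(blocks, i, -1, before)
        end = _extend(blocks, i, +1, after)
        chunks.append("\n\n".join(text for _, text in blocks[start:end + 1]))
    return chunks
```

With `before=1, after=1`, a code block never ends up in a chunk on its own unless the document has no text around it at all. Neighbouring chunks may overlap, which seems acceptable for retrieval.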
Describe alternatives you've considered
I tried playing with the `max_characters` parameter, as well as some others, but I always end up with torn code blocks that have lost their context somewhere. Another alternative would probably be to split a document cleanly by titles, regardless of section size (obviously, code can be big).
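For the second alternative, a rough sketch of what "split by titles, ignoring size" could look like in plain Python. `split_by_titles` is a hypothetical helper, not an unstructured function. It assumes ATX-style headings (`#` to `######`) and fenced code blocks, and it skips `#` lines inside a fence so that shell or Python comments are not mistaken for titles.

```python
import re

HEADING = re.compile(r"#{1,6}\s")  # ATX headings only; setext headings are not handled


def split_by_titles(markdown: str) -> list[str]:
    """Split Markdown into one section per heading, with no size limit."""
    fences = ("`" * 3, "~" * 3)
    sections, current, in_code = [], [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith(fences):
            # Track whether we are inside a fenced code block.
            in_code = not in_code
        elif not in_code and HEADING.match(line) and current:
            # A heading outside code starts a new section.
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]
```

Each section keeps its code blocks whole, at the cost of chunks that can grow arbitrarily large.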