Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

Is there a way to convert text files to markdown format? #3265

Closed shamanez closed 1 week ago

shamanez commented 1 week ago

I've been experimenting with the partition function, which does a good job of extracting the important elements. Is there a way to convert the resulting text into Markdown?

scanny commented 1 week ago

@shamanez can you say more about your use case for that?

Also, which input formats are you partitioning? Just absolutely plain text?

If so, seems like you could get a lot of the way there with something like:

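from typing import Iterator

from unstructured.documents.elements import Element


def iter_markdown_lines(elements: list[Element]) -> Iterator[str]:
    # Prefix titles and list items with Markdown markers; pass other text through.
    for e in elements:
        if e.category == "Title":
            yield f"# {e.text}"
        elif e.category == "ListItem":
            yield f"- {e.text}"
        else:
            yield e.text


md = "\n".join(iter_markdown_lines(elements))
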
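One thing to watch with a plain "\n" join is that Markdown renderers generally need a blank line between blocks, otherwise a heading, a paragraph, and a list can run together. Here is a rough sketch of a variation that groups consecutive list items and separates blocks with blank lines. `StubElement` and `iter_markdown_blocks` are hypothetical names made up for illustration; the stub only mimics the `category` and `text` attributes used above so the example runs without the library installed.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class StubElement:
    """Hypothetical stand-in exposing only the two attributes used here."""
    category: str
    text: str


def iter_markdown_blocks(elements: Iterable[StubElement]) -> Iterator[str]:
    """Yield Markdown blocks, keeping consecutive list items in one block."""
    pending_items: List[str] = []
    for e in elements:
        if e.category == "ListItem":
            pending_items.append(f"- {e.text}")
            continue
        if pending_items:
            # A non-list element ends the current run of list items.
            yield "\n".join(pending_items)
            pending_items = []
        yield f"# {e.text}" if e.category == "Title" else e.text
    if pending_items:
        # Flush a list that runs to the end of the document.
        yield "\n".join(pending_items)


sample = [
    StubElement("Title", "Release notes"),
    StubElement("NarrativeText", "Two fixes shipped."),
    StubElement("ListItem", "Faster parsing"),
    StubElement("ListItem", "Fewer crashes"),
]
# Blank lines between blocks; single newlines inside a list.
md = "\n\n".join(iter_markdown_blocks(sample))
```

Real elements from the partition functions should work the same way as long as they expose `category` and `text`, but treat that as an assumption to check against your own data.
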
shamanez commented 1 week ago

Thanks a lot for getting back to me.

I'm just wondering what the best text format is for continual pre-training of a model like Llama3.

I thought most models were pre-trained on text in formats like Markdown or JSON.

Am I missing something here?

scanny commented 1 week ago

You can convert Unstructured document-elements into JSON:

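from unstructured.partition.text import partition_text
from unstructured.staging.base import elements_to_json

# Partition a plain-text file into document elements.
elements = partition_text(filename="document.txt")
# Serialize the elements to a JSON string and print it.
print(elements_to_json(elements, indent=2))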

In fact, when you use the API for partitioning and/or chunking that's the format you get back.

I'm not much of an expert in Llama3 or pre-training, so I can't advise you on the particulars. The JSON contains everything the elements do, though, so I expect it would be fairly straightforward to transform it into something Llama3 will like.
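As a rough illustration of that kind of transformation, the sketch below flattens the JSON back into Markdown-flavoured text. It assumes each record carries "type" and "text" keys, which is worth verifying against your own output. The function name `elements_json_to_text` and the sample records are made up for the example, and whether this is a good shape for pre-training data is a separate question.

```python
import json


def elements_json_to_text(elements_json: str) -> str:
    """Flatten a JSON array of element records into Markdown-flavoured text."""
    lines = []
    for record in json.loads(elements_json):
        text = record.get("text", "").strip()
        if not text:
            continue  # skip records with no text, e.g. page breaks
        kind = record.get("type")
        if kind == "Title":
            lines.append(f"# {text}")
        elif kind == "ListItem":
            lines.append(f"- {text}")
        else:
            lines.append(text)
    return "\n".join(lines)


# Hand-written sample standing in for the string elements_to_json() returns.
sample_json = json.dumps([
    {"type": "Title", "element_id": "a1", "text": "Overview", "metadata": {}},
    {"type": "NarrativeText", "element_id": "b2", "text": "Plain paragraph.", "metadata": {}},
    {"type": "PageBreak", "element_id": "c3", "text": "", "metadata": {}},
    {"type": "ListItem", "element_id": "d4", "text": "First point", "metadata": {}},
])
corpus_text = elements_json_to_text(sample_json)
```

Working from the JSON rather than the element objects means the same function can post-process output saved from the API without importing the library.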

shamanez commented 1 week ago

Thanks a lot. Will look into this.

scanny commented 1 week ago

No worries @shamanez :)

Closing this for now as not actionable but feel free to reopen if needed.