@shamanez can you say more about your use case for that?
Also, which input formats are you partitioning? Just absolutely plain text?
If so, it seems like you could get a lot of the way there with something like:

    from typing import Iterator

    from unstructured.documents.elements import Element


    def iter_markdown_lines(elements: list[Element]) -> Iterator[str]:
        # Map each element category to a Markdown line prefix.
        for e in elements:
            if e.category == "Title":
                yield f"# {e.text}"
            elif e.category == "ListItem":
                yield f"- {e.text}"
            else:
                yield e.text


    md = "\n".join(iter_markdown_lines(elements))
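One thing to watch with a single-newline join: Markdown treats adjacent non-blank lines as one paragraph, so consecutive text elements would run together when rendered. Below is a rough sketch of a variant that separates blocks with blank lines while keeping consecutive list items together. `FakeElement` is a hypothetical stand-in with only the `category` and `text` attributes, used so the snippet runs without the library installed. It is not an Unstructured class, and `iter_markdown_blocks` is an illustrative name.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class FakeElement:
    """Hypothetical stand-in exposing only the two attributes used below."""

    category: str
    text: str


def iter_markdown_blocks(elements: Iterable[FakeElement]) -> Iterator[str]:
    """Yield Markdown blocks, merging consecutive list items into one list."""
    pending_items: list[str] = []
    for e in elements:
        if e.category == "ListItem":
            pending_items.append(f"- {e.text}")
            continue
        if pending_items:
            # A non-list element ends the current list, so flush it first.
            yield "\n".join(pending_items)
            pending_items = []
        yield f"# {e.text}" if e.category == "Title" else e.text
    if pending_items:
        yield "\n".join(pending_items)


elements = [
    FakeElement("Title", "Shopping"),
    FakeElement("NarrativeText", "Things to buy."),
    FakeElement("ListItem", "Milk"),
    FakeElement("ListItem", "Eggs"),
    FakeElement("NarrativeText", "Done."),
]
# Blank lines between blocks keep paragraphs separate when rendered.
md = "\n\n".join(iter_markdown_blocks(elements))
print(md)
```

Real elements returned by a partition function should work the same way here, as long as they expose `category` and `text` as in the snippet above.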
Thanks a lot for getting back to me.
I'm just wondering what the best text format is for continual pre-training with a model like Llama3.
I thought most models used formats like Markdown or JSON during their pre-training.
Am I missing something here?
You can convert Unstructured document-elements into JSON:

    from unstructured.partition.text import partition_text
    from unstructured.staging.base import elements_to_json

    # partition_text() takes the file path as the `filename` keyword argument.
    elements = partition_text(filename="document.txt")
    print(elements_to_json(elements, indent=2))
In fact, when you use the API for partitioning and/or chunking that's the format you get back.
I'm not much of an expert in Llama3 or pre-training so can't advise you on the particulars, but the JSON produced contains everything the elements do so I expect it would be fairly straightforward to transform that JSON into something Llama3 will like.
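As a rough illustration of that kind of transformation, here is a minimal sketch that collapses one document's element JSON into a single JSON Lines record with a `text` field. The sample JSON is hand-written and assumes the serialized form is a list of objects with `type` and `text` keys (real output carries more fields). The function name `to_training_record` and the one-record-per-document `{"text": ...}` layout are assumptions for illustration, not something Unstructured or Llama3 prescribes.

```python
import json

# Hand-written sample standing in for the string elements_to_json() returns.
# Assumption: a JSON list of objects with at least "type" and "text" keys.
elements_json = """
[
  {"type": "Title", "text": "Release notes", "metadata": {"filename": "document.txt"}},
  {"type": "NarrativeText", "text": "This version fixes two bugs.", "metadata": {"filename": "document.txt"}},
  {"type": "ListItem", "text": "Crash on empty input", "metadata": {"filename": "document.txt"}}
]
"""


def to_training_record(elements_json: str) -> str:
    """Collapse one document's elements into a single JSON Lines record."""
    elements = json.loads(elements_json)
    # Keep only non-empty text and drop the type and metadata fields.
    texts = [el["text"] for el in elements if el.get("text")]
    # json.dumps escapes the newlines, so the record stays on one line.
    return json.dumps({"text": "\n\n".join(texts)}, ensure_ascii=False)


record = to_training_record(elements_json)
print(record)
```

Writing one such line per source document gives a plain `.jsonl` corpus. Whether the element types should be kept as Markdown markers instead of being dropped depends on the training setup.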
Thanks a lot. Will look into this.
On Sat, 22 Jun 2024 at 9:15 AM, Steve Canny wrote:
> You can convert Unstructured document-elements into JSON: [...]
> https://github.com/Unstructured-IO/unstructured/issues/3265#issuecomment-2183474080
No worries @shamanez :)
Closing this for now as not actionable but feel free to reopen if needed.
I played with the partition function, which can extract important things. I'm just wondering whether there is a way to convert the final text into Markdown format.