OpenThaiGPT / openthaigpt-pretraining

Apache License 2.0

[SuperAI] Implement concatenation code for pantip data #142

Open Chawak opened 1 year ago

Chawak commented 1 year ago
new5558 commented 1 year ago

Description: Create a pipeline for ingesting JSONL Pantip data and converting it to our own JSONL format

{Title}

{Description}

<sep_token> 

{Comment 1}

<sep_token> 

{Comment 2}

<sep_token> 

{Comment 3}

...

or

{Title}

{Description}

comment 1: {Comment 1}

comment 2: {Comment 2}

comment 3: {Comment 3}

...

...
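To make the two candidate layouts concrete, here is a minimal Python sketch of both. It is not the project's implementation. The input field names (`title`, `desc`, `comments`), the output key `text`, the single-newline joiner, and the function names are all assumptions for illustration; the issue does not specify the Pantip schema, the target JSONL schema, or the actual separator token.

```python
import json

# Placeholder copied from the issue text; the real separator token is not specified.
SEP_TOKEN = "<sep_token>"


def concat_with_sep(title, description, comments, sep=SEP_TOKEN, joiner="\n"):
    """Layout 1: title, description, then every comment preceded by a separator token."""
    parts = [title, description]
    for comment in comments:
        parts.append(sep)
        parts.append(comment)
    return joiner.join(parts)


def concat_with_labels(title, description, comments, joiner="\n"):
    """Layout 2: title, description, then comments labelled 'comment N: ...'."""
    parts = [title, description]
    for i, comment in enumerate(comments, start=1):
        parts.append(f"comment {i}: {comment}")
    return joiner.join(parts)


def convert_line(line, formatter=concat_with_sep):
    """Convert one input JSONL line into one output JSONL line.

    The keys 'title', 'desc', 'comments' (input) and 'text' (output)
    are hypothetical and should be replaced with the real schema.
    """
    record = json.loads(line)
    text = formatter(record["title"], record["desc"], record.get("comments", []))
    # ensure_ascii=False keeps Thai characters readable in the output file.
    return json.dumps({"text": text}, ensure_ascii=False)


if __name__ == "__main__":
    sample = json.dumps(
        {"title": "Title", "desc": "Description", "comments": ["first", "second"]}
    )
    print(convert_line(sample))
    print(convert_line(sample, formatter=concat_with_labels))
```

Passing the formatter as a parameter keeps the choice between the two layouts open until one is agreed on in this issue.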

Deliverables: