OpenThaiGPT / openthaigpt-pretraining

Apache License 2.0
[SIIT] Crawl alternative data #216

Open ArthurMinovsky opened 1 year ago

ArthurMinovsky commented 1 year ago

Crawl data

  1. Collect any data that is suitable for our LLM.
  2. The data format should match what the following returns:

`from datasets import load_dataset

thsum = load_dataset('thaisum')`

Output:
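For crawled data to load the same way as `thaisum`, one option is to store it as JSON Lines, one record per line. The following is a minimal sketch, not the project's actual pipeline: the column names in `FIELDS`, the helper names `to_record` and `write_jsonl`, the sample record, and the output file name are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical column names modelled on a summarization-style corpus;
# the issue does not specify the exact schema.
FIELDS = ("title", "body", "summary", "url")


def to_record(raw: dict) -> dict:
    """Keep only the expected columns, filling missing ones with empty strings."""
    return {key: str(raw.get(key, "")).strip() for key in FIELDS}


def write_jsonl(records, path) -> Path:
    """Write one JSON object per line; ensure_ascii=False keeps Thai text readable."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as f:
        for raw in records:
            f.write(json.dumps(to_record(raw), ensure_ascii=False) + "\n")
    return path


# A made-up crawled page; the "extra" key is dropped by to_record.
crawled = [
    {
        "title": "ตัวอย่างข่าว",
        "body": "เนื้อหาข่าว",
        "summary": "สรุป",
        "url": "https://example.com/1",
        "extra": "not part of the schema",
    },
]

out_path = write_jsonl(crawled, Path(tempfile.gettempdir()) / "crawled_sample.jsonl")
```

A file written this way can then be read with `load_dataset("json", data_files=str(out_path))`, which gives the crawled data the same kind of object as the `thaisum` call above.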

ArthurMinovsky commented 1 year ago

Examples of data sources: