OpenThaiGPT / openthaigpt-pretraining

Apache License 2.0

Initial pantip preprocessing code #209

Closed: boat1603 closed this pull request 1 year ago

boat1603 commented 1 year ago

Why this PR

This PR adds clean_html_tags and read_jsonl for preprocessing the Pantip 2G data.
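The PR description does not include the code itself, so the following is only a minimal sketch of what two helpers with these names might look like. The function names come from the description; the signatures, the regex-based tag stripping, and the whitespace handling are assumptions and may differ from what the PR actually implements.

```python
import html
import json
import re
from typing import Iterator

# Assumed pattern: treats anything shaped like <...> as an HTML tag.
_TAG_RE = re.compile(r"<[^>]+>")


def clean_html_tags(text: str) -> str:
    """Strip HTML tags, decode HTML entities, and collapse whitespace.

    Illustrative only: the real clean_html_tags in the PR may behave differently.
    """
    no_tags = _TAG_RE.sub(" ", text)      # replace each tag with a space
    decoded = html.unescape(no_tags)      # e.g. "&amp;" becomes "&"
    return re.sub(r"\s+", " ", decoded).strip()


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a JSONL file.

    Illustrative only: reading lazily is an assumption that suits a
    multi-gigabyte dump, not a confirmed detail of the PR.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Under these assumptions, the two would be combined by iterating over read_jsonl(path) and passing each record's text field (whatever it is called in the Pantip data) through clean_html_tags.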

Changes

Related Issues

Close #

Checklist