OpenThaiGPT / openthaigpt-pretraining

Apache License 2.0
21 stars, 10 forks

refactor(model): don't use chunk in dataset tokenized #206

Closed. boss-chanon closed this 1 year ago

boss-chanon commented 1 year ago

Why this PR

Don't save the tokenized dataset in chunks.

Changes

Related Issues

Close #

Checklist

boat1603 commented 1 year ago

@boss-chanon please resolve the conflicts.

boss-chanon commented 1 year ago

Please review again. I switched back to saving in chunks, because a large dataset will run out of memory if it is not chunked. I also fixed the tokenized-dataset loader so that it can load chunked data.
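
For reviewers, the idea can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `toy_tokenize`, `save_tokenized_in_chunks`, `load_tokenized_chunks`, the `chunk_*.jsonl` file naming, and the JSON Lines format are all hypothetical names and choices made up for the example. It shows tokenized examples being flushed to disk every `chunk_size` rows so that memory use stays bounded, and a loader that streams the chunk files back in order.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator, List


def toy_tokenize(text: str) -> List[int]:
    # Hypothetical stand-in for the real tokenizer: one fake id per word.
    return [len(word) for word in text.split()]


def _write_chunk(rows: List[List[int]], out_dir: Path, index: int) -> None:
    # Each chunk is its own JSON Lines file (assumed naming scheme).
    path = out_dir / f"chunk_{index:05d}.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def save_tokenized_in_chunks(
    texts: Iterable[str], out_dir: Path, chunk_size: int
) -> int:
    """Tokenize texts and flush every chunk_size examples to a separate file.

    Only one chunk is held in memory at a time. Returns the chunk count.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    buffer: List[List[int]] = []
    num_chunks = 0
    for text in texts:
        buffer.append(toy_tokenize(text))
        if len(buffer) == chunk_size:
            _write_chunk(buffer, out_dir, num_chunks)
            buffer = []
            num_chunks += 1
    if buffer:  # flush the final, possibly smaller, chunk
        _write_chunk(buffer, out_dir, num_chunks)
        num_chunks += 1
    return num_chunks


def load_tokenized_chunks(out_dir: Path) -> Iterator[List[int]]:
    """Stream tokenized examples back, one chunk file at a time, in order."""
    for path in sorted(out_dir.glob("chunk_*.jsonl")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)
```

Because the loader is a generator, a caller can iterate over the whole tokenized dataset without ever materializing it as a single list.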