VinAIResearch / PhoGPT

PhoGPT: Generative Pre-training for Vietnamese (2023)
Apache License 2.0
739 stars 67 forks

Need some information about the tokenizer #29

Open xtfocus opened 4 months ago

xtfocus commented 4 months ago

Hi, thanks for the great work.

I'm new to Vietnamese language modeling. In several major articles from 2019-2021, word segmentation was treated as the standard step before tokenization. I appreciate the idea, but I'm still not sure whether it is actually necessary, and I can't find much information to answer that myself.

Then I took a look at your paper: you trained a BPE tokenizer (a form of sub-word tokenization). I have a few questions:


  1. Is it correct that word segmentation is not used at all to create PhoGPT? If so, I would love to hear the reasoning.
  2. You used word segmentation for PhoBERT. Why didn't you use BPE directly back then?
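
To make the distinction I'm asking about concrete, here is a minimal sketch (not PhoGPT's or PhoBERT's actual pipeline; the lexicon and `segment` function are hypothetical). PhoBERT-style preprocessing merges the syllables of each multi-syllable word with underscores before the tokenizer runs, whereas a plain BPE pipeline would consume the raw syllable sequence directly:

```python
# Toy lexicon of known two-syllable Vietnamese words (hypothetical example;
# real pipelines use a full segmenter such as the one in VnCoreNLP).
LEXICON = {("học", "sinh"), ("đại", "học")}

def segment(text: str) -> str:
    """Greedy longest-match sketch: join syllables of known words with '_'."""
    syllables = text.split()
    out, i = [], 0
    while i < len(syllables):
        if i + 1 < len(syllables) and (syllables[i], syllables[i + 1]) in LEXICON:
            out.append(syllables[i] + "_" + syllables[i + 1])
            i += 2
        else:
            out.append(syllables[i])
            i += 1
    return " ".join(out)

raw = "học sinh vào đại học"
print(segment(raw))  # word-segmented form: "học_sinh vào đại_học"
print(raw)           # raw syllables, as a BPE tokenizer would see them
```

My question 1 is essentially whether PhoGPT skips the `segment` step entirely and lets BPE learn sub-word units over raw syllables.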