Jingjing-NLP / VOLT

Code for paper "Vocabulary Learning via Optimal Transport for Neural Machine Translation"

What exactly is V_S[t]? #20

Open kirefu opened 3 years ago

kirefu commented 3 years ago

From my understanding of the paper, S[t] for a particular value of t is the vocab size, so V_S[t] is the set of all possible vocabularies of that size drawn from a given corpus, i.e. for a given element v in V_{S[t] = x}, |v| = x.

In the paper, you state:

"To this end, let T ∈ V_S[t] be the vocabulary containing top S[t] most frequent tokens"

However, in the code you define tokens to be the BPE merge rules rather than the actual tokens of the BPE-ised text, so T is instead the set of top BPE subword bigrams for a given value of t. I'm not sure this is the same thing as the "most frequent tokens" described in the paper. For example, in English "t h" is normally the first BPE merge rule, but the token "th" would definitely not be the most frequent token in the text, as many instances of it would get merged into other tokens, decreasing its frequency.

I would be grateful if you could shed some light on this.

Best, Faheem

Jingjing-NLP commented 3 years ago

Hi Faheem,

Sorry for the late reply. I would like to explain this question.

In the implementation, we adopt merge actions as token candidates to initialize our algorithm. Merge actions are actually a special kind of token candidate: for example, "t h" defines the token candidate "th". Since merge actions are strictly ranked by frequency, the merge action with the top frequency must be the subword candidate (excluding single characters) with the top frequency.

S[t] contains the most frequent tokens, and thus it is ok to use merge actions to initialize S[t].

"For example, in English "t h" is normally the first bpe merge rule, but the token "th" would definitely not be the most frequent token in the text, as many of this token would definitely get merged with other tokens, thus decreasing in frequency. ""

We considered this point as well. In the implementation, the frequency of "th" is the sum of the frequencies of all spans containing "th". You can refer to lines 48-74 in get_token.py for more details.
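Roughly, the counting works like the sketch below (this is only an illustration of the idea, not the actual code in get_token.py; the function name and the toy input are made up):

```python
def candidate_frequency(segmented_lines, candidate):
    """Frequency of a candidate token (e.g. "th" from the merge action "t h"):
    the total number of times it occurs inside the whitespace-separated spans
    of the segmented text, so "th" is still counted when it sits inside longer
    merged tokens such as "the" or "think"."""
    total = 0
    for line in segmented_lines:
        for span in line.split():
            total += span.count(candidate)
    return total

# Toy example: "th" occurs inside "the", inside "think", and as a standalone span.
lines = ["the think er th ought"]
for cand in ["th", "the", "ink"]:
    print(cand, candidate_frequency(lines, cand))
# th 3, the 1, ink 1
```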

Hope my response addresses your problems. If you have other questions, please feel free to let us know.

Thanks, Jingjing

kirefu commented 3 years ago

Hi Jingjing,

"S[t] contains the most frequent tokens" - I'm getting confused on notation, as I thought S[t] was just a number from a sequence. Do you mean V_S[t]? And if so, I thought you were initialising T ∈ V_S[t], which you define as "the vocabulary containing top S[t] most frequent tokens" - is this the most frequent tokens with respect to the concatenated (and already) BPE-ised^1 training set?

^1 - (where no. of BPE merges is the max value for S[t])

Jingjing-NLP commented 3 years ago


"S[t] contains the most frequent tokens" - I'm getting confused on notation, as I thought S[t] was just a number from a sequence. Do you mean V_S[t]?

Yes, S[t] should be V_{S[t]}. It is a typo. Let $\mathbb{T} \in \mathbb{V}_{\bm{S}[t]}$ be the vocabulary containing the top $S[t]$ most frequent tokens. Strictly speaking, it should be $\mathbb{T}$.

Jingjing-NLP commented 3 years ago

And if so, I thought you were initialising T ∈ V_S[t], which you define as "the vocabulary containing top S[t] most frequent tokens" - are these the most frequent tokens with respect to the concatenated (and already BPE-ised^1) training set?

^1 - (where no. of BPE merges is the max value for S[t])

I do not get "BPE-ised^1 training set". Would you please explain this a little bit more?

kirefu commented 3 years ago

Sorry, the "^1" was meant to act as a footnote, so that the sentence was not too clunky.

What I meant is that the training data being fed into your VOLT script is not the original text, but a version that has already undergone BPE.

From the README:

"python3 ../ot_run.py --source_file bpeoutput/source.file ......."

I think what is confusing here is the use of the word "tokens". Generally, the standard definition of a token is a sequence of characters delimited by whitespace. So I would interpret "the vocabulary containing top S[t] most frequent tokens" as splitting your training set (BPE-ed as per your README) on whitespace, counting the frequency of each subword, and then ordering them by frequency (see the sketch below). However, if I understand it correctly, you've defined the "top frequent tokens" to be the subwords of the most frequent BPE merge rules, but I do not see this mentioned in the paper. The "frequency" of these tokens (merge rules) is calculated by how many times the span appears in the bpeoutput/source.file text. Why the BPE-ised training data and not the original?
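To make my interpretation concrete, here is a rough sketch of the whitespace-token reading I had in mind (toy data and helper name are just for illustration):

```python
from collections import Counter

# My reading of "top S[t] most frequent tokens": split the (already BPE-ised)
# training set on whitespace and rank the whole subwords by their counts.
def whitespace_token_counts(segmented_lines):
    counts = Counter()
    for line in segmented_lines:
        counts.update(line.split())
    return counts

print(whitespace_token_counts(["th@@ e th@@ ink th@@ e"]).most_common(3))
# [('th@@', 3), ('e', 2), ('ink', 1)]  -- each whole subword counted once per occurrence
```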

Jingjing-NLP commented 3 years ago

Hi Faheem,

We use "token" to represent item candidate to be included in a vocabulary. We use merge actions as item candidates here.

Strictly speaking, we should take the raw texts as inputs to calculate frequency. However, BPE segmentation only adds special separators into the raw texts, so it is ok to use this file as an alternative. For example, given the sentence "abcbdceabab" and the segmented sentence "abc bd ceabab", we get the same frequency for the item "ab". Therefore, we directly use the segmented texts as inputs for convenience, since one can get the segmented texts and the vocabulary at the same time via segmentation toolkits.
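As a quick check on the toy strings above (the small helper here is only illustrative, counting occurrences inside whitespace-separated spans):

```python
def span_count(text, span):
    """Count occurrences of `span` inside each whitespace-separated segment."""
    return sum(seg.count(span) for seg in text.split())

raw = "abcbdceabab"      # raw sentence
seg = "abc bd ceabab"    # same sentence after BPE segmentation

print(span_count(raw, "ab"), span_count(seg, "ab"))  # 3 3 -- identical frequency
```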

We will add these explanations to the paper. Thanks again for your comments!

Best, Jingjing

kirefu commented 3 years ago

Continuing from your example, would we get the same frequency for "cb", i.e. where the merge rule is now split across segments?

Jingjing-NLP commented 3 years ago

Given the sequence "abc bd ceabab", "cb" is not a valid merging action. BPE starts from characters and follows the merging actions to merge text. Therefore, given a merging rule, BPE always merges that pair together. For example, given the merging rule "a b", "ab" will always be merged.
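For intuition, a minimal sketch of how BPE applies merge rules (simplified, not the actual subword-nmt code; real BPE applies the learned merges by priority, which amounts to the same order here):

```python
def apply_merges(word, merges):
    """Start from characters and apply the merge rules in order."""
    symbols = list(word)
    for a, b in merges:
        i, merged = 0, []
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)   # the pair (a, b) is always merged
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

print(apply_merges("abab", [("a", "b")]))  # ['ab', 'ab']
```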

kirefu commented 3 years ago

"Strictly speaking, we should take the raw texts as inputs to calculate frequency" - if you used raw text, "cb" in "abcbdceabab" would be counted, hence there is a difference between using raw text as input and a BPEised version. "cb" can definitely be a valid merge action downstream. e.g. just because "think" is segmented as "th ink", does not make "hi" an invalid merge action for other sequences.