Semantics-of-Sustainability / tempo-embeddings

Tools for analysing contextual (temporal) word embeddings
Apache License 2.0

Decouple Tokenization #39

Open angel-daza opened 8 months ago

angel-daza commented 8 months ago

Currently, the Model class does several things in the compute_embeddings() step:

It is easy to return both full and compressed embeddings, but the tokenization of each passage is assigned internally only through this method. The Passage.tokenization attribute exists, but it holds an Encoding object. It would be useful to have a function that can return the passage text tokenized as a list of strings (for example, to interact with a database more smoothly).
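A rough sketch of what such a function could look like, assuming a Hugging Face fast tokenizer (the helper name and the model are placeholders, not the current API):

```python
from transformers import AutoTokenizer

def passage_words(text: str, tokenizer) -> list[str]:
    """Hypothetical helper: return the passage text as a list of word strings,
    reconstructed from the fast tokenizer's word/character mappings."""
    encoding = tokenizer(text, add_special_tokens=False)
    words, seen = [], set()
    for word_id in encoding.word_ids():
        if word_id is None or word_id in seen:
            continue
        seen.add(word_id)
        start, end = encoding.word_to_chars(word_id)  # character span of the full word
        words.append(text[start:end])
    return words

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")  # placeholder model
print(passage_words("De duurzaamheid van het project.", tokenizer))
# e.g. ['De', 'duurzaamheid', 'van', 'het', 'project', '.']
```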

carschno commented 8 months ago

Just to keep in mind: the tokenization done by a Transformer model does not align with human tokenization, so I am not sure whether this issue is about getting a list of human-readable words and punctuation, or about the internal Transformer tokens.
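For illustration (the exact split depends on the model's vocabulary, so treat the output as an example):

```python
from transformers import AutoTokenizer

# A multilingual WordPiece model splits a Dutch word into subtokens,
# unlike a simple whitespace/word tokenizer.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tokenizer.tokenize("duurzaamheid"))  # e.g. ['du', '##ur', '##za', '##am', '##heid']
print("duurzaamheid".split())              # ['duurzaamheid']
```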

angel-daza commented 8 months ago

Basically, the issue is that if we now save records in the database, the Encoding object cannot be saved. Therefore, if a Passage is retrieved from the DB, we currently have no option but to re-run the transformer encoding step just to repopulate the Passage.tokenization attribute, which is required by the clustering step (among others, I suppose).
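One possible direction, sketched here under the assumption that a tokenizers.Encoding is what currently lives in Passage.tokenization: persist only the fields the downstream steps need as plain JSON, so a Passage loaded from the DB does not have to be re-encoded.

```python
import json

def encoding_to_json(encoding) -> str:
    # Keep only the DB-friendly parts of the Encoding (field names as in the
    # tokenizers library; whatever consumes Passage.tokenization would need to
    # accept this plain structure instead of a live Encoding object).
    return json.dumps({
        "tokens": encoding.tokens,       # subword strings
        "offsets": encoding.offsets,     # (char_start, char_end) per subword
        "word_ids": encoding.word_ids,   # subword -> word index (None for specials)
    })

def encoding_from_json(payload: str) -> dict:
    return json.loads(payload)
```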

The easiest option would be to use your current internal methods that retrieve character spans in the text and reconstruct the tokens as far as possible (most of the time, the reconstructed wordpieces are readable enough). I don't follow your code 100% yet, but I am guessing there is some "token segmentation" you do for the TF-IDF step to retrieve the top words that represent the "cluster topic".

A more elegant and "human-readable" alternative would be to use a word tokenizer (e.g., Stanza) to create word tokens as a first step, save the tokens in the metadata, and also feed them into the transformer when computing the embeddings. That way, when we retrieve a passage from the DB, the tokenized version is already there. This could also give us the option to test with "real sentences", although sentence segmentation might harm us instead of helping us.
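A sketch of that flow, assuming Stanza for word tokenization and a Hugging Face fast tokenizer (the model and language code are just placeholders):

```python
import stanza
from transformers import AutoTokenizer

nlp = stanza.Pipeline(lang="nl", processors="tokenize")
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

text = "De duurzaamheid van het project."
# Word-tokenize first; these strings could be stored in the passage metadata.
words = [word.text for sentence in nlp(text).sentences for word in sentence.words]

# Feed the pre-split words into the transformer tokenizer.
encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# encoding.word_ids() maps each subword back to an index in `words`.
```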

Both options would need some extra thinking to do them in a clean way, of course.

carschno commented 8 months ago

The code is a bit convoluted indeed because I was experimenting with different approaches. The words for TF-IDF are generated in the Passage.words() method, which uses either the Encoding object stored in self.tokenization or a simple whitespace tokenizer.

At some point, I was also considering another solution (such as Stanza), but figured it might be overkill. I think it mostly depends on the use case, as in why users need to see tokens. When you mention database interaction, I reckon you might be referring to search, but then the database might apply its own search functionality again, including tokenization.

angel-daza commented 8 months ago

To take into account: I found out that the Passage.words() method returns "duplicated words" in the sense that e.g. if one word was internally split into three subtokens, the iterable will contain the same full word three times (because each subtoken is mapped back to the same full word). This should be changed since it might considerably bias the TF-IDF calculations (longer words will get more counts).

carschno commented 8 months ago

> To take into account: I found out that the Passage.words() method returns "duplicated words" in the sense that e.g. if one word was internally split into three subtokens, the iterable will contain the same full word three times (because each subtoken is mapped back to the same full word). This should be changed since it might considerably bias the TF-IDF calculations (longer words will get more counts).

Interesting! Two questions:

  1. This should not be an issue for the simple whitespace tokenizer, right?
  2. Assuming that a long word is split in the same way in all its occurrences, does it still bias the TF-IDF calculation (in a harmful way)? For instance, with the split duurzaam + heid, I would expect that both TF and DF for heid increase (i.e. lower TF-IDF), whereas duurzaam does not (i.e. TF-IDF as desired). I might be missing something, though.

As a side note, the TF-IDF calculation is done through the scikit-learn implementation.
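To make the question concrete, here is a toy sketch with made-up token lists, using a TfidfVectorizer with a pass-through analyzer (not the project's actual setup):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Documents are already lists of words, so the analyzer just passes them through.
vectorizer = TfidfVectorizer(analyzer=lambda doc: doc)

with_duplicates = [["duurzaamheid", "duurzaamheid", "duurzaamheid", "van", "het", "project"]]
without_duplicates = [["duurzaamheid", "van", "het", "project"]]

# The term frequency of "duurzaamheid" is inflated threefold in the first case.
print(vectorizer.fit_transform(with_duplicates).toarray())
print(vectorizer.fit_transform(without_duplicates).toarray())
```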

angel-daza commented 8 months ago

True, for whitespace tokenization it is safe (which I assume is the current default?). And for the second question, your example makes sense because those subtokens are meaningful, but I would expect a significant percentage of non-meaningful token pieces.

Also, as the code stands right now, if the word duurzaamheid is split into duurzaam + heid, the method returns the subtokens already mapped back to their full word, thus: [duurzaamheid, duurzaamheid]. And if it were split into duur + zaam + heid, the returned list would be [duurzaamheid, duurzaamheid, duurzaamheid].
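A possible fix, sketched without reference to the actual implementation: collapse consecutive identical word indices so each full word is emitted once, however many subwords it was split into.

```python
def dedup_mapped_words(mapped_words: list[str], word_ids: list[int]) -> list[str]:
    # mapped_words: current output, e.g. ["duurzaamheid", "duurzaamheid", "duurzaamheid"]
    # word_ids:     subword -> word mapping from the Encoding, e.g. [0, 0, 0]
    deduped, previous = [], None
    for word, word_id in zip(mapped_words, word_ids):
        if word_id != previous:
            deduped.append(word)
            previous = word_id
    return deduped

print(dedup_mapped_words(["duurzaamheid", "duurzaamheid", "duurzaamheid"], [0, 0, 0]))
# -> ['duurzaamheid']
```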