angel-daza opened 8 months ago
Just to keep in mind: the tokenization done by a Transformer model does not align with human tokenization, so I am not sure whether this issue is about getting a list of human-readable words and punctuation, or about the internal Transformer tokens.

Basically, the issue is that if we now save records in the database, the `Encoding` object cannot be saved. Therefore, if a `Passage` is retrieved from the DB, we currently have no option but to re-run the transformer encoding step only to repopulate the `Passage.tokenization` parameter, which is required by the clustering step (among others, I suppose).
The easiest option is to use your current internal methods that retrieve character spans in the text and reconstruct the tokens as far as possible (most of the time, the reconstructed wordpieces are readable enough). I don't follow your code 100% yet, but I am guessing there is some "token segmentation" you are doing for the TF-IDF step to retrieve the top words that represent the "cluster topic".
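A minimal sketch of that reconstruction, assuming the stored encoding exposes per-subtoken character offsets and word ids (as the fast HF tokenizers `Encoding` does); the `FakeEncoding` class is only a stand-in so the example is self-contained:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class FakeEncoding:
    # Stand-in for a tokenizers Encoding: one entry per subtoken.
    offsets: List[Tuple[int, int]]   # (char_start, char_end) per subtoken
    word_ids: List[Optional[int]]    # word index per subtoken, None for specials

def words_from_offsets(text: str, enc: FakeEncoding) -> List[str]:
    """Reconstruct full words by merging the character spans of all
    subtokens that share the same word id."""
    spans = {}
    for (start, end), wid in zip(enc.offsets, enc.word_ids):
        if wid is None:  # skip special tokens like [CLS]/[SEP]
            continue
        s, e = spans.get(wid, (start, end))
        spans[wid] = (min(s, start), max(e, end))
    return [text[s:e] for _, (s, e) in sorted(spans.items())]

text = "duurzaamheid telt"
# 'duurzaamheid' split into two subtokens, plus two special tokens:
enc = FakeEncoding(offsets=[(0, 0), (0, 8), (8, 12), (13, 17), (0, 0)],
                   word_ids=[None, 0, 0, 1, None])
print(words_from_offsets(text, enc))  # ['duurzaamheid', 'telt']
```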
A more elegant and "human-readable" alternative could be to use a word tokenizer (e.g., stanza) to create word tokens as a first step, save the tokens in the metadata, and also feed them into the transformer when computing the embeddings; this way, when we retrieve a passage from the DB, the tokenized version is already there. This could also give us the option to test with "real sentences", although sentence segmentation might harm us instead of helping.
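A rough sketch of that second option, with a simple regex tokenizer standing in for stanza (the `is_split_into_words=True` call in the comment is how a fast HF tokenizer accepts pre-tokenized input):

```python
import re

def word_tokenize(text: str):
    # Simple regex word tokenizer (a stand-in for e.g. stanza):
    # keeps runs of word characters together and splits off punctuation.
    return re.findall(r"\w+|[^\w\s]", text)

words = word_tokenize("Duurzaamheid, bijvoorbeeld.")
print(words)  # ['Duurzaamheid', ',', 'bijvoorbeeld', '.']

# These word tokens could be stored in the Passage metadata and later
# passed to a fast HF tokenizer as pre-tokenized input, e.g.:
#   encoding = hf_tokenizer(words, is_split_into_words=True)
```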
Both options would need some extra thinking to do it in a clean way of course.
The code is a bit convoluted indeed, because I was experimenting with different approaches. The words for TF-IDF are generated in the `Passage.words()` method, which uses either the `Encoding` object stored in `self.tokenization` or a simple whitespace tokenizer.
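For context, a hypothetical sketch of that dual behavior (the actual `Passage` class will differ; this only illustrates the fallback described above):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Passage:
    text: str
    tokenization: Optional[object] = None  # a tokenizers Encoding, when present

    def words(self) -> List[str]:
        # Hypothetical sketch: use the stored Encoding's character offsets
        # if available, otherwise fall back to whitespace splitting.
        if self.tokenization is not None:
            return [self.text[s:e]
                    for s, e in self.tokenization.offsets if e > s]
        return self.text.split()

p = Passage("duurzaamheid telt")  # no Encoding stored, e.g. fresh from the DB
print(p.words())  # ['duurzaamheid', 'telt']
```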
At some point, I was also considering another solution (such as Stanza), but figured it might be overkill. I think it mostly depends on the use case, as in why do users need to see tokens. When you mention database interaction, I reckon you might be referring to search, but then the database might apply its own search functionality again, including tokenization.
To take into account: I found out that the `Passage.words()` method returns "duplicated words", in the sense that if a word was internally split into, e.g., three subtokens, the iterable will contain the same full word three times (because each subtoken is mapped back to the same full word). This should be changed, since it might considerably bias the TF-IDF calculations (longer words will get more counts).
Interesting! Two questions:

For a split like `duurzaam` + `heid`, I would expect that both TF and DF for `heid` increase (i.e., lower TF-IDF), whereas for `duurzaam` they do not (i.e., TF-IDF as desired). I might be missing something, though.

As a side note, the TF-IDF calculation is done through the Scikit-Learn implementation.
True, for the case of whitespace tokenization it is safe (which I assume is the current case?). And for the second case, your example makes sense since the subtokens are meaningful, but I would assume a significant percentage of non-meaningful token pieces.
Also, as the code stands right now, if the word `duurzaamheid` is split into `duurzaam` + `heid`, the method returns the subtokens already mapped back to their full words, thus: [`duurzaamheid`, `duurzaamheid`]. And in case it was split into `duur` + `zaam` + `heid`, the returned list would be [`duurzaamheid`, `duurzaamheid`, `duurzaamheid`].
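One way to avoid this bias would be to collapse the returned list by word id before counting. A hedged sketch (the helper name is hypothetical; it assumes the subtoken-to-word list and the `Encoding`'s `word_ids` are both available):

```python
from typing import List, Optional

def unique_words(words: List[str],
                 word_ids: List[Optional[int]]) -> List[str]:
    """Collapse the per-subtoken word list so each word id contributes
    exactly one word. `words` is the subtoken-to-full-word mapping
    described above; `word_ids` holds one word index per subtoken
    (None for special tokens)."""
    seen = set()
    out = []
    for word, wid in zip(words, word_ids):
        if wid is None or wid in seen:
            continue
        seen.add(wid)
        out.append(word)
    return out

# 'duurzaamheid' split into three subtokens yields three copies today:
dup = ["duurzaamheid", "duurzaamheid", "duurzaamheid", "telt"]
print(unique_words(dup, [0, 0, 0, 1]))  # ['duurzaamheid', 'telt']
```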
Currently, the `Model` class does several things in the `compute_embeddings()` step. It is easy to return both full and compressed embeddings, but the tokenization of each passage is assigned internally only through this method. The parameter `Passage.tokenization` exists, but it holds an `Encoding`. It would be useful to have a function that can return the passage text tokenized as a list of strings (for example, to interact with a database more smoothly).
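Such a function could look roughly like this, assuming the `Encoding` exposes `word_ids` and `word_to_chars` as the fast tokenizers do (`StubEncoding` is only a test stand-in, not the real class):

```python
from typing import List

class StubEncoding:
    """Minimal stand-in for a tokenizers Encoding, exposing only the
    two members the sketch below relies on."""
    def __init__(self, word_ids, char_spans):
        self.word_ids = word_ids      # one word index per subtoken
        self._spans = char_spans      # word index -> (char_start, char_end)
    def word_to_chars(self, word_index):
        return self._spans[word_index]

def tokenized_strings(text: str, enc) -> List[str]:
    # Serialize the Encoding as a plain list of word strings, which can
    # be stored alongside the passage in the database and retrieved
    # later without re-running the transformer encoding step.
    wids = sorted({w for w in enc.word_ids if w is not None})
    return [text[s:e] for s, e in (enc.word_to_chars(w) for w in wids)]

text = "duurzaamheid telt"
enc = StubEncoding([None, 0, 0, 1, None], {0: (0, 12), 1: (13, 17)})
print(tokenized_strings(text, enc))  # ['duurzaamheid', 'telt']
```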