Closed yaeldekel closed 5 years ago
Most text models in TensorFlow (and in DNN platforms in general) use an embedding layer to handle text. This differs from the bag-of-words approach, where a vector is formed over the words/characters in the text: the vector's indices correspond to words/characters, and the values are the TF, TF-IDF, or other scores computed for them.
The bag-of-words model requires vectors to be represented in sparse format because the number of distinct words/characters appearing in the text is very large. However, when using models with embedding layers, sparse format is not needed: the input to an embedding layer is a sequence of token indices, which is typically short, so dense format is fine.
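To make the contrast concrete, here is a minimal sketch in plain Python (with a small hypothetical vocabulary, not taken from the issue) of the two representations: a sparse bag-of-words score vector versus the dense index sequence an embedding layer consumes.

```python
# Hypothetical vocabulary mapping tokens to indices.
vocab = {"the": 0, "cat": 1, "sat": 2, "mat": 3}

def bag_of_words(tokens, vocab):
    """Bag-of-words: index -> term-frequency score, stored sparsely
    (only non-zero entries), since the vector is vocabulary-sized."""
    counts = {}
    for tok in tokens:
        if tok in vocab:
            idx = vocab[tok]
            counts[idx] = counts.get(idx, 0) + 1
    return counts

def embedding_input(tokens, vocab, unk=len(vocab)):
    """Embedding-layer input: a short dense sequence of token indices.
    Out-of-vocabulary tokens map to an assumed 'unknown' index."""
    return [vocab.get(tok, unk) for tok in tokens]

tokens = "the cat sat on the mat".split()
print(bag_of_words(tokens, vocab))     # {0: 2, 1: 1, 2: 1, 3: 1}
print(embedding_input(tokens, vocab))  # [0, 1, 2, 4, 0, 3]
```

The sparse dictionary only stores the handful of non-zero scores out of a vocabulary-sized vector, while the embedding input is just one small integer per token, which is why dense format suffices there.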
However, when working with text models, I found the following issues.
Both TermLookupTransform and TermTransform did not work. I currently don't see any issue with retrieving outputs from TensorFlow. I will write more if I encounter other issues.
The TermLookupTransform does not seem to operate on vectors, while TermTransform outputs a Key type, which currently cannot be used in TensorflowTransform.
This issue currently blocks the UI tooling for ML.NET. Can this issue be addressed sooner to unblock the tooling work?
@asthana86, this issue cannot be closed currently because it requires a few more features to be developed in ML.NET, such as a padding and trimming transform. Can you please let me know how this is blocking the tooling work? I may be able to help unblock you then.
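For readers unfamiliar with the padding/trimming transform mentioned above, here is a minimal sketch in plain Python (names and the padding index are assumptions, not from ML.NET) of what such a transform does: embedding layers expect fixed-length index sequences, so shorter sequences are padded and longer ones trimmed.

```python
PAD = 0  # assumed padding index; real systems often reserve index 0 for this

def pad_or_trim(indices, length, pad=PAD):
    """Return `indices` padded with `pad` (or trimmed) to exactly `length`."""
    if len(indices) >= length:
        return indices[:length]          # trim long sequences
    return indices + [pad] * (length - len(indices))  # pad short ones

print(pad_or_trim([5, 9, 2], 5))           # [5, 9, 2, 0, 0]
print(pad_or_trim([5, 9, 2, 7, 1, 4], 5))  # [5, 9, 2, 7, 1]
```

After this step every row has the same length, which is what lets a batch of token-index sequences be fed to an embedding layer as a dense rectangular tensor.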
Text usage would be handled here: https://github.com/dotnet/machinelearning/issues/2545
The following example shows use cases for text classification and string input/output. Most of the points raised in this issue are covered now. I am closing it.
We should know how TF handles text inputs, and whether it supports sparse inputs.