Investigate using text and sparse input in TensorFlow

yaeldekel commented 6 years ago

We should know how TF handles text inputs, and whether it supports sparse inputs.

zeahmed commented 6 years ago

Most text models in TensorFlow (and in DNN platforms in general) use an embedding layer to handle text. This contrasts with the bag-of-words approach, where a vector is formed over the words/characters in the text: the indices of the vector refer to the words/characters, and the values hold the TF, TF-IDF, or other scores computed for them.

The bag-of-words model requires vectors to be represented in sparse format because the number of words/characters that can appear in the text is very large. However, when using models with embedding layers, sparse format is not needed because the input to an embedding layer is typically not that large. So we are fine with dense format.
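
To make the difference concrete, here is a minimal sketch in plain Python (not ML.NET or TensorFlow code). The toy vocabulary and the helper names `bag_of_words`, `to_dense`, and `token_ids` are made up for illustration. It shows that a bag-of-words vector has one slot per vocabulary entry, which is why it is usually stored sparsely, while the input to an embedding layer is just one integer id per token.

```python
from collections import Counter

# Hypothetical toy vocabulary; a real one typically has many thousands of entries.
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "dog": 5}


def bag_of_words(tokens, vocab):
    """Sparse bag-of-words: {vocabulary index: count}, non-zero entries only."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return dict(sorted(counts.items()))


def to_dense(sparse, size):
    """Expand the sparse form into a vector with one slot per vocabulary entry."""
    dense = [0] * size
    for index, value in sparse.items():
        dense[index] = value
    return dense


def token_ids(tokens, vocab):
    """Embedding-layer style input: one integer id per token, in order."""
    return [vocab[t] for t in tokens if t in vocab]


tokens = "the cat sat on the mat".split()
sparse = bag_of_words(tokens, vocab)   # size grows with the number of distinct tokens
dense = to_dense(sparse, len(vocab))   # size grows with the vocabulary
ids = token_ids(tokens, vocab)         # size grows with the length of the text

print(sparse)
print(dense)
print(ids)
```

With a realistic vocabulary the dense bag-of-words vector would be almost entirely zeros, whereas the id sequence stays as short as the text itself.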

However, when working with text models, I ran into the following issues.

String input/output is not supported at all in TensorFlowTransform. TensorFlowSharp also has limited functionality in this regard.
The models that are not based on string inputs consist of two sets of resources:
1. Model file
2. Text resources, such as a dictionary to convert text items (words, characters) into a vector of integers.
The conversion is a pre-processing step, so we need to find a way to convert text items into a vector of integers. I tried using TermLookupTransform and TermTransform, but neither worked.
For models that accept fixed-length text input, we need to find a way to trim and pad vectors so that an appropriately sized vector can be passed to TensorFlow. Variable-sized inputs should not be an issue.
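
The pre-processing described above (dictionary lookup followed by padding/trimming) can be sketched roughly as follows. This is plain Python rather than an ML.NET transform. The dictionary contents, the `PAD_ID` and `UNK_ID` values, and the helper names `encode` and `pad_or_trim` are assumptions for illustration; the actual ids depend on the text resources shipped with a given model.

```python
# Assumed reserved ids; a real model's dictionary defines its own conventions.
PAD_ID = 0  # hypothetical id used to fill unused positions
UNK_ID = 1  # hypothetical id for tokens missing from the dictionary


def encode(tokens, dictionary, unk_id=UNK_ID):
    """Map each text item to its integer id, falling back to unk_id."""
    return [dictionary.get(token, unk_id) for token in tokens]


def pad_or_trim(ids, length, pad_id=PAD_ID):
    """Return exactly `length` ids: trim the tail or pad on the right."""
    return ids[:length] + [pad_id] * max(0, length - len(ids))


# Hypothetical dictionary standing in for the model's text resource.
dictionary = {"this": 2, "film": 3, "was": 4, "great": 5}

# Shorter than the fixed length: padded with PAD_ID.
short = pad_or_trim(encode("this film was great".split(), dictionary), 6)

# Longer than the fixed length: trimmed ("really" would map to UNK_ID).
long = pad_or_trim(encode("this film was really great".split(), dictionary), 3)

print(short)
print(long)
```

Whether to pad on the left or the right, and whether to trim from the head or the tail, depends on how the model was trained; right-padding and tail-trimming are just one choice here.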

I currently don't see any issue with retrieving outputs from TensorFlow. I will write more if I encounter other issues.

zeahmed commented 6 years ago

TermLookupTransform does not seem to operate on vectors, while TermTransform outputs a Key type, which currently cannot be used in TensorFlowTransform.

asthana86 commented 5 years ago

This issue currently blocks the UI tooling for ML.NET. Can it be addressed sooner to unblock the tooling work?

zeahmed commented 5 years ago

@asthana86, this issue cannot be closed yet because it requires a few more features to be developed in ML.NET, such as a padding and trimming transform. Can you please let me know how this is blocking the tooling work? I may be able to help unblock you.

Ivanidzo4ka commented 5 years ago

Text usage would be handled here: https://github.com/dotnet/machinelearning/issues/2545

zeahmed commented 5 years ago

The following examples show use cases for text classification and string input/output. Most of the points raised in this issue are now covered, so I am closing it.

https://github.com/dotnet/machinelearning/blob/51b10fc6a004b08b3a07d046db10fa277a8cffac/docs/samples/Microsoft.ML.Samples/Dynamic/TensorFlow/TextClassification.cs#L8

https://github.com/dotnet/machinelearning/blob/51b10fc6a004b08b3a07d046db10fa277a8cffac/test/Microsoft.ML.Tests/ScenariosWithDirectInstantiation/TensorflowTests.cs#L1088

dotnet / machinelearning

Investigate using text and sparse input in TensorFlow #747