dotnet / machinelearning

ML.NET is an open source and cross-platform machine learning framework for .NET.
https://dot.net/ml
MIT License

Investigate using text and sparse input in TensorFlow #747

Closed by yaeldekel 5 years ago

yaeldekel commented 6 years ago

We should know how TF handles text inputs, and whether it supports sparse inputs.

zeahmed commented 6 years ago

Most text models in TensorFlow (and on DNN platforms in general) use an embedding layer to handle text. This contrasts with the bag-of-words approach, where a vector is formed over the words/characters in the text: the indices of the vector refer to the words/characters, and the values are the TF, TF-IDF, or other scores computed for them.

The bag-of-words model requires vectors in sparse format because the number of words/characters appearing in the text is very large. With embedding layers, however, sparse format is not needed because the input to an embedding layer is typically not that large. So we are fine with dense format.
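To make the contrast concrete, here is a minimal Python sketch of the two input representations. It is illustrative only and uses neither TensorFlow nor ML.NET; the toy vocabulary `VOCAB` and the helpers `bag_of_words` and `token_ids` are hypothetical names, and whitespace tokenization is an assumption.

```python
from collections import Counter

# Hypothetical toy vocabulary; a real one would typically be far larger.
VOCAB = {"<pad>": 0, "<unk>": 1, "the": 2, "movie": 3, "was": 4, "great": 5, "bad": 6}


def _ids(text):
    """Map whitespace-separated tokens to vocabulary indices (unknown -> <unk>)."""
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in text.lower().split()]


def bag_of_words(text):
    """Sparse bag-of-words: {vocab index: term count}, storing only nonzero entries.

    The full vector has len(VOCAB) slots, which is why a sparse format is needed
    when the vocabulary is large.
    """
    return dict(sorted(Counter(_ids(text)).items()))


def token_ids(text):
    """Dense sequence of vocabulary indices, the usual input to an embedding layer.

    Its length is the number of tokens in the text, not the vocabulary size.
    """
    return _ids(text)


sentence = "The movie was great great"
sparse = bag_of_words(sentence)  # {2: 1, 3: 1, 4: 1, 5: 2}
dense = token_ids(sentence)      # [2, 3, 4, 5, 5]
```

The bag-of-words vector grows with the vocabulary, while the token-id sequence grows only with the length of the text, which is why a dense format is adequate for models with an embedding layer.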

However, when working with text models, I found the following issues.

I currently don't see any issue with retrieving outputs from TensorFlow. I will write more if I encounter other issues.

zeahmed commented 6 years ago

The TermlookupTransform does not seem to operate on vectors, while TermTransform outputs a Key type, which cannot currently be used in TensorflowTransform.

asthana86 commented 5 years ago

This issue currently blocks the UI tooling for ML.NET. Can this issue be addressed sooner to unblock tooling work?

zeahmed commented 5 years ago

@asthana86, this issue cannot be closed yet because it requires a few more features to be developed in ML.NET, such as padding and trimming transforms. Can you please let me know how this is blocking tooling work? I may be able to help unblock you.
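For readers unfamiliar with the padding and trimming step mentioned above: models with a fixed-shape input typically expect every token-id sequence to have the same length. The Python sketch below shows the idea only. It is not ML.NET code, and `pad_or_trim`, `pad_id`, and the choice to pad on the right are assumptions made for illustration.

```python
def pad_or_trim(ids, length, pad_id=0):
    """Return exactly `length` ids: trim long inputs, right-pad short ones.

    `pad_id` is assumed to be the vocabulary index reserved for padding.
    """
    trimmed = list(ids[:length])
    return trimmed + [pad_id] * (length - len(trimmed))


short = pad_or_trim([2, 3, 4], 5)            # padded up to length 5
long = pad_or_trim([2, 3, 4, 5, 5, 6], 4)    # trimmed down to length 4
```

Whether to pad on the left or the right, and which end to trim, depends on the model being fed; the sketch picks one convention arbitrarily.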

Ivanidzo4ka commented 5 years ago

Text usage would be handled here: https://github.com/dotnet/machinelearning/issues/2545

zeahmed commented 5 years ago

The following examples show use cases for text classification and string input/output. Most of the points raised in this issue are now covered, so I am closing it.

https://github.com/dotnet/machinelearning/blob/51b10fc6a004b08b3a07d046db10fa277a8cffac/docs/samples/Microsoft.ML.Samples/Dynamic/TensorFlow/TextClassification.cs#L8

https://github.com/dotnet/machinelearning/blob/51b10fc6a004b08b3a07d046db10fa277a8cffac/test/Microsoft.ML.Tests/ScenariosWithDirectInstantiation/TensorflowTests.cs#L1088