Text transformation (Please review this one LAST)

georgian-io-archive / foreshadow

An automatic machine learning system

Apache License 2.0


The latest change adds a TruncatedSVD step to the text transformation pipeline. The tricky part is deciding how many components to use. For now, the value is set from the following:

- The number of non-text features.
- A default value of 20. (This is arbitrary; we can discuss further if needed.)
- The number of features in the TF-IDF output minus 1. The SVD code requires this: the number of components must be smaller than the number of input features, and the input here is the sparse matrix produced by TF-IDF.
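To make the discussion concrete, here is a minimal sketch of one way the three values above could be combined. It is not the code in this PR: the rule of taking the smallest candidate, and the names `choose_n_components`, `reduce_text_features`, and `DEFAULT_N_COMPONENTS`, are assumptions made for illustration only.

```python
DEFAULT_N_COMPONENTS = 20  # the arbitrary default mentioned above


def choose_n_components(n_non_text_features, n_tfidf_features,
                        default=DEFAULT_N_COMPONENTS):
    """Pick an SVD size from the three candidates (assumed rule: the smallest)."""
    if n_tfidf_features < 2:
        raise ValueError("TF-IDF output needs at least 2 features for SVD")
    # Upper bound: the component count must stay below the TF-IDF feature count.
    candidates = [default, n_tfidf_features - 1]
    # Assumption: skip the non-text count when there are no non-text features.
    if n_non_text_features > 0:
        candidates.append(n_non_text_features)
    return min(candidates)


def reduce_text_features(docs, n_non_text_features):
    """Hypothetical helper: TF-IDF followed by TruncatedSVD on raw documents."""
    # Imported here so choose_n_components stays usable without scikit-learn.
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    tfidf = TfidfVectorizer().fit_transform(docs)  # sparse (n_docs, n_terms)
    k = choose_n_components(n_non_text_features, tfidf.shape[1])
    return TruncatedSVD(n_components=k).fit_transform(tfidf)
```

Under this assumed rule, a dataset with 5 non-text columns and a 1,000-term vocabulary would get 5 components, while a tiny 10-term vocabulary would be capped at 9 regardless of the other two values.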

As for performance, I tested on the 20 Newsgroups data described here: https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html With Foreshadow run for only 1 minute, it reaches a higher classification accuracy than the naive Bayes (NB) model but a lower one than the SVM. If there is interest in more tests, I can let TPOT run longer. Let me know your thoughts.


Text transformation (Please review this one LAST) #210
