Original issue 104 created by ClearTK on 2009-08-05T15:37:07.000Z:
The following is from a posting by Olivier Grisel. This is something we should consider learning about and working on.
I wondered if you were aware of the recent developments around sparsity-preserving feature-space dimensionality reduction based on hash functions, a.k.a. the hashing trick:
http://hunch.net/~jl/projects/hash_reps/
All three of the papers mentioned there are worth reading in order; the latest one is the best fit for a ClearTK implementation but lacks the technical details of the first two. The most interesting point, in my opinion, is that it makes it possible to drop the requirement of maintaining a huge vocabulary mapping in memory when using bag-of-words feature extraction.
I think feature hashing would be a natural reusable component for the ClearTK project to provide as a preprocessing step for ML input.
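To make the idea concrete, here is a minimal sketch of the hashing trick in Java: each feature string is hashed directly to a bucket index, so no vocabulary map is ever built. The class name, bucket count, and use of `String.hashCode` are illustrative assumptions, not part of any ClearTK API (the papers typically use a stronger hash such as MurmurHash, and a second hash for the sign to keep inner products unbiased under collisions):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HashingVectorizer {

    // Fixed output dimensionality; 2^18 buckets is an arbitrary illustrative choice.
    static final int NUM_BUCKETS = 1 << 18;

    // Map a bag of word features to a sparse vector (bucket index -> weight)
    // without ever consulting or building a vocabulary map.
    public static Map<Integer, Double> vectorize(List<String> words) {
        Map<Integer, Double> vector = new HashMap<>();
        for (String word : words) {
            // First hash chooses the bucket.
            int index = Math.floorMod(word.hashCode(), NUM_BUCKETS);
            // Second (salted) hash chooses the sign, so that colliding
            // features tend to cancel rather than systematically inflate counts.
            double sign = ((("sign:" + word).hashCode() & 1) == 0) ? 1.0 : -1.0;
            vector.merge(index, sign, Double::sum);
        }
        return vector;
    }

    public static void main(String[] args) {
        Map<Integer, Double> v = vectorize(Arrays.asList("the", "cat", "the"));
        System.out.println(v.size() + " nonzero entries");
    }
}
```

Repeated words accumulate weight in the same bucket, and memory use is bounded by the number of distinct buckets touched rather than by the vocabulary size, which is the property the posting highlights.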