elastic / ml-cpp

Machine learning C++ code

[ML] Use hashing for categorical data #2199

Open valeriy42 opened 2 years ago

valeriy42 commented 2 years ago

The model inference definition can reveal personally identifiable information (PII) that appears in its categorical encoding maps. This is usually not a problem, since the access permissions for reviewing model definitions are the same as for reviewing the training datasets in which this PII occurred.

However, there is no reason to store the original categorical strings in the model. The learning algorithm only needs a distinct representation of each category, and a cryptographic hash function can produce one.

Note that the encodings need to be unique only within the same feature, which reduces the complexity required of the hash function.
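
To illustrate the per-feature uniqueness point, here is a minimal Python sketch (not the ml-cpp implementation). It assumes a truncated SHA-256 digest is acceptable as the key as long as no two categories of the same feature collide. The names `hash_category`, `build_encoding_map`, and the 8-byte truncation are hypothetical choices made for this example.

```python
import hashlib


def hash_category(value: str, digest_bytes: int = 8) -> str:
    """Return the first `digest_bytes` bytes of the SHA-256 digest as hex."""
    full_hex = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return full_hex[: 2 * digest_bytes]


def build_encoding_map(categories, digest_bytes: int = 8) -> dict:
    """Build a hashed-category -> integer-code map for ONE feature.

    Only the hashed keys end up in the returned map. The `seen` dict holds
    the original strings during construction solely to detect collisions
    and is discarded afterwards.
    """
    encoding = {}
    seen = {}
    for category in categories:
        key = hash_category(category, digest_bytes)
        if key in seen:
            if seen[key] != category:
                # Two different strings of this feature share a key.
                raise ValueError("hash collision within feature")
            continue  # same category seen again, nothing to add
        seen[key] = category
        encoding[key] = len(encoding)
    return encoding


# Example: three values, two distinct categories.
emails = ["alice@example.com", "bob@example.com", "alice@example.com"]
email_encoding = build_encoding_map(emails)
```

Because uniqueness is checked per feature, a collision between values of two different features does not matter, which is why a shortened digest may be enough here.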

valeriy42 commented 2 years ago

Both Google and Facebook use a cryptographic hash function (CHF) to encode PII. Google suggests using SHA-256 with a salt, while Facebook does not explicitly name the CHF algorithm.
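
As a rough sketch of what "SHA-256 with a salt" could look like, the Python snippet below prepends a random salt to the value before hashing. Prepending the salt, the 16-byte salt length, and the helper names `make_salt` and `salted_sha256` are assumptions made for this example; they are not taken from either company's guidance.

```python
import hashlib
import secrets


def make_salt(num_bytes: int = 16) -> bytes:
    """Generate a random salt (assumed to be kept with the model owner)."""
    return secrets.token_bytes(num_bytes)


def salted_sha256(value: str, salt: bytes) -> str:
    """Return the hex SHA-256 digest of salt + UTF-8 encoded value."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


# The same salt must be reused at training and inference time,
# otherwise the same category would map to different digests.
salt = make_salt()
digest = salted_sha256("alice@example.com", salt)
```

The salt makes it harder to recover a category by hashing a list of guessed values, provided the salt itself is not stored alongside the hashed categories.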

valeriy42 commented 2 years ago

One could use a header-only C++ library for SHA-256 generation.