elastic / ember

Elastic Malware Benchmark for Empowering Researchers

Bug fix: FeatureHasher’s transform expects a list of list of strings #109

Open PFGimenez opened 1 year ago

PFGimenez commented 1 year ago

A recent version of scikit-learn added a check on the input data passed to FeatureHasher's "transform" method. The details are in this pull request: https://github.com/scikit-learn/scikit-learn/pull/25094.

This check fails when transform is invoked by ember on this line (in ember/features.py, line 192): entry_name_hashed = FeatureHasher(50, input_type="string").transform([raw_obj['entry']]).toarray()[0] because [raw_obj['entry']] is a list of strings, not a list of lists of strings.

This pull request changes this call to transform by wrapping the argument in one more list, so that [raw_obj['entry']] becomes [[raw_obj['entry']]]. I am not sure of the soundness of my fix, so I encourage the reviewer to have a deeper look.
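
For reviewers, here is a minimal sketch of the difference between the two calls. It is not the actual ember code: the raw_obj dictionary and its ".text" value are hypothetical stand-ins, and whether the first call raises depends on the installed scikit-learn version.

```python
from sklearn.feature_extraction import FeatureHasher

# Hypothetical stand-in for the parsed features that ember passes around.
raw_obj = {"entry": ".text"}

hasher = FeatureHasher(n_features=50, input_type="string")

# Original call: the single sample is a bare string.
# scikit-learn versions that include the new input check reject this.
try:
    hasher.transform([raw_obj["entry"]])
    rejected = False
except ValueError:
    rejected = True

# Proposed call: one sample containing one string token.
entry_name_hashed = hasher.transform([[raw_obj["entry"]]]).toarray()[0]
print(rejected, entry_name_hashed.shape)
```

One point that may bear on the soundness question: as far as I can tell, older scikit-learn versions iterated over a bare string character by character, whereas the wrapped call hashes the whole entry name as a single token. If that is right, the new feature values could differ from those extracted previously, which is worth checking against existing datasets and models.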

PFGimenez commented 1 year ago

Update: I see that a similar PR has already been proposed… feel free to close mine.