elastic / ember

Elastic Malware Benchmark for Empowering Researchers

create_vectorized_features error #103

Open MLFlexer opened 1 year ago

MLFlexer commented 1 year ago

I have problems running the following commands in Python:

import ember
ember.create_vectorized_features("/data/ember2018/")

I have installed the dependencies and tried in Docker with lief versions 0.9.0 and 0.10.1, and I still get the same failure:

ember.create_vectorized_features("./ember/")
Vectorizing training set
  0%|                                                                                    | 0/900000 [00:00<?, ?it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/opt/conda/lib/python3.8/site-packages/ember-0.1.0-py3.8.egg/ember/__init__.py", line 44, in vectorize_unpack
    return vectorize(*args)
  File "/opt/conda/lib/python3.8/site-packages/ember-0.1.0-py3.8.egg/ember/__init__.py", line 31, in vectorize
    feature_vector = extractor.process_raw_features(raw_features)
  File "/opt/conda/lib/python3.8/site-packages/ember-0.1.0-py3.8.egg/ember/features.py", line 552, in process_raw_features
    feature_vectors = [fe.process_raw_features(raw_obj[fe.name]) for fe in self.features]
  File "/opt/conda/lib/python3.8/site-packages/ember-0.1.0-py3.8.egg/ember/features.py", line 552, in <listcomp>
    feature_vectors = [fe.process_raw_features(raw_obj[fe.name]) for fe in self.features]
  File "/opt/conda/lib/python3.8/site-packages/ember-0.1.0-py3.8.egg/ember/features.py", line 192, in process_raw_features
    entry_name_hashed = FeatureHasher(50, input_type="string").transform([raw_obj['entry']]).toarray()[0]
  File "/opt/conda/lib/python3.8/site-packages/sklearn/utils/_set_output.py", line 142, in wrapped
    data_to_wrap = f(self, X, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/sklearn/feature_extraction/_hash.py", line 170, in transform
    raise ValueError(
ValueError: Samples can not be a single string. The input must be an iterable over iterables of strings.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/lib/python3.8/site-packages/ember-0.1.0-py3.8.egg/ember/__init__.py", line 75, in create_vectorized_features
  File "/opt/conda/lib/python3.8/site-packages/ember-0.1.0-py3.8.egg/ember/__init__.py", line 60, in vectorize_subset
  File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/opt/conda/lib/python3.8/multiprocessing/pool.py", line 868, in next
    raise value
ValueError: Samples can not be a single string. The input must be an iterable over iterables of strings.
>>>

It seems from the error message that the input is not in the format the vectorizer expects. Is there a fix for this?

birkj commented 1 year ago

I have the same problem. @mrphilroth is this a common problem?

AhlemRn commented 1 year ago

I have the same problem. If you have fixed it, please tell me how.

MLFlexer commented 1 year ago

I have the same problem. If you have fixed it, please tell me how.

I have not been able to find a fix for this yet, although I have not spent a lot of time on it.

keremgirenes commented 1 year ago

I had the same issue. I downgraded Python to 3.6 in my environment and it worked like a charm.

gparrella12 commented 1 year ago

A way to fix it is to replace:

entry_name_hashed = FeatureHasher(50, input_type="string").transform([raw_obj['entry']]).toarray()[0]

with:

entry_name_hashed = FeatureHasher(50, input_type="string").transform([ [raw_obj['entry']] ]).toarray()[0]

in features.py at line 192. This way, an iterable over iterables of raw features is passed, as the transform() method requires.
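
For anyone who wants to sanity-check the patched call outside of ember, here is a minimal sketch; it assumes only a recent scikit-learn, and raw_obj is a hypothetical stand-in for the raw-features dict built in features.py:

from sklearn.feature_extraction import FeatureHasher

raw_obj = {'entry': '.text'}  # hypothetical example value for the entry name

# The original call passed [raw_obj['entry']], i.e. a list containing one bare
# string, which newer scikit-learn rejects with
# "Samples can not be a single string."

# Wrapping the entry name in an extra list gives transform() what it expects:
# an iterable over iterables of strings.
entry_name_hashed = FeatureHasher(50, input_type="string").transform(
    [[raw_obj['entry']]]
).toarray()[0]

print(entry_name_hashed.shape)  # (50,)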

maciejskorski commented 1 year ago

Same problem. I started a fork to curate this repo. Also, my PR #108 fixes the issue.

KSroido commented 8 months ago

Downgrading to Python 3.6 will easily solve it.

mdaument commented 3 months ago

A way to fix it is to replace:

entry_name_hashed = FeatureHasher(50, input_type="string").transform([raw_obj['entry']]).toarray()[0]

with:

entry_name_hashed = FeatureHasher(50, input_type="string").transform([ [raw_obj['entry']] ]).toarray()[0]

in features.py at line 192. This way, an iterable over iterables of raw features is passed, as the transform() method requires.

Can anyone provide insight into what the intended output of the entry name hash table is supposed to be?

Using it the way it's written with Python 3.6 or earlier, the FeatureHasher hashes each character in the entry string. For example, if .text is the entry point, there are 4 bins populated in the returned hash table.

Using the fixed version, the FeatureHasher hashes the entire string, so an entry point string of .text will return a hash table with only one bin populated.

In the grand scheme of the model, I don't know if either way has much of an impact, but it would be good to know if the authors intended the hash table to be one way or the other.
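
Here is a minimal sketch (not from ember) that reproduces the two behaviors described above on a recent scikit-learn, for anyone who wants to compare them directly; it assumes only that scikit-learn is installed:

from sklearn.feature_extraction import FeatureHasher

hasher = FeatureHasher(50, input_type="string")

# Pre-fix behavior on older scikit-learn: the bare string ".text" was treated
# as the sample, so transform() iterated over its characters. Passing the
# characters explicitly reproduces that result on a modern version.
per_char = hasher.transform([list(".text")]).toarray()[0]

# Post-fix behavior: the whole entry name is hashed as a single token.
per_string = hasher.transform([[".text"]]).toarray()[0]

print((per_char != 0).sum())    # e.g. 4 bins populated ('t' appears twice)
print((per_string != 0).sum())  # 1 bin populated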