elastic / ember

Elastic Malware Benchmark for Empowering Researchers

Problem when ember.create_vectorized_features(data_dir) #52

Closed wilsonalberto-git closed 4 years ago

wilsonalberto-git commented 4 years ago

Hello, I am using the ember-2018 data set. When I try to create the vectorized features, I get an error: KeyError: 'datadirectories'

Vectorizing training set 0%| | 0/900000 [00:00<?, ?it/s]

RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
  File "/anaconda/envs/azureml_py36/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/anaconda/envs/azureml_py36/lib/python3.6/site-packages/ember/__init__.py", line 44, in vectorize_unpack
    return vectorize(*args)
  File "/anaconda/envs/azureml_py36/lib/python3.6/site-packages/ember/__init__.py", line 31, in vectorize
    feature_vector = extractor.process_raw_features(raw_features)
  File "/anaconda/envs/azureml_py36/lib/python3.6/site-packages/ember/features.py", line 531, in process_raw_features
    feature_vectors = [fe.process_raw_features(raw_obj[fe.name]) for fe in self.features]
  File "/anaconda/envs/azureml_py36/lib/python3.6/site-packages/ember/features.py", line 531, in <listcomp>
    feature_vectors = [fe.process_raw_features(raw_obj[fe.name]) for fe in self.features]
KeyError: 'datadirectories'
"""

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)

in
----> 1 ember.create_vectorized_features(data_dir, 2)
      2 ember.create_metadata(data_dir)

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/ember/__init__.py in create_vectorized_features(data_dir, feature_version)
     73     raw_feature_paths = [os.path.join(data_dir, "train_features_{}.jsonl".format(i)) for i in range(6)]
     74     nrows = sum([1 for fp in raw_feature_paths for line in open(fp)])
---> 75     vectorize_subset(X_path, y_path, raw_feature_paths, extractor, nrows)
     76
     77     print("Vectorizing test set")

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/ember/__init__.py in vectorize_subset(X_path, y_path, raw_feature_paths, extractor, nrows)
     58     argument_iterator = ((irow, raw_features_string, X_path, y_path, extractor, nrows)
     59                          for irow, raw_features_string in enumerate(raw_feature_iterator(raw_feature_paths)))
---> 60     for _ in tqdm.tqdm(pool.imap_unordered(vectorize_unpack, argument_iterator), total=nrows):
     61         pass
     62

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/tqdm/std.py in __iter__(self)
   1128
   1129         try:
-> 1130             for obj in iterable:
   1131                 yield obj
   1132                 # Update and possibly print the progressbar.

/anaconda/envs/azureml_py36/lib/python3.6/multiprocessing/pool.py in next(self, timeout)
    733         if success:
    734             return value
--> 735         raise value
    736
    737     __next__ = next                    # XXX

KeyError: 'datadirectories'

**Requirements seem to be installed correctly:**

pip install -r requirements.txt
Requirement already satisfied: lief>=0.9.0 in /anaconda/envs/azureml_py36/lib/python3.6/site-packages (from -r requirements.txt (line 1)) (0.9.0)
Requirement already satisfied: tqdm>=4.31.0 in /anaconda/envs/azureml_py36/lib/python3.6/site-packages (from -r requirements.txt (line 2)) (4.48.0)
Requirement already satisfied: numpy>=1.16.3 in /anaconda/envs/azureml_py36/lib/python3.6/site-packages (from -r requirements.txt (line 3)) (1.16.6)
Requirement already satisfied: pandas>=0.24.2 in /anaconda/envs/azureml_py36/lib/python3.6/site-packages (from -r requirements.txt (line 4)) (1.1.0)
Requirement already satisfied: lightgbm>=2.2.3 in /anaconda/envs/azureml_py36/lib/python3.6/site-packages (from -r requirements.txt (line 5)) (2.3.0)
Requirement already satisfied: scikit-learn>=0.20.3 in /anaconda/envs/azureml_py36/lib/python3.6/site-packages (from -r requirements.txt (line 6)) (0.20.3)
Requirement already satisfied: pytz>=2017.2 in /anaconda/envs/azureml_py36/lib/python3.6/site-packages (from pandas>=0.24.2->-r requirements.txt (line 4)) (2019.3)
Requirement already satisfied: python-dateutil>=2.7.3 in /anaconda/envs/azureml_py36/lib/python3.6/site-packages (from pandas>=0.24.2->-r requirements.txt (line 4)) (2.8.1)
Requirement already satisfied: scipy in /anaconda/envs/azureml_py36/lib/python3.6/site-packages (from lightgbm>=2.2.3->-r requirements.txt (line 5)) (1.4.1)
Requirement already satisfied: six>=1.5 in /anaconda/envs/azureml_py36/lib/python3.6/site-packages (from python-dateutil>=2.7.3->pandas>=0.24.2->-r requirements.txt (line 4)) (1.12.0)

**Thanks, appreciate your help**

wilsonalberto-git commented 4 years ago

The issue was fixed after I unzipped the ember-2018 file, and it is working now.
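
For anyone hitting the same KeyError, a quick pre-check of the extracted files may help narrow things down before starting the long vectorization run. The sketch below is not part of ember: `find_records_missing_keys` and `train_feature_paths` are hypothetical helper names, and the only key checked by default is `datadirectories`, taken from the traceback above. The file naming (`train_features_0.jsonl` through `train_features_5.jsonl`) follows the `create_vectorized_features` source shown in the traceback. It assumes each line of a feature file is one JSON object.

```python
import json
import os


def train_feature_paths(data_dir):
    """Build the six training file paths, mirroring the naming in the traceback."""
    return [os.path.join(data_dir, "train_features_{}.jsonl".format(i)) for i in range(6)]


def find_records_missing_keys(jsonl_paths, required_keys=("datadirectories",)):
    """Return a list of (path, line_number, missing_keys) for problematic records.

    Hypothetical helper, not an ember API. A line that is not valid JSON, or is
    not a JSON object, is reported with a placeholder instead of key names.
    """
    problems = []
    for path in jsonl_paths:
        with open(path, "r") as handle:
            for line_number, line in enumerate(handle, start=1):
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    problems.append((path, line_number, ["<invalid JSON>"]))
                    continue
                if not isinstance(record, dict):
                    problems.append((path, line_number, ["<not a JSON object>"]))
                    continue
                missing = [key for key in required_keys if key not in record]
                if missing:
                    problems.append((path, line_number, missing))
    return problems
```

Calling `find_records_missing_keys(train_feature_paths(data_dir))` returns an empty list when every training record carries the checked key. A non-empty result would suggest that the files in `data_dir` are not what the feature extractor expects, in which case extracting the archive again, as described above, is a reasonable first step.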