elastic / ember

Elastic Malware Benchmark for Empowering Researchers

python3.5 #33

Closed Natruel closed 3 years ago

Natruel commented 4 years ago

Can the code run under Python 3.5? When I tried it, it reported that ember needs lief 0.8.3, while requirements.txt specifies "lief = 0.9.0".

mrphilroth commented 4 years ago

The error you are referring to is displayed when you have lief version 0.9.0 installed, but are trying to generate ember version 1 features. You need lief version 0.8.3 if you want to generate ember version 1 features. But if you just want to work with the latest feature set (ember version 2 features), then you can stick with lief version 0.9.0.
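The pairing described above can be written down as a small check. This is only a sketch: `REQUIRED_LIEF` and `lief_matches` are hypothetical names, not part of ember, and the table simply restates the rule from this comment. ember's own check may be implemented differently.

```python
# Hypothetical lookup table restating the rule above:
# feature version 1 expects lief 0.8.3, feature version 2 expects lief 0.9.0.
REQUIRED_LIEF = {1: "0.8.3", 2: "0.9.0"}


def lief_matches(feature_version, lief_version):
    """Return True if lief_version is the release expected for feature_version."""
    expected = REQUIRED_LIEF[feature_version]
    # Compare by prefix, because a version string may carry a build suffix.
    return lief_version.startswith(expected)


print(lief_matches(1, "0.9.0"))  # False: version 1 features with lief 0.9.0
print(lief_matches(2, "0.9.0"))  # True: version 2 features with lief 0.9.0
```

In practice the second argument would be the installed version string, for example `lief.__version__`.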

I believe that ember can be run under python 3.5, but I haven't tried it. You will only run into trouble with f-strings if you hit the lief error I mention above.

Natruel commented 4 years ago

Thanks for your reply! Do you mean that the default value of feature_version is 2 when I run "train_ember"? I ask because when I read the code, I found this:

```python
# feature_version defaults to 2 when it is not specified
def __init__(self, feature_version=2):

    self.features = [
        ByteHistogram(),
        ByteEntropyHistogram(),
        StringExtractor(),
        GeneralFileInfo(),
        HeaderFileInfo(),
        SectionInfo(),
        ImportsInfo(),
        ExportsInfo()
    ]
```

But when I run the code with the default value, without specifying the feature_version parameter, I still get the error I mentioned in my question. So maybe there is some other reason? Or perhaps I should just run it under Python 3.6. I will try that and report the result.

I also have another question. When I run `ember.create_vectorized_features("D:\study\untitled1\ember_data")` on Windows, I get an error that looks related to multiprocessing:

```
  0%|          | 0/900000 [00:00<?, ?it/s]multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "D:\study\anaconda\lib\multiprocessing\pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "D:\study\anaconda\ember\__init__.py", line 44, in vectorize_unpack
    return vectorize(*args)
  File "D:\study\anaconda\ember\__init__.py", line 31, in vectorize
    feature_vector = extractor.process_raw_features(raw_features)
  File "D:\study\anaconda\ember\features.py", line 522, in process_raw_features
    feature_vectors = [fe.process_raw_features(raw_obj[fe.name]) for fe in self.features]
  File "D:\study\anaconda\ember\features.py", line 522, in <listcomp>
    feature_vectors = [fe.process_raw_features(raw_obj[fe.name]) for fe in self.features]
KeyError: 'datadirectories'
"""
```

After reading the code, I commented out:

```python
for _ in tqdm.tqdm(pool.imap_unordered(vectorize_unpack, argument_iterator), total=nrows):
    pass
```

(this is the code in `__init__.py`, lines 62 and 63) because I thought it was unnecessary and only displayed a progress bar. After that, the code runs, but when I run the following:

```python
import ember


def test():
    # Raw string so the backslashes in the Windows path are not read as escape sequences
    data_dir = r"D:\study\untitled1\ember_data"
    ember.create_vectorized_features(data_dir)
    X_train, y_train, X_test, y_test = ember.read_vectorized_features(data_dir)
    print(X_train)


if __name__ == '__main__':
    test()
```

all of the values are 0, and I don't know what is wrong.

Natruel commented 4 years ago

The reason all of the values are 0 seems to be that the data is never read into the np.memmap array, but I still don't know what is wrong.
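One possible explanation, offered as an assumption rather than something confirmed in this thread: if the commented-out loop is the only place where the vectorization work is actually dispatched, then the .dat files get created but never filled. A file freshly created with `np.memmap` in mode "w+" reads back as all zeros until something writes into it. The sketch below shows that behavior with a tiny temporary file. The file name, shape, and the fill loop are stand-ins, not ember code.

```python
import os
import tempfile

import numpy as np

# Tiny stand-in for a vectorized feature file.
path = os.path.join(tempfile.mkdtemp(), "X_demo.dat")
shape = (4, 3)

# Mode "w+" allocates the file on disk; nothing has been written yet.
X = np.memmap(path, dtype=np.float32, mode="w+", shape=shape)
X.flush()
del X

# Reading it back before any writes: copy the contents into a regular array.
before = np.array(np.memmap(path, dtype=np.float32, mode="r", shape=shape))

# Stand-in for the per-row vectorization step that fills the file.
X = np.memmap(path, dtype=np.float32, mode="r+", shape=shape)
for irow in range(shape[0]):
    X[irow] = irow + 1
X.flush()
del X

after = np.array(np.memmap(path, dtype=np.float32, mode="r", shape=shape))

print(before.any())  # False: the freshly created file holds only zeros
print(after.any())   # True: values appear only after rows are written
```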

mrphilroth commented 4 years ago

The default value for feature version is 2.

The KeyError: 'datadirectories' error is solved here: https://github.com/endgameinc/ember/issues/28#issuecomment-523456373

All the downloads are listed here: https://github.com/endgameinc/ember/#download You must use one that has feature version 2 available in it. The original download from 1.5 years ago only has enough information available for feature version 1.
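A quick way to tell whether a raw feature file carries what feature version 2 needs is to look at the keys of one record. This is a minimal sketch, assuming each line of the raw feature file is one JSON object. `missing_feature_keys` is a hypothetical helper, not an ember function, and the sample record is made up for illustration. The only key taken from this thread is 'datadirectories', the one named in the KeyError.

```python
import json


def missing_feature_keys(raw_line, required_keys):
    """Return the required keys that are absent from one raw feature record."""
    raw_obj = json.loads(raw_line)
    return [key for key in required_keys if key not in raw_obj]


# Made-up record that lacks the 'datadirectories' entry.
sample_line = json.dumps({"histogram": [], "strings": {}})

print(missing_feature_keys(sample_line, ["histogram", "datadirectories"]))
# ['datadirectories']
```

A non-empty result for a record from the downloaded data would suggest that the download predates feature version 2.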

Natruel commented 4 years ago

I can now run the code and vectorize the data, but when I read the data I get an insufficient-memory error. After reading the code and looking up some information, it seems that np.memmap is meant to solve this problem, but it doesn't work for me, and I don't know how to make it work even after reading the official documentation. I have learned about several ways of reading large amounts of data, but I don't know how to read data in the "dat" format. I need this data to train other models, so I either have to read all of it or choose a part of it for training. Either way, I need to be able to read the data.

mrphilroth commented 4 years ago

The memmap function doesn't use your memory. This line will read the data into memory. That's probably where you're running into insufficient memory errors: https://github.com/endgameinc/ember/blob/master/ember/__init__.py#L211
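To train on part of the data without pulling everything into RAM, one option is to open the .dat file as a read-only memmap and copy out only the rows you need. The sketch below is not ember's own code. It builds a tiny temporary file so that it runs on its own, and the dtype, row count, and feature dimension are stand-in values that would have to match the real file.

```python
import os
import tempfile

import numpy as np

# Stand-in values; a real file needs its actual row count and feature dimension.
nrows, dim = 1000, 8
path = os.path.join(tempfile.mkdtemp(), "X_demo.dat")

# Build a small demo file in which every row holds its own row index.
X = np.memmap(path, dtype=np.float32, mode="w+", shape=(nrows, dim))
X[:] = np.arange(nrows, dtype=np.float32)[:, None]
X.flush()
del X

# Opening the memmap maps the file; it does not load it into memory.
X_disk = np.memmap(path, dtype=np.float32, mode="r", shape=(nrows, dim))

# Pick a random subset of rows; sorting keeps the disk reads in file order.
rng = np.random.default_rng(0)
rows = np.sort(rng.choice(nrows, size=100, replace=False))

# Only the selected rows are copied into an ordinary in-memory array.
X_subset = np.array(X_disk[rows])

print(X_subset.shape)  # (100, 8)
```

The matching labels can be selected the same way by indexing the label memmap with the same `rows`.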