elastic / ember

Elastic Malware Benchmark for Empowering Researchers

Extract Raw Features for Own Dataset #107

Closed nehege closed 1 year ago

nehege commented 1 year ago

This repository makes it easy to generate raw features and/or vectorized features from any PE file. Researchers can implement their own features, or even vectorize the existing features differently from the existing implementations.

Could you please provide the steps or requirements for extracting raw features from a different dataset? I'd like to create a .jsonl file (see cropped image) for my own dataset, but I am struggling to extract some specific fields, such as the histogram. Any suggestions or code samples would be great.

image

gxenos commented 1 year ago

What is the exact problem with the histogram?

The function that does what you are asking for is this: https://github.com/elastic/ember/blob/d97a0b523de02f3fe5ea6089d080abacab6ee931/ember/features.py#LL37C36-L37C36
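
For readers who cannot follow the link: the raw histogram feature is, as far as the linked file shows, a count of how often each of the 256 possible byte values occurs in the file. Below is a minimal standard-library sketch of that idea. The name `byte_histogram` is hypothetical and not part of ember; ember's own implementation is the one in the linked `features.py` and may differ in detail.

```python
def byte_histogram(bytez: bytes) -> list:
    """Count occurrences of each byte value (0..255) in the given bytes.

    Hypothetical helper for illustration only; not ember's API.
    """
    counts = [0] * 256
    for value in bytez:  # iterating over bytes yields ints in 0..255
        counts[value] += 1
    return counts


# Tiny stand-in for file contents; a real call would pass open(path, "rb").read().
sample = b"MZ\x00\x00MZ"
hist = byte_histogram(sample)
print(hist[0x4D], hist[0x5A], hist[0x00], sum(hist))  # counts for 'M', 'Z', NUL, total
```

The result always has 256 entries, one per byte value, regardless of the file size.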

nehege commented 1 year ago

I think I am missing something; maybe I am focusing on the wrong thing.

from ember import PEFeatureExtractor

extractor = PEFeatureExtractor()
extractor.raw_features('./files/13.exe')

I received an error:

Traceback (most recent call last):
  File "...\main.py", line 4, in <module>
    extractor.raw_features('./files/13.exe')
  File "...\.venv\lib\site-packages\ember\features.py", line 540, in raw_features
    lief_binary = lief.PE.parse(list(bytez))
TypeError: ['.', '/', 'f', 'i', 'l', 'e', 's', '/', '1', '3', '.', 'e', 'x', 'e']
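
The list of single characters in the `TypeError` is the path string itself: the traceback shows `raw_features` calling `list(bytez)` on its argument, and calling `list` on a `str` splits it into characters. A small sketch of the difference between a path string and file bytes (the four-byte value below is a made-up stand-in for file contents, not data from `13.exe`):

```python
path = "./files/13.exe"

# list() over a str yields one-character strings; this is what reached lief.PE.parse.
as_path_chars = list(path)

# list() over bytes yields integers in 0..255, which is what a byte-oriented parser can use.
as_byte_values = list(b"MZ\x90\x00")

print(as_path_chars[:4])
print(as_byte_values)
```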

nehege commented 1 year ago

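I had assumed the function reads the PE file from the given path itself. It does not: it expects the file's contents, so reading the file in binary mode and passing the bytes as the parameter fixes the error.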

from ember import PEFeatureExtractor

extractor = PEFeatureExtractor()

with open('files/13.exe', 'rb') as f:
    print(extractor.raw_features(f.read()))
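
To get from this single call to the .jsonl file asked about in the original question, one option is to write one JSON object per line for each file in the dataset. The sketch below is an assumption-laden illustration, not ember's own tooling: `write_raw_features_jsonl` is a hypothetical helper, the extractor is passed in as a plain callable, and it assumes the dictionary returned by `raw_features` can be serialized with `json.dumps`.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def write_raw_features_jsonl(paths: Iterable, out_path, extract: Callable[[bytes], dict]) -> int:
    """Write one JSON line per input file and return the number of lines written.

    Hypothetical helper for illustration; `extract` takes raw file bytes and
    returns a JSON-serializable dict.
    """
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            bytez = Path(path).read_bytes()  # pass contents, not the path
            out.write(json.dumps(extract(bytez)) + "\n")
            written += 1
    return written


# Intended use with ember (untested here; directory and output names are placeholders):
#   from ember import PEFeatureExtractor
#   extractor = PEFeatureExtractor()
#   exe_paths = sorted(Path("files").glob("*.exe"))
#   write_raw_features_jsonl(exe_paths, "features.jsonl", extractor.raw_features)
```

Passing the extractor as a callable keeps the file handling separate from ember, so a different feature function can be swapped in without changing the loop.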