elastic / ember

Elastic Malware Benchmark for Empowering Researchers

Define the information represented on the malware vector? #88

Open kevin3567 opened 2 years ago

kevin3567 commented 2 years ago

Hi,

I am wondering if there is a place where I can find what malware information is represented in the final 2381-length vector, and how it is laid out. For example:

byte_histogram = malware_vector[:256]
byte_entropy = malware_vector[256:512]
...

The reason is that I am trying to train a DNN, but the model performs very poorly (and the loss keeps becoming NaN). I am therefore trying to find the features causing the issue.

Thanks in advance.

lkurlandski commented 2 years ago

This is a good question.

Reading the source code, we can see that nine different types of features make up the entire vector. The length of each category can easily be determined by reading its class definition. Unfortunately, the order in which these features are arranged within the final vector depends on which version of Python you are using. Python 3.7+ specifies that iterating through a dict must follow insertion order, so the order can be determined by reading the source code. In earlier versions of Python, iteration order was an implementation-level detail, so the order cannot be relied upon.

Here is what I get from reading the code:

ByteHistogram ---------- [0000, 0256)
ByteEntropyHistogram --- [0256, 0512)
StringExtractor -------- [0512, 0616)
GeneralFileInfo -------- [0616, 0626)
HeaderFileInfo --------- [0626, 0688)
SectionInfo ------------ [0688, 0943)
ImportsInfo ------------ [0943, 2223)
ExportsInfo ------------ [2223, 2351)
DataDirectories -------- [2351, 2381)
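
The ranges in a listing like this are running sums of the per-group lengths. Below is a minimal sketch of that arithmetic, assuming the caller supplies the group order and lengths. `build_slices` is a hypothetical helper, not part of ember, and only the first three groups are used as an illustration.

```python
def build_slices(group_dims):
    """Map each group name to the half-open slice it occupies in the flat vector.

    group_dims: ordered list of (name, length) pairs; list order is layout order.
    """
    slices = {}
    start = 0
    for name, dim in group_dims:
        slices[name] = slice(start, start + dim)
        start += dim  # the next group begins where this one ends
    return slices


# Illustration only: the first three groups and their lengths.
example_dims = [
    ("ByteHistogram", 256),
    ("ByteEntropyHistogram", 256),
    ("StringExtractor", 104),
]
example_slices = build_slices(example_dims)

for name, s in example_slices.items():
    print(f"{name}: [{s.start:04d}, {s.stop:04d})")
```

Passing all nine (name, length) pairs and checking that the final `stop` equals the length of the vector is a cheap sanity check on the layout.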

naveennamani commented 2 years ago

Unfortunately, the order in which these features are arranged within the final vector depends on which version of Python you are using. Python 3.7+ specifies that iterating through a dict must follow insertion order, so the order can be determined by reading the source code.

If that is the case, there should be a note on this point in the README.

The following simple code can be used for separating the raw_features into their constituent parts.

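from ember.features import (
    ByteHistogram, ByteEntropyHistogram, StringExtractor, GeneralFileInfo,
    HeaderFileInfo, SectionInfo, ImportsInfo, ExportsInfo, DataDirectories,
)

# Insertion order must match the order used to build the vector (Python 3.7+).
features = {
    'ByteHistogram': ByteHistogram(),
    'ByteEntropyHistogram': ByteEntropyHistogram(),
    'StringExtractor': StringExtractor(),
    'GeneralFileInfo': GeneralFileInfo(),
    'HeaderFileInfo': HeaderFileInfo(),
    'SectionInfo': SectionInfo(),
    'ImportsInfo': ImportsInfo(),
    'ExportsInfo': ExportsInfo(),
    'DataDirectories': DataDirectories(),  # last group in the listing above
}
features_mapping = {}
feature_vector = []  # <-- load your feature vector here
offset = 0
for name, extractor in features.items():
    # Take the next extractor.dim values without modifying feature_vector.
    features_mapping[name] = feature_vector[offset:offset + extractor.dim]
    offset += extractor.dim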
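
Once the vector is split into named groups, each group can be scanned for values that might destabilize training. Non-finite or very large unscaled inputs are one plausible cause of a NaN loss. The sketch below is not part of ember: `find_suspect_values` and its `abs_limit` threshold are hypothetical, and the data is a toy stand-in for a real split vector.

```python
import math


def find_suspect_values(groups, abs_limit=1e6):
    """Return {group_name: [(index, value), ...]} for non-finite or very large values.

    groups: dict mapping a group name to a sequence of numbers.
    abs_limit: hypothetical threshold; magnitudes above it are flagged.
    """
    suspects = {}
    for name, values in groups.items():
        flagged = [
            (i, v) for i, v in enumerate(values)
            if not math.isfinite(v) or abs(v) > abs_limit
        ]
        if flagged:
            suspects[name] = flagged
    return suspects


# Toy data standing in for a split feature vector (not real EMBER values).
toy_groups = {
    "ByteHistogram": [0.1, 0.2, 0.3],
    "GeneralFileInfo": [1.0, 5e9, float("nan")],
}
report = find_suspect_values(toy_groups)
print(report)
```

Groups that show up in the report are candidates for scaling, clipping, or a log transform before being fed to the network.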

lkurlandski commented 2 years ago

Agreed, it has the potential to be problematic. Upon further research, it appears that CPython >= 3.6 maintains dict insertion order, although in 3.6 this is an implementation detail and only becomes a language guarantee in 3.7. Don't quote me.
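
One way to sidestep the question is to keep the layout in a structure whose order is explicit on every Python version, such as a list of pairs or an `OrderedDict`. The sketch below is an illustration under that assumption, not how ember itself stores its extractors. The names and lengths are the first three groups from the listing earlier in the thread.

```python
from collections import OrderedDict

# First three groups from the listing earlier in the thread, as (name, length) pairs.
LAYOUT = [
    ("ByteHistogram", 256),
    ("ByteEntropyHistogram", 256),
    ("StringExtractor", 104),
]

# An OrderedDict built from the list preserves that order regardless of dict internals.
ordered = OrderedDict(LAYOUT)
names_in_order = list(ordered)

# On Python 3.7+ a plain dict built the same way iterates in the same order.
plain = dict(LAYOUT)
same_order = list(plain) == names_in_order
print(names_in_order, same_order)
```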