dsp-uga / elizabeth

Scalable malware detection
MIT License

MultipleNGram Transformer #24

Open cbarrick opened 6 years ago

cbarrick commented 6 years ago

We want a MultipleNGram class. Here's my understanding of writing custom Transformers:

  1. Inherit pyspark.ml.Transformer.
  2. Override the _transform method.
  3. In our case, it must support the params inputCol and outputCol. These should be attributes of the object, each of type pyspark.ml.param.Param.

cbarrick commented 6 years ago

These might be useful:

I don't see documentation for them, but I saw them used in this SO answer: https://stackoverflow.com/questions/32331848/create-a-custom-transformer-in-pyspark-ml

cbarrick commented 6 years ago

Here's an example Transformer I wrote for the CNN:

from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import Param, HasInputCol, HasOutputCol
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType

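class HexToFloat(Transformer, HasInputCol, HasOutputCol):
    """Convert an array column of hex strings to floats, dividing each value by 255."""

    @keyword_only
    def __init__(self, inputCol='features', outputCol='transform'):
        super().__init__()
        # Redundant with the HasInputCol/HasOutputCol mixins, which already
        # declare both Params.
        self.inputCol = Param(self, 'inputCol', 'input column')
        self.outputCol = Param(self, 'outputCol', 'output column')
        self.setParams(inputCol=inputCol, outputCol=outputCol)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None):
        # keyword_only stores the keyword arguments in _input_kwargs.
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def _transform(self, df):
        in_col = self.getInputCol()
        out_col = self.getOutputCol()

        # Parse each hex string (e.g. 'ff') as an integer and divide by 255.
        # Named hex_bytes to avoid shadowing the built-in `bytes`.
        def to_floats(hex_bytes):
            return [int(b, 16) / 255 for b in hex_bytes]

        to_floats_udf = udf(to_floats, ArrayType(FloatType()))
        return df.withColumn(out_col, to_floats_udf(df[in_col]))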
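
For the MultipleNGram itself, the per-row logic could follow the same pattern as the lambda in HexToFloat above. This is only a rough sketch: the function name `multiple_ngrams` and the `ns` parameter are hypothetical, and joining tokens with a space is an assumption chosen to mirror what Spark's built-in NGram produces. It is plain Python so it can be checked without a Spark session.

```python
def multiple_ngrams(tokens, ns=(1, 2, 3)):
    """Return the n-grams of `tokens` for every n in `ns`, in one list.

    Each n-gram is its tokens joined by a single space. A size larger
    than the input contributes nothing.
    """
    grams = []
    for n in ns:
        if n < 1:
            raise ValueError('n-gram sizes must be positive')
        # Slide a window of width n over the token list.
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams


# Hex tokens are assumed here, similar to what HexToFloat parses.
print(multiple_ngrams(['4d', '5a', '90'], ns=(1, 2)))
# prints ['4d', '5a', '90', '4d 5a', '5a 90']
```

Inside a `_transform`, a function like this would presumably be wrapped with `udf` and a return type of `ArrayType(StringType())`, the same way HexToFloat wraps its conversion. That part is untested.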

whusym commented 6 years ago

How's the performance?

cbarrick commented 6 years ago

Not there yet. This is for the CNN. Haven't tried multiple n-grams on anything.

whusym commented 6 years ago

I think we can close this too, since the CNN doesn't really work and we may have better options for combining features.

cbarrick commented 6 years ago

It'd be nice to have this feature.

We don't have to have all of our issues closed for submission. The issues actually give us a good place to document the current state of the project. I'd say leave it open.

whusym commented 6 years ago

Right. Maybe people can help us with this in the future if they are interested.