cbarrick opened 6 years ago
These might be useful:

- `pyspark.ml.param.shared.HasInputCol`
- `pyspark.ml.param.shared.HasOutputCol`
I don't see documentation for them, but I saw them used in this SO answer: https://stackoverflow.com/questions/32331848/create-a-custom-transformer-in-pyspark-ml
Here's an example Transformer I wrote for the CNN:
```python
from pyspark import keyword_only
from pyspark.ml import Transformer
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, FloatType


class HexToFloat(Transformer, HasInputCol, HasOutputCol):
    @keyword_only
    def __init__(self, inputCol='features', outputCol='transform'):
        super().__init__()
        # HasInputCol/HasOutputCol already define the inputCol/outputCol
        # Params, so we only set their values rather than re-creating them.
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None):
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def _transform(self, df):
        in_col = self.getInputCol()
        out_col = self.getOutputCol()
        # Map each hex byte string (e.g. 'ff') to a float in [0, 1].
        f = udf(lambda bytes: [int(b, 16) / 255 for b in bytes],
                ArrayType(FloatType()))
        return df.withColumn(out_col, f(df[in_col]))
```
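The conversion inside that UDF can be exercised outside Spark. Here's the same logic as a plain-Python sketch (the function name `hex_to_float` is mine, for illustration):

```python
def hex_to_float(hex_bytes):
    """Scale each two-character hex string into [0, 1], as the UDF does."""
    return [int(b, 16) / 255 for b in hex_bytes]

hex_to_float(['00', 'ff'])  # -> [0.0, 1.0]
```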
How's the performance?
Not there yet. This is for the CNN. Haven't tried multiple n-grams on anything.
I think we can close this too, since the CNN doesn't really work and we may have better options for combining features.
It'd be nice to have this feature.
We don't have to have all of our issues closed for submission. The issues actually give us a good place to document the current state of the project. I'd say leave it open.
Right. Maybe people can help us on this in the future if they are interested.
We want a `MultipleNGram` class. Here's my understanding of writing custom Transformers:

- Subclass `pyspark.ml.Transformer`.
- Implement a `._transform` method.
- Support `inputCol` and `outputCol`. These should be attributes of the object of type `pyspark.ml.param.Param`.
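As a sketch of what a `MultipleNGram` transformer would compute per row, here is the logic in plain Python (not a Spark Transformer; the function name and default `ns` are illustrative). A real implementation would wrap this in a UDF inside `_transform`:

```python
from itertools import chain

def multiple_ngrams(tokens, ns=(1, 2, 3)):
    """Concatenate the n-grams of `tokens` for every n in `ns`, joining each
    n-gram's tokens with a space (as pyspark.ml.feature.NGram does)."""
    def ngrams(n):
        return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return list(chain.from_iterable(ngrams(n) for n in ns))

multiple_ngrams(['to', 'be', 'or', 'not'], ns=(1, 2))
# -> ['to', 'be', 'or', 'not', 'to be', 'be or', 'or not']
```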