combust / mleap

MLeap: Deploy ML Pipelines to Production
https://combust.github.io/mleap-docs/
Apache License 2.0

Mleap and python 3.8 #842

Closed drei34 closed 1 year ago

drei34 commented 1 year ago

Hi,

I have more of a question than a specific issue.

I'm trying to use Python 3.8 — does MLeap support it, and if so, which version should I run? I know some changes were made that seem to suggest it does, but I'm unsure whether they are in a stable release yet.

Thank you!

jsleight commented 1 year ago

I've been using mleap with Python 3.8 for quite a while. Both mleap v0.20 and v0.21.x should work; I can't remember if 0.19 did or not (probably yes).

drei34 commented 1 year ago

@jsleight Thanks! Have you ever used serializeToBundle? I'm on Python 3.8 and mleap 0.20, with Java 8 and Scala 2.12. I build the Spark session as below, but I can't serialize a pipeline. I pasted some code below.

[screenshot]

from pyspark.sql import SparkSession

def gen_spark_session():
    return SparkSession.builder.appName("happy").config(
        "hive.exec.dynamic.partition", "True").config(
        "hive.exec.dynamic.partition.mode", "nonstrict").config(
        "spark.jars.excludes", "net.sourceforge.f2j:arpack_combined_all").config(
        "spark.jars.packages",
        "ml.combust.mleap:mleap-spark_2.12:0.20.0,"
        "ml.combust.mleap:mleap-spark-base_2.12:0.20.0"
    ).enableHiveSupport().getOrCreate()

spark = gen_spark_session()

dfTrainFake = spark.createDataFrame(
    [
        (7,20,3,6,1,10,3,53948,245351,1),
        (7,20,3,6,1,10,3,53948,245351,1)
    ],
    ['dayOfWeek','hour','channel','platform','deviceType','adUnit','pageType','zip','advertiserId','label']
)
print(type(dfTrainFake))

# Do some stuff
pipelineModel.serializeToBundle(locMLeapModelPath, trainPredictions)
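
For what it's worth, the bundle path that is later shown to work in this thread is a "jar:file:" URI. A minimal sketch of the final call under that assumption (the path here is hypothetical, and pipelineModel / trainPredictions are the variables elided above):

# Sketch only: assumes pipelineModel and trainPredictions exist as above.
# The working example later in this thread uses a "jar:file:/..." style URI.
locMLeapModelPath = "jar:file:/tmp/pipeline.example.zip"  # hypothetical local path
pipelineModel.serializeToBundle(locMLeapModelPath, trainPredictions)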

jsleight commented 1 year ago

I suspect you need one or more additional jars. I have all of these as dependencies:

ml.combust.mleap:mleap-spark_2.12
ml.combust.mleap:mleap-core_2.12
ml.combust.mleap:mleap-runtime_2.12
ml.combust.mleap:mleap-avro_2.12
ml.combust.mleap:mleap-spark-extension_2.12
ml.combust.mleap:mleap-spark-base_2.12
ml.combust.bundle:bundle-ml_2.12
ml.combust.bundle:bundle-hdfs_2.12
ml.combust.mleap:mleap-tensorflow_2.12
ml.combust.mleap:mleap-xgboost-spark_2.12

That full list is probably overkill, but you probably need one or both of the bundle ones.
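
For reference, a minimal sketch of what pulling in just the mleap-spark and bundle artifacts through spark.jars.packages could look like (the 0.20.0 versions are assumed to match the rest of this thread):

from pyspark.sql import SparkSession

# Sketch: the two mleap-spark artifacts plus the two bundle artifacts.
spark = (
    SparkSession.builder.appName("happy")
    .config(
        "spark.jars.packages",
        "ml.combust.mleap:mleap-spark_2.12:0.20.0,"
        "ml.combust.mleap:mleap-spark-base_2.12:0.20.0,"
        "ml.combust.bundle:bundle-ml_2.12:0.20.0,"
        "ml.combust.bundle:bundle-hdfs_2.12:0.20.0"
    )
    .getOrCreate()
)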

drei34 commented 1 year ago

Unfortunately this still does not work; I get the same error. I build the Spark session as below (is it right to use spark.jars.packages?). I also ran the two export commands below to set the interpreter paths, since it was complaining about a Python 3.7 / Python 3.8 mismatch (I can probably add these to .bashrc). Is the only thing left to try all of the jars.packages above?

export PYSPARK_PYTHON=/opt/conda/miniconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/conda/miniconda3/bin/python
[screenshot]
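
As an aside, a small sketch of setting the same interpreter paths from Python rather than the shell, assuming the miniconda path above; this only has an effect if it runs before the SparkSession (and its JVM) is created:

import os

# Point the driver and the workers at the same Python 3.8 interpreter (assumed path).
os.environ["PYSPARK_PYTHON"] = "/opt/conda/miniconda3/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "/opt/conda/miniconda3/bin/python"
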
drei34 commented 1 year ago

I added all of these except the ones below and still no luck.

ml.combust.mleap:mleap-tensorflow_2.12
ml.combust.mleap:mleap-xgboost-spark_2.12
ml.combust.mleap:mleap-avro_2.12

I am doing this:

from pyspark.sql import SparkSession

def gen_spark_session():
    return SparkSession.builder.appName("happy").config(
        "hive.exec.dynamic.partition", "True").config(
        "hive.exec.dynamic.partition.mode", "nonstrict").config(
        "spark.jars.excludes", "net.sourceforge.f2j:arpack_combined_all").config(
        "spark.jars.packages",
        "ml.combust.mleap:mleap-spark_2.12:0.20.0,"
        "ml.combust.mleap:mleap-spark-base_2.12:0.20.0,"
        "ml.combust.mleap:mleap-spark_extension_2.12:0.20.0,"
        "ml.combust.mleap:mleap-runtime_2.12:0.20.0,"
        "ml.combust.mleap:mleap-core_2.12:0.20.0,"
        "ml.combust.bundle:bundle-ml_2.12:0.20.0,"
        "ml.combust.bundle:bundle-hdfs_2.12:0.20.0"
        ).enableHiveSupport().getOrCreate()

spark = gen_spark_session()
drei34 commented 1 year ago

@jsleight Is it possible the problem is the interaction between conda and Spark? I'm thinking the issue might be like the one in the link below. They very carefully set some environment variables to make their error go away (it's not the same error as mine, but the complaint is again that the JVM is missing something, so I think it's related). Thanks in advance, or do you know someone who might know the issue here? I'm a bit stuck. Also, it seems that I have py4j-0.10.9-src.zip ... I imagine this is OK?

Additionally, are these jars needed to build and ultimately serialize an xgboost model in your pipeline?

ml.combust.mleap:mleap-tensorflow_2.12
ml.combust.mleap:mleap-xgboost-spark_2.12

ENV variables: https://sparkbyexamples.com/pyspark/pyspark-py4j-protocol-py4jerror-org-apache-spark-api-python-pythonutils-jvm/

jsleight commented 1 year ago

Hmmm, I can't reproduce this. Can you post a complete minimal example? E.g., using Python 3.8.0, pyspark 3.2.2, mleap 0.20.0, and py4j 0.10.9.5, this code works for me.

import pyspark
import pyspark.ml
spark = (
    pyspark.sql.SparkSession.builder.appName("happy")
    .config("hive.exec.dynamic.partition", "True")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .config("spark.jars.excludes","net.sourceforge.f2j:arpack_combined_all")
    .config("spark.jars.packages",
        "ml.combust.mleap:mleap-spark_2.12:0.20.0,"
        "ml.combust.mleap:mleap-spark-base_2.12:0.20.0,"
        "ml.combust.bundle:bundle-ml_2.12:0.20.0,"
        "ml.combust.bundle:bundle-hdfs_2.12:0.20.0"
    )
    .enableHiveSupport().getOrCreate()
)

features = ['dayOfWeek','hour','channel','platform','deviceType','adUnit','pageType','zip','advertiserId']
df = spark.createDataFrame(
    [
        (7,20,3,6,1,10,3,53948,245351,1),
        (7,20,3,6,1,10,3,53948,245351,1)
    ],
    features + ['label']
)

pipeline = pyspark.ml.Pipeline(stages=[
    pyspark.ml.feature.VectorAssembler(inputCols=features, outputCol="features"),
    pyspark.ml.classification.LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)
predictions = model.transform(df)

from mleap.pyspark import spark_support
local_path = "jar:file:/tmp/pyspark.example.zip"
model.serializeToBundle(local_path, predictions)
deserialized_model = pyspark.ml.PipelineModel.deserializeFromBundle(local_path)
deserialized_model.transform(df).show()
drei34 commented 1 year ago

Thanks! So I have pyspark 3.1.3, mleap 0.20.0, and Python 3.8.15. I am on a Google Dataproc 2.0 box and I put the jars above in /usr/lib/spark/jars.

I took your code (let's fix that) and it gave me this JVM error: Py4JError: ml.combust.mleap.spark.SimpleSparkSerializer does not exist in the JVM. I have py4j 0.10.9; is that also what you have? My speculation is this: https://sparkbyexamples.com/pyspark/pyspark-py4j-protocol-py4jerror-org-apache-spark-api-python-pythonutils-jvm/ but I have not been able to fix my problem. What are your environment variables? Are you using Anaconda?

drei34 commented 1 year ago

Specifically, if I close a terminal I get this error first:

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 477, in main
    ("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 3.7 than that in driver 3.8, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

I fix this by setting the variables below, and then I get the JVM error.

export PYSPARK_PYTHON=/opt/conda/miniconda3/bin/python
export PYSPARK_DRIVER_PYTHON=/opt/conda/miniconda3/bin/python

I.e. this:

[screenshots]
jsleight commented 1 year ago

> Additionally, are these jars needed to build and ultimately serialize an xgboost model in your pipeline?

You'll need the xgboost jar to serialize an xgboost model. Similarly, the tensorflow jar is needed if you want to serialize a tensorflow model.
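
In that case, a sketch of the packages string with the xgboost artifact appended (the 0.20.0 versions are an assumption to match the rest of the thread):

# Sketch: same artifacts as before, plus the xgboost one
# (only needed when the pipeline actually contains an XGBoost stage).
mleap_packages = (
    "ml.combust.mleap:mleap-spark_2.12:0.20.0,"
    "ml.combust.mleap:mleap-spark-base_2.12:0.20.0,"
    "ml.combust.bundle:bundle-ml_2.12:0.20.0,"
    "ml.combust.bundle:bundle-hdfs_2.12:0.20.0,"
    "ml.combust.mleap:mleap-xgboost-spark_2.12:0.20.0"
)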

jsleight commented 1 year ago

I'm using pip and virtualenv, and I don't have any relevant env variables.

I have py4j 0.10.9.5, but it is just being pulled in as a dependency of pyspark.

MLeap 0.20.0 is built for Spark 3.2.0, which might cause some of these problems.

drei34 commented 1 year ago

OK, thank you again; I'll try 0.19.0. That should work with Spark 3.1.3, right? It seems so from the website: https://github.com/combust/mleap

Update: I downgraded mleap to 0.19.0 and pulled the old jars from maven, but still have the same error.

import pyspark
import pyspark.ml
spark = (
    pyspark.sql.SparkSession.builder.appName("happy")
    .config("hive.exec.dynamic.partition", "True")
    .config("hive.exec.dynamic.partition.mode", "nonstrict")
    .config("spark.jars.excludes","net.sourceforge.f2j:arpack_combined_all")
    .config("spark.jars.packages",
        "ml.combust.mleap:mleap-spark_2.12:0.19.0,"
        "ml.combust.mleap:mleap-spark-base_2.12:0.19.0,"
        "ml.combust.bundle:bundle-ml_2.12:0.19.0,"
        "ml.combust.bundle:bundle-hdfs_2.12:0.19.0"
    )
    .enableHiveSupport().getOrCreate()
)

features = ['dayOfWeek','hour','channel','platform','deviceType','adUnit','pageType','zip','advertiserId']
df = spark.createDataFrame(
    [
        (7,20,3,6,1,10,3,53948,245351,1),
        (7,20,3,6,1,10,3,53948,245351,1)
    ],
    features + ['label']
)

pipeline = pyspark.ml.Pipeline(stages=[
    pyspark.ml.feature.VectorAssembler(inputCols=features, outputCol="features"),
    pyspark.ml.classification.LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)
predictions = model.transform(df)

from mleap.pyspark import spark_support
local_path = "jar:file:/tmp/pyspark.example.zip"
model.serializeToBundle(local_path, predictions)
deserialized_model = pyspark.ml.PipelineModel.deserializeFromBundle(local_path)
deserialized_model.transform(df).show()
jsleight commented 1 year ago

https://github.com/combust/mleap#mleapspark-version has the version compatibility. 0.19.0 was spark 3.0.2. All of these compatibilities are just the versions which are explicitly tested, so other things might work but I don't have any conclusive evidence one way or another.

The class not being in the JVM would support your idea of something being weird with your env variables.

drei34 commented 1 year ago

Gotcha, I'll keep looking. Another weird thing is that I see the error comes from the code below. Is there some request being sent somewhere? It seems to complain that the Java side is empty, so I wonder if the problem is some request getting blocked due to an IP setting (very unsure).

[screenshot]
drei34 commented 1 year ago

Digging around more, I think this should work: pyspark --packages ml.combust.mleap:mleap-spark_2.12:0.16.0 predict.py, where I am specifying the Scala and MLeap versions. This is from https://github.com/combust/mleap-docs/issues/8#issuecomment-473163002 and is similar to just specifying the packages when building the context, I think. I made the predict file the same as above, but still no luck. I moved the MLeap version down, but this seems like a bad sign: I should not have to go back to MLeap versions from years ago to make this all work. This is Python 3.8, MLeap 0.16, and PySpark 3.1.3. It would seem that this should all be OK, but I'm at a loss as to what to do about this error. I guess the only thing left to do is to make a conda env with Python 3.8.0, pyspark 3.2.2, mleap 0.20.0, and py4j 0.10.9.5 and hope it works. I believe there is something wrong with the paths on the Google machines or something; I can't come up with anything else on this frustrating issue.

[screenshot]
import pyspark
import pyspark.ml

spark = (
    pyspark.sql.SparkSession.builder
    .config(
        "spark.jars.packages",
        "ml.combust.mleap:mleap-spark_2.12:0.16.0,"
        "ml.combust.mleap:mleap-runtime_2.12:0.16.0,"
        "ml.combust.bundle:bundle-ml_2.12:0.16.0,"
        "ml.combust.bundle:bundle-hdfs_2.12:0.16.0"
    )
    .config("spark.jars.excludes", "net.sourceforge.f2j:arpack_combined_all")
    .getOrCreate()
)

features = ['dayOfWeek','hour','channel','platform','deviceType','adUnit','pageType','zip','advertiserId']
df = spark.createDataFrame(
    [
        (7,20,3,6,1,10,3,53948,245351,1),
        (7,20,3,6,1,10,3,53948,245351,1)
    ],
    features + ['label']
)

# Works just fine on 1.4, not on 2.0.
pipeline = pyspark.ml.Pipeline(stages=[
    pyspark.ml.feature.VectorAssembler(inputCols=features, outputCol="features"),
    pyspark.ml.classification.LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)
predictions = model.transform(df)

from mleap.pyspark import spark_support
local_path = "jar:file:/tmp/pyspark.example.zip"
model.serializeToBundle(local_path, predictions)
deserialized_model = pyspark.ml.PipelineModel.deserializeFromBundle(local_path)
deserialized_model.transform(df).show()
jsleight commented 1 year ago

In my experience, the py4j "Answer from Java side is empty" error usually means Spark's JVM died. The test case I did above was with Java 8, but I'm pretty sure Java 11 will work too (and certainly Java 11 works with the latest mleap). Notably, Java 17 won't work because Spark doesn't support that Java version.
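
One way to confirm which Java version the live Spark JVM is actually on, if that becomes a suspect, is to ask it through the py4j gateway. This is a diagnostic sketch that relies on the internal _jvm handle and assumes an existing SparkSession named spark:

# Spark 3.x supports Java 8 and 11; Java 17 is not supported by these versions.
java_version = spark.sparkContext._jvm.java.lang.System.getProperty("java.version")
print(java_version)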

drei34 commented 1 year ago

Actually, looking further up in the stack trace, I see this error as the root cause: Exception in thread "Thread-4" java.lang.NoClassDefFoundError: ml/combust/bundle/serializer/SerializationFormat ...

I added the jars directly, as below, and this is still happening. I'm really unsure why. Would it be possible to Zoom with anyone on the team about this?

[screenshot]

import pyspark
import pyspark.ml

spark = (
    pyspark.sql.SparkSession.builder
    .config(
        "spark.jars",
        "/usr/lib/spark/jars/mleap-spark-base_2.12-0.16.0.jar,"
        "/usr/lib/spark/jars/mleap-spark_2.12-0.16.0.jar,"
        "/usr/lib/spark/jars/mleap-runtime_2.12-0.16.0.jar,"
        "/usr/lib/spark/jars/bundle-ml_2.12-0.16.0.jar,"
        "/usr/lib/spark/jars/bundle-hdfs_2.12-0.16.0.jar"
    )
    .config("spark.jars.excludes", "net.sourceforge.f2j:arpack_combined_all")
    .getOrCreate()
)

features = ['dayOfWeek','hour','channel','platform','deviceType','adUnit','pageType','zip','advertiserId']

df = spark.createDataFrame(
    [
        (7,20,3,6,1,10,3,53948,245351,1),
        (7,20,3,6,1,10,3,53948,245351,1)
    ],
    features + ['label']
)

pipeline = pyspark.ml.Pipeline(stages=[
    pyspark.ml.feature.VectorAssembler(inputCols=features, outputCol="features"),
    pyspark.ml.classification.LogisticRegression(featuresCol="features", labelCol="label"),
])
model = pipeline.fit(df)
predictions = model.transform(df)

from mleap.pyspark import spark_support
local_path = "jar:file:/tmp/pyspark.example.zip"
model.serializeToBundle(local_path, predictions)
deserialized_model = pyspark.ml.PipelineModel.deserializeFromBundle(local_path)
deserialized_model.transform(df).show()

drei34 commented 1 year ago

Basically my error looks like this and the issue seems to be some path problem: https://github.com/combust/mleap-docs/issues/8#issuecomment-427628980

jsleight commented 1 year ago

It is interesting that your error is complaining about a path with / separators in it. I think it usually uses . separators? Maybe I'm misremembering, though.

drei34 commented 1 year ago

So the jar bundle-ml_2.12-0.19.0.jar (or 0.16.0) does have this class in it, and the path does seem to use '/'. Another question: I know the jars need to be added to the Spark jars folder, but do they also need to be added to pyspark in some way? It's like pyspark does not see the right thing (pyspark 3.1.3, mleap <= 0.19.0).

[screenshot]
drei34 commented 1 year ago

My speculation is that py4j is not using the right jars in some way. Conda might have its own py4j and that is the confusion, but I'm still unsure. Just wondering if you have seen this before.
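
One quick way to test the "wrong py4j" theory is to print which py4j and pyspark modules Python is actually importing. A small diagnostic sketch:

import py4j
import pyspark

# If py4j resolves to a conda site-packages copy instead of the
# py4j-0.10.9-src.zip shipped under /usr/lib/spark/python/lib, the
# Python side and the Spark JVM side can get out of sync.
print(py4j.__version__, py4j.__file__)
print(pyspark.__version__, pyspark.__file__)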

drei34 commented 1 year ago

The error seems to come from /usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j, and I can see imports happening there. I put some print statements in the java_gateway.py file in that directory, since that's where the exception is raised, to see what is being imported. The weird part is that most classes look like org.apache.spark.ml.classification.LogisticRegression (I'm still using your example), but the mleap ones do not have the org.apache.spark prefix, even though mleap was installed via pip. Why would the prefix not be added to the path for them? Basically, the exception complains that ml.combust.mleap.spark.SimpleSparkSerializer is not found, but I think it does have access to org.apache.spark.ml.combust.mleap.spark.SimpleSparkSerializer and that's what it should be using. I'm trying to figure out why this is happening, and I'll also try this hack fix to see if it goes anywhere.

drei34 commented 1 year ago

@jsleight Yeah, so I fixed that import problem (I know there's a better solution, but adding the prefix seems to bring in the needed class), but then I get another error: 'JavaPackage' object is not callable. Unsure why; this error seems to show up when you have mismatched versions, but I don't think that's the case here. I'm on Java 8, Scala 2.12, MLeap 0.19.0, PySpark 3.1.3 and Python 3.8.15.

[screenshot]
drei34 commented 1 year ago

This is my Spark context, by the way; I added all the right jars, I think:

spark = (
    pyspark.sql.SparkSession.builder
    .config(
        "spark.jars",
        "/usr/lib/spark/jars/mleap-spark-base_2.12-0.19.0.jar,"
        "/usr/lib/spark/jars/mleap-spark_2.12-0.19.0.jar,"
        "/usr/lib/spark/jars/mleap-runtime_2.12-0.19.0.jar,"
        "/usr/lib/spark/jars/bundle-ml_2.12-0.19.0.jar,"
        "/usr/lib/spark/jars/bundle-hdfs_2.12-0.19.0.jar,"
        "/usr/lib/spark/jars/mleap-spark-extension_2.12-0.19.0.jar"
    )
    .config("spark.jars.excludes", "net.sourceforge.f2j:arpack_combined_all")
    .getOrCreate()
)

jsleight commented 1 year ago

In my experience, the 'JavaPackage' object is not callable error has always come from the Spark JVM not having the jars it needs.

When I run your code examples above (using mleap 0.19.0), it works for me.

drei34 commented 1 year ago

Is there a way to check that the session I create has what it needs? Given the jars above, I should have everything, and the builder does give me a context, so those jars are at least where they need to be.
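
One diagnostic that gets at this, sketched under the assumption of an existing SparkSession named spark and using Spark/py4j internals: ask the gateway whether the MLeap serializer class is visible to the JVM. If the lookup comes back as a JavaPackage rather than a JavaClass, the jar is not on the driver's classpath, which also matches the 'JavaPackage' object is not callable error above.

from py4j.java_gateway import JavaClass

jvm = spark.sparkContext._jvm
# Resolves to a JavaClass when the bundle/mleap jars are on the JVM classpath,
# and to a plain JavaPackage when they are not.
serializer = jvm.ml.combust.mleap.spark.SimpleSparkSerializer
print(type(serializer), isinstance(serializer, JavaClass))

# The jars handed to Spark directly can also be inspected from the conf.
print(spark.sparkContext.getConf().get("spark.jars", ""))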

drei34 commented 1 year ago

See https://github.com/combust/mleap/issues/845