amoyrand opened this issue 3 years ago
The `pipelines` folder is not in your path. See main.py; basically it can't find the module.
In main.py, I have the lines:

```python
import os
import sys

dirname = os.path.abspath(os.path.dirname(__file__))
sys.path.insert(0, os.path.join(dirname, 'pipelines'))
```

But I still get the `ModuleNotFoundError: No module named 'pipelines'` error...
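One likely cause, as a hedged aside: inserting the `pipelines` folder itself onto `sys.path` exposes the modules *inside* it under their bare names, not the package name `pipelines`. For `import pipelines` to resolve, the *parent* directory must be on the path. A minimal sketch, assuming main.py sits next to the `pipelines` folder:

```python
import os
import sys

# Put the directory that *contains* the pipelines package on sys.path,
# so that `import pipelines` resolves. Inserting pipelines/ itself would
# only expose the modules inside it, not the package name.
dirname = os.path.abspath(os.path.dirname(__file__))
sys.path.insert(0, dirname)

import pipelines  # noqa: E402  (import after the sys.path tweak)
```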
Have you renamed the pipelines folder, or moved it relative to main.py? What version of Python are you using?
Hello @simondmorias. I finally got this working, thanks for your tips.
I'm now facing another problem:
I'm using Sedona with Databricks. When running my code in a notebook everything goes well (I installed the third-party jars on the cluster), but when running with databricks-connect I'm getting a `TypeError: 'JavaPackage' object is not callable`
when running:

```python
from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator
from sedona.utils import KryoSerializer, SedonaKryoRegistrator

spark = SparkSession. \
    builder. \
    appName('appName'). \
    config("spark.serializer", KryoSerializer.getName). \
    config("spark.kryo.registrator", SedonaKryoRegistrator.getName). \
    config('spark.jars.packages',
           'org.apache.sedona:sedona-python-adapter-3.0_2.12:1.0.0-incubating,'
           'org.datasyslab:geotools-wrapper:geotools-24.0'). \
    getOrCreate()

SedonaRegistrator.registerAll(spark)
```
I guess the jars are not being loaded correctly locally.
Have you ever experienced this? Do you know how to import local jars with databricks-connect?
Thank you
On your local machine run `databricks-connect get-jar-dir` and add the jars there.
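A minimal sketch of that step, assuming the Sedona and geotools jars have already been downloaded into a local `downloaded_jars/` folder (a hypothetical path; fetch the jars from Maven Central by hand or with your build tool):

```python
import shutil
import subprocess
from pathlib import Path

# Ask databricks-connect where its local Spark installation keeps its jars.
jar_dir = Path(subprocess.run(
    ["databricks-connect", "get-jar-dir"],
    capture_output=True, text=True, check=True,
).stdout.strip())

# Copy every downloaded jar into that directory so the local JVM sees them.
for jar in Path("downloaded_jars").glob("*.jar"):
    shutil.copy(jar, jar_dir)
```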
Hello. I got `registerAll` working, but now I have another issue with databricks-connect:
```python
from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator
from sedona.utils import KryoSerializer, SedonaKryoRegistrator

sparkSession = SparkSession. \
    builder. \
    master("local[*]"). \
    appName('appName'). \
    config("spark.serializer", KryoSerializer.getName). \
    config("spark.kryo.registrator", SedonaKryoRegistrator.getName). \
    config('spark.jars.packages',
           'org.apache.sedona:sedona-python-adapter-3.0_2.12:1.0.0-incubating,'
           'org.datasyslab:geotools-wrapper:geotools-24.0'). \
    getOrCreate()

SedonaRegistrator.registerAll(sparkSession)

sparkSession.sql('describe function st_point').show()
sparkSession.sql("SELECT st_point(41.40338, 2.17403) AS geometry").show()
```
Here I can describe the UDF `st_point`, but when trying to use it, it fails with:

```
Undefined function: 'st_point'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 0
```

Full log here: https://filebin.net/yzy0tn58myzso8l4/log.txt?t=cntz46u4
Any idea what happens here?
Thanks a lot for your help
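One hedged way to narrow this down: since `describe function` succeeds while the call fails, it may help to check whether `st_point` actually made it into the session catalog that executes the query. A small diagnostic sketch:

```python
# List the functions PySpark's catalog reports for this session and
# check whether st_point is among them.
registered = {f.name for f in sparkSession.catalog.listFunctions()}
print("st_point" in registered)
```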
I would post on StackOverflow; that is more of a general Spark problem than an issue with this container.
@amoyrand how did you solve the original problem where pipelines were not detected?
Hello, I'm trying to replicate your example in my own project, but I have an issue with a Python UDF: I always run into this error:

```
ModuleNotFoundError: No module named 'pipelines'
```

I simply changed your code as follows:
amazon.py:
and it gives me this error:
Any idea how to solve this?