Open · tyknkd opened 3 months ago
For comparison, I created an equivalent dockerized Spring Boot app here.
Notably, no exception is thrown.
This seems to suggest that the issue lies within kotlin-spark.
In the Spring Boot version, Spark appears to create a local temporary directory while preparing the Spark session, before invoking the broadcast function:
app-1 | 2024-04-10 09:00:43.393 INFO 1 --- [nio-8888-exec-1] o.a.s.SparkEnv : Registering BlockManagerMasterHeartbeat
app-1 | 2024-04-10 09:00:43.410 INFO 1 --- [nio-8888-exec-1] o.a.s.s.DiskBlockManager : Created local directory at /tmp/blockmgr-c9cef486-62f2-431a-8408-1e48b933da34
app-1 | 2024-04-10 09:00:43.436 INFO 1 --- [nio-8888-exec-1] o.a.s.s.m.MemoryStore : MemoryStore started with capacity 2.1 GiB
. . .
app-1 | 2024-04-10 09:00:43.829 INFO 1 --- [nio-8888-exec-1] o.a.s.s.m.MemoryStore : Block broadcast_0 stored as values in memory (estimated size 72.0 B, free 2.1 GiB)
app-1 | 2024-04-10 09:00:43.856 INFO 1 --- [nio-8888-exec-1] o.a.s.s.m.MemoryStore : Block broadcast_0_piece0 stored as bytes in memory (estimated size 146.0 B, free 2.1 GiB)
app-1 | 2024-04-10 09:00:43.858 INFO 1 --- [ckManagerMaster] o.a.s.s.BlockManagerInfo : Added broadcast_0_piece0 in memory on 1d1b66d9e151:43605 (size: 146.0 B, free: 2.1 GiB)
app-1 | 2024-04-10 09:00:43.862 INFO 1 --- [nio-8888-exec-1] o.a.s.SparkContext : Created broadcast 0 from broadcast at SparkBroadcast.java:30
In the Ktor app, the exception seems to be thrown at roughly the point where the temporary directory would have been created:
app-1 | 2024-04-10 09:53:18.476 [eventLoopGroupProxy-4-1] WARN o.a.spark.sql.internal.SharedState - URL.setURLStreamHandlerFactory failed to set FsUrlStreamHandlerFactory
app-1 | 2024-04-10 09:53:18.477 [eventLoopGroupProxy-4-1] INFO o.a.spark.sql.internal.SharedState - Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
app-1 | 2024-04-10 09:53:18.482 [eventLoopGroupProxy-4-1] WARN o.a.spark.sql.internal.SharedState - Cannot qualify the warehouse path, leaving it unqualified.
app-1 | org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "file"
. . .
app-1 | 2024-04-10 09:53:19.148 [eventLoopGroupProxy-4-1] INFO o.a.spark.storage.memory.MemoryStore - Block broadcast_0 stored as values in memory (estimated size 72.0 B, free 2.1 GiB)
app-1 | 2024-04-10 09:53:19.180 [eventLoopGroupProxy-4-1] INFO o.a.spark.storage.memory.MemoryStore - Block broadcast_0_piece0 stored as bytes in memory (estimated size 150.0 B, free 2.1 GiB)
app-1 | 2024-04-10 09:53:19.182 [dispatcher-BlockManagerMaster] INFO o.a.spark.storage.BlockManagerInfo - Added broadcast_0_piece0 in memory on dcb91ee36ad3:36665 (size: 150.0 B, free: 2.1 GiB)
app-1 | 2024-04-10 09:53:19.186 [eventLoopGroupProxy-4-1] INFO org.apache.spark.SparkContext - Created broadcast 0 from broadcast at Broadcast.kt:61
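The `URL.setURLStreamHandlerFactory failed to set FsUrlStreamHandlerFactory` warning in the log above has a simple JVM-level explanation: the JDK allows a `URLStreamHandlerFactory` to be installed at most once per process, and any later attempt throws an `Error`, which Spark's `SharedState` catches and logs. A minimal sketch (plain JDK, no Spark required; `FactoryOnce` and its no-op factory are hypothetical names for illustration):

```java
import java.net.URL;
import java.net.URLStreamHandlerFactory;

// Demonstrates that URL.setURLStreamHandlerFactory is one-shot per JVM:
// the second call (or the first, if something else already installed a
// factory earlier in the process) throws an Error.
public class FactoryOnce {
    static String tryTwice() {
        URLStreamHandlerFactory noop = protocol -> null;
        try {
            URL.setURLStreamHandlerFactory(noop); // succeeds only on the very first call in the JVM
        } catch (Error alreadySet) {
            // some other code installed a factory first; same failure mode
            return "Error: " + alreadySet.getMessage();
        }
        try {
            URL.setURLStreamHandlerFactory(noop); // a second install is always rejected
            return "no error";
        } catch (Error e) {
            return "Error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryTwice());
    }
}
```

This is why the warning is harmless on its own: Spark catches the `Error` and continues, exactly as the logged output shows.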
As I responded on Slack:
The file-reading exception is printed when the DECIMAL encoder is pre-loaded by the Kotlin Spark API Encoders file. Can you try to instantiate the same encoder in the non-Kotlin Spark project to see what happens?
In the Spark 3.4+ branch of the project, the encoding part has been completely overhauled, so this issue won't be there anymore. But it's still a WIP.
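For readers who want to follow the maintainer's suggestion, the encoder in question can be instantiated in a plain (non-Kotlin) Spark project with Spark's public `Encoders` API. A hedged sketch, assuming spark-sql is on the classpath; `DecimalEncoderCheck` is a hypothetical class name, and `Encoders.DECIMAL()` is Spark's built-in encoder for `java.math.BigDecimal`:

```java
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

// Instantiates the same built-in DECIMAL encoder that the Kotlin Spark API
// pre-loads, to check whether the file-system warning also appears in a
// plain Java Spark project.
public class DecimalEncoderCheck {
    public static void main(String[] args) {
        Encoder<java.math.BigDecimal> enc = Encoders.DECIMAL();
        System.out.println(enc.schema());
    }
}
```

If the warning does not appear here, that would support the diagnosis that the pre-loading path in the Kotlin Spark API, rather than the encoder itself, triggers the logged exception.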
Your program still executes fine. It's a caught exception that's just logged to the output.
Thank you so much for pinpointing the source of the exception. I'm glad to know it's not because of an error on my part. Since this issue will go away with the new release, it seems like spending any more time on it would be purely academic, so I won't trouble you any more and will let you get back to your more important work on the 3.4+ fix. Thanks again and have a great weekend!
When `withSpark` is invoked dynamically in a dockerized Ktor web app, an `UnsupportedFileSystemException` is thrown.
Expected behavior: no exception is thrown.
A GitHub repo is here.
- Broadcast.kt (from the kotlin-spark-api example)
- Routing.kt
- Dockerfile
- compose.yaml
In a shell, run:
Then, open http://localhost:8888 in a browser.
An `org.apache.hadoop.fs.UnsupportedFileSystemException` will be thrown: