databricks / spark-deep-learning

Deep Learning Pipelines for Apache Spark
https://databricks.github.io/spark-deep-learning
Apache License 2.0

Curious about the status of the Scala API #17

Open redsofa opened 7 years ago

redsofa commented 7 years ago

Thank you for creating this library. I haven't tried it yet, but I'm assuming there is a Scala API for this?

sueann commented 7 years ago

Hi @redsofa, there is currently no Scala API. Are there any particular parts / workflows that would be useful for you in Scala?

redsofa commented 7 years ago

The utility functions to read images and decode them in a distributed fashion sound useful. The fast transfer learning is really neat; I can see that being useful. The UDFs are great. I'm guessing these are easy enough to write once the other pieces exist in Scala. I would just like to work in Scala and not have to fiddle with PySpark. I'm surprised that a Scala API wasn't the first thing out.

sueann commented 7 years ago

That totally makes sense. The reason a Python API was prioritized is that most deep learning work happens in Python: Keras, a popular deep learning framework, is Python-only. However, that is a bit contrary to the usual Spark philosophy of Scala being the core implementation/API, and hence does not serve many of the core Spark users. It really makes sense that features like image handling and transfer learning, which do not expose the underlying deep learning frameworks at all (and thus require no Python workflow to use them), should be available in Scala. Thanks so much for bringing this up!

rxin commented 7 years ago

BTW @redsofa feel free to submit a pull request to add support for Scala!

dengelha commented 7 years ago

Thank you for open-sourcing this great library! In my view, it absolutely makes sense to develop a Scala API. I actually learned Scala because of Spark, as I considered it the language preferred by the Spark developers. I would love to see some progress on a Scala API for RNNs for NLP, especially named-entity recognition. I gather even the Stanford CoreNLP folks are adopting TensorFlow. So any progress in combining Spark, NLP, and TensorFlow in Scala would be very, very much appreciated. Thank you for your efforts in bringing deep learning on Spark to life, hopefully in Scala :-)

thunterdb commented 7 years ago

A Scala API for some of the transformers and estimators is definitely under consideration.

For the high-level transformers, such as image transformers or well-known models, it is fairly easy to do if you have some understanding of the Scala API of Spark, as the heavy lifting is done by TensorFrames.

The Keras and TensorFlow transformers are much more tied to the Python ecosystem, though, simply because these libraries are mostly useful from Python. For instance, the C++ part of TensorFlow does not include automatic differentiation and all the optimizers (yet), so a lot of things would need to be replicated in Scala.

maziyarpanahi commented 5 years ago

Hi guys, any update for Scala APIs at least for parts that don't need any Python packages such as loading images, etc.? Many thanks.

thunterdb commented 5 years ago

Hello Maziyar, what do you want to do besides loading images? If you just want to do image manipulation, you can use Spark 2.4's built-in support for that; see this ticket for some examples: https://issues.apache.org/jira/browse/SPARK-22666
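For reference, a minimal Scala sketch of the Spark 2.4 built-in image data source mentioned above; the directory path is a placeholder, and `master("local[*]")` is just for trying it out locally:

```scala
import org.apache.spark.sql.SparkSession

object ImageLoadingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("image-loading-sketch")
      .master("local[*]") // local run for illustration only
      .getOrCreate()

    // Spark 2.4's "image" data source produces one row per file, with a
    // struct column "image" containing origin, height, width, nChannels,
    // mode, and data (the raw decoded bytes).
    val images = spark.read.format("image")
      .option("dropInvalid", "true")  // skip files that fail to decode
      .load("/path/to/images")        // placeholder directory

    images.select("image.origin", "image.height", "image.width").show(5)

    spark.stop()
  }
}
```

This keeps the whole workflow in Scala, with no Python dependency, which is exactly the use case raised in this thread.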


maziyarpanahi commented 5 years ago

Hi @thunterdb,

Thanks for the reply. I understand loading images is now possible in Spark 2.4 (or, with some workarounds, even in 2.3), but most of the Hadoop vendors (Hortonworks, Cloudera, etc.) only move to new versions of Spark after several months.

However, my concern when it comes to any Spark package/library is not just about supporting some small functionality for Scala. It's more about not abandoning the language that started it all!

Current situation: Python has started to dominate Scala in the Spark ecosystem, simply because of the many packages and libraries most data scientists and researchers have been using for years. I understand it is easier to build upon those when starting a new Spark project in the same data science domain (ML, DL, etc.). However, failing to do the same for Scala, the very language the entire Spark ecosystem is built upon, with the fastest performance and the most up-to-date APIs, is not acceptable.

Future of Scala in Spark: If everyone keeps doing this, relying on what was developed before, locking the entire project down to Python, and giving Scala only some small functionality (which misses the point and is pretty much useless), then pretty soon Scala will be used only for ETL/data engineering, and everything related to ML, DL, and AI in data science will be for Python and R only (this is what we see today, but it will become an established fact in the future).

I think treating Scala like this is unfortunate, and a company like Databricks has some responsibility not to let a language like Scala be used for only some small data engineering operations in Spark.

PS: Sorry if this sounds unpleasant. I appreciate all the efforts of an open-source community no matter what. I really love Spark/Scala (I learned Scala for Spark!). I also do machine learning, deep learning, and scientific visualization in Spark. Though recently I realized Scala is not the language for any of them compared to Python/R, and it doesn't feel good to think it may never be. :)

Thanks again @thunterdb for the reply and I appreciate it. "My comment is not about you, it's more about me :)"

kevinykuo commented 5 years ago

I wanted to point out that, in general, providing a Scala API for a Spark extension also means that R users get access to it (via sparklyr).

thunterdb commented 5 years ago

@maziyarpanahi thanks for expressing your concerns; I fielded a number of questions along the same lines at the Spark Summit. The fact that these questions are being raised is a testament to the universal appeal of Spark across different communities that work with different pieces of the API.

To clarify: Scala is the language in which Spark is written. There is no question that it is here to stay, and the core innovations inside Spark (Project Hydrogen, continuous processing, Data Source V2) are all written in Scala first, simply for performance reasons.

That being said, Spark strives to adopt the customs of the different communities it caters to, and to be as transparent as possible in users' workflows. In the deep learning space, as you point out, most of the libraries are accessed from Python, which is why the focus of the API has been on serving users who are already proficient in the Python ecosystem. There are already some efforts underway for Java/Scala, such as DeepLearning4J and TensorFrames, and with the stabilization of the Hydrogen API, I expect to see good progress in that area.

maziyarpanahi commented 5 years ago

@thunterdb Thanks for your reasonable response; I really appreciate it. What you have said absolutely makes sense. I have also seen the efforts by Intel, Microsoft, and others to bring Scala into machine learning and deep learning projects that take advantage of the Spark ecosystem. To be honest, I have always looked up to Databricks as a leading company when it comes to the Spark roadmap. Although Spark is entirely open source and can/will go in any direction, Databricks has done a great job of steering the Spark community toward a unified analytics engine since the inception of the company.

I think having some sort of awesome-spark-scala list on GitHub, where we could see all the ongoing efforts and projects in ML and DL, would be heartwarming and would reassure us that bringing Scala into the data science community is truly part of the plan.

I feel a lot better now that I have heard this from you. It's a long road, but as long as we are going down that road I am perfectly fine. (I am by no means a Scala developer compared to pure Scala developers, but there is something about the language that makes me happy!)

Again, Timothy, thanks for your time, and sorry if I sounded a bit melodramatic.