benchflow / data-transformers

Spark scripts utilised to transform data to the BenchFlow internal formats

Analyse PROs and CONs of Using Python VS Scala #10

Open VincenzoFerme opened 8 years ago

VincenzoFerme commented 8 years ago

We should carefully evaluate whether to keep using Python, as we currently do, or move to Scala.

The following considerations should be made:

  1. User-friendliness
  2. Expressiveness of the language (https://www.linkedin.com/pulse/why-i-choose-scala-apache-spark-project-lan-jiang)
  3. Performance on distributed infrastructures (http://emptypipes.org/2015/01/17/python-vs-scala-vs-spark/, https://www.linkedin.com/pulse/why-i-choose-scala-apache-spark-project-lan-jiang)
  4. Data analysis support (and its performance on distributed infrastructures) (https://www.linkedin.com/pulse/why-i-choose-scala-apache-spark-project-lan-jiang)

Analyse PROs and CONs in a structured way in this issue.

Cerfoglg commented 8 years ago

Let's look at each of those considerations:

  1. User friendliness: from my perspective, Python is a far more user friendly language in itself. It is less verbose than most languages, and it makes it easy to quickly write a script to submit to Spark. However, the ease of use of a language is ultimately subjective, so whether someone chooses Python or Scala comes down to personal preference in this regard. I personally like Python a lot, I realise that others may not agree, and, most importantly, I understand that there are far more important points to consider.
  2. Expressiveness of the language: ultimately, both Python and Scala are equally expressive for what we need to do with Spark. What can be done with Scala can also be done with Python, and in either case you are still creating a Spark context and calling functions on it (count, map, reduce, join, ...); see the sketch after this list.
  3. Performance: Spark was written in Scala, so it goes without saying that Scala will have an advantage over Python in most cases. Where Python, or at least PyPy, gives equal performance, as seen in the link you provided, is when we are working with a large number of cores: at that point the language becomes largely irrelevant, because the time required is bounded by other, language-independent factors. This is something to consider, because if we are working with many cores then performance becomes less of a concern when choosing the language, and other factors may matter more. Still, Scala has the advantage, and the available evidence supports this.
  4. Data analysis support: it's true that Python has an edge thanks to some great data analysis libraries like Pandas, but it's also true that those libraries were designed to work on single machines rather than distributed systems. That's not to say they can't come in handy and be of great use where Scala libraries may not perform as well. It's hard to know for sure until we actually start working with the analysers and get a good idea of what we have to do in our Spark scripts to compute the needed metrics.
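
To make point 2 concrete, here is a minimal PySpark sketch of the kind of pipeline we are talking about; the input paths and record layout are illustrative assumptions, not actual BenchFlow data:

```python
# Minimal PySpark sketch: the same building blocks (map, join, count, ...)
# are available in both Python and Scala. Paths and formats are made up.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("expressiveness-example")
sc = SparkContext(conf=conf)

# Two hypothetical inputs keyed by a trial id in the first CSV column.
trials = sc.textFile("hdfs:///benchflow/trials.csv") \
           .map(lambda line: (line.split(",")[0], line.split(",")[1]))
metrics = sc.textFile("hdfs:///benchflow/metrics.csv") \
            .map(lambda line: (line.split(",")[0], float(line.split(",")[1])))

# Join the two datasets on the trial id and count the matches.
joined = trials.join(metrics)
print(joined.count())

sc.stop()
```

The equivalent Scala version calls the same RDD API, which is why expressiveness is mostly a wash for this kind of transformation.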

One other thing that I think is worth noting is that we don't necessarily have to choose between only Python and only Scala. Since we have a microservice for sending scripts to Spark, it's possible to extend it to handle both Scala and Python scripts and choose which language to use depending on the situation; a rough sketch of such dispatching follows below.
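
Purely as an illustration of that idea (the function name, defaults, and paths are hypothetical, not the actual service API), the dispatch could boil down to choosing the right spark-submit invocation based on the artifact type:

```python
# Hypothetical dispatcher: submit either a PySpark script or a packaged
# Scala jar through spark-submit. Names and defaults are assumptions.
import subprocess

def submit_to_spark(artifact_path, master="spark://master:7077", main_class=None):
    cmd = ["spark-submit", "--master", master]
    if artifact_path.endswith(".py"):
        cmd.append(artifact_path)                       # Python script
    elif artifact_path.endswith(".jar"):
        if main_class is None:
            raise ValueError("A main class is required for Scala jars")
        cmd += ["--class", main_class, artifact_path]   # compiled Scala job
    else:
        raise ValueError("Unsupported artifact type: %s" % artifact_path)
    return subprocess.call(cmd)
```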

VincenzoFerme commented 8 years ago

@Cerfoglg thank you for the nice overview!

My considerations:

  1. I agree, which language one considers more user friendly is very much a personal matter.
  2. Ok
  3. We work on multi-core infrastructures, so let's proceed with #9. I had a deeper look at the numbers reported in the link I provided, and for the data-transformers PyPy seems a good choice: PyPy is slightly better than Scala in absolute performance, its slowest task is sortByKey, which we usually don't need because we rarely sort the data, and for the tasks we mostly execute PyPy is much faster than Scala (e.g., reduceByKey, used to "count the number of occurrences of a key"; see the sketch below).
  4. Right, I agree.
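
For reference, here is a tiny PySpark sketch of the reduceByKey pattern mentioned in point 3, i.e., counting the number of occurrences of each key (the sample data is made up):

```python
# Count occurrences per key with reduceByKey; the input data is illustrative.
from pyspark import SparkContext

sc = SparkContext(appName="count-by-key-example")

keys = sc.parallelize(["a", "b", "a", "c", "b", "a"])
counts = keys.map(lambda k: (k, 1)).reduceByKey(lambda x, y: x + y)

print(counts.collect())  # e.g. [('a', 3), ('b', 2), ('c', 1)]

sc.stop()
```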

Good suggestion, let's keep the issue open and proceed with Python and the PyPy environment for now. We can think about using Scala for the analysers, if it makes sense (and it seems it does, according to what is written in the following link: https://www.linkedin.com/pulse/why-i-choose-scala-apache-spark-project-lan-jiang).