SNICScienceCloud / LDSA-Spark

A collection of Apache Spark notebooks for the LDSA course
Apache License 2.0

Solution with calculations not distributed to workers #3

Open timkulich opened 7 years ago

timkulich commented 7 years ago

Hi,

I have the following solutions to Task 1 and Task 2 respectively:

```scala
{
  // Loads dna.txt into a string and counts the letters
  val source = scala.io.Source.fromFile("data/dna.txt")
  val lines = try source.mkString finally source.close()

  var gandc = lines.count(c => c == 'g')
  gandc += lines.count(c => c == 'c')

  val total = lines.length()

  val result = gandc.toFloat / total.toFloat

  println(result + " RESULT")
}
```

and

```scala
{
  // for-loop over 1000 points that calculates the area of width = 1/1000
  // and height f(x) for each iteration
  import math.sin
  import math.cos
  val points: Double = 1000
  var result: Double = 0
  for (i <- 1 to points.toInt) {
    result += (1 / points) * (1 + sin(i.toDouble / points)) / cos(i.toDouble / points)
    println(result)
  }
}
```

They both seem to produce the right result. However, I'm assuming that I'm not following the condition: "Warning: all of the tasks must be solved using the Spark RDD API, in order to distribute the computations in the Spark workers."

Am I correct? Do I have to use the parallelize and reduce stuff to distribute it to the workers?

mcapuccini commented 7 years ago

Hi @timkulich, yes, you are correct: the point of the exercise is to use Spark. In a real-world scenario, reading the whole file this way (`scala.io.Source.fromFile("data/dna.txt")`) would cause an out-of-memory error. You need to use the `sc.textFile` primitive from the `SparkContext` to read chunks of the file in parallel. Loading the file in the driver program and parallelizing it afterwards wouldn't help either, since the entire file would still have to fit in the driver's memory.
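For reference, a rough sketch of what the RDD-based versions might look like. This is not the official solution; it assumes a `SparkContext` bound to `sc` (as in the course notebooks) and the same `data/dna.txt` input:

```scala
// Task 1 sketch: sc.textFile reads the file in parallel as an RDD of lines.
// Each worker counts g/c and total characters in its lines; reduce combines them.
val dna = sc.textFile("data/dna.txt")
val gcAndTotal = dna
  .map(line => (line.count(c => c == 'g' || c == 'c'), line.length.toLong))
  .reduce((a, b) => (a._1 + b._1, a._2 + b._2))
val gcFraction = gcAndTotal._1.toDouble / gcAndTotal._2

// Task 2 sketch: distribute the 1000 rectangle areas across workers
// with parallelize, then sum them with reduce.
val points = 1000
val area = sc.parallelize(1 to points)
  .map(i => (1.0 / points) * (1 + math.sin(i.toDouble / points)) / math.cos(i.toDouble / points))
  .reduce(_ + _)
```

The key difference from the serial versions is that both the per-line counting and the per-rectangle evaluation happen inside `map`, so Spark can schedule them on the workers, and only the small combined results travel back through `reduce`.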