Stratio / Spark-MongoDB

Spark library for easy MongoDB access
http://www.stratio.com
Apache License 2.0

Performance testing #19

Closed kanielc closed 9 years ago

kanielc commented 9 years ago

Do you guys use anything specifically to test the performance of the library?

I'm thinking something along the lines of:

  1. Take a large json file (maybe 100 MB)
  2. Write it to the embedded test mongo
  3. Do some operations involving the data (self-join, queries, counts, etc). Enough things to force at least 1 read of the full collection.
  4. Output back to a json file (perhaps a copy of the original json, easy to validate then).

A real-world application with a very deterministic flow would work as well. Scala is notorious for innocuous-looking constructs that are detrimental to performance; with a performance test of sorts, some of these could be identified in the code and fixed.

Just an idea, if something doesn't already exist.
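
For illustration, a minimal sketch of steps 1-4 in Spark/Scala, assuming a spark-shell session where sqlContext and sc are available (as in the examples later in this thread). The host/database/collection values and file paths are placeholders, and the timed helper is hypothetical, not part of spark-mongodb:

import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper to time each stage.
def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
  result
}

val options = Map("host" -> "localhost:27017", "database" -> "bench", "collection" -> "docs")

// 1-2. Load a large JSON file and write it to Mongo.
val input: DataFrame = sqlContext.read.json("/path/to/large.json")
timed("write")(input.write.format("com.stratio.provider.mongodb").mode(SaveMode.Append).options(options).save())

// 3. Operations that force at least one full read of the collection.
val df = sqlContext.read.format("com.stratio.provider.mongodb").options(options).load()
timed("count")(df.count())
timed("self-join")(df.join(df, "id").count()) // "id" is a placeholder join column

// 4. Write back out as JSON so the output can be diffed against the original.
timed("export")(df.write.json("/path/to/roundtrip.json"))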

pmadrigal commented 9 years ago

We don't recommend using the embedded Mongo or a local Mongo for performance testing, because your program would be sharing resources with Mongo.

A good environment would be a cluster (or a single external Mongo), for instance a replica set with three machines. Keep an eye on your network bandwidth as well.

We have tested the performance of the Scala driver against the Java driver, and in some areas the Scala driver is faster.

A real world application is always a good idea.

What do you want to compare this library against?

kanielc commented 9 years ago

I see some performance issues in the code we're working with, and I'm not sure whether they're in the library or elsewhere. It's not terribly important right now, but it may be something I can tackle later.

Right now I'm more blocked by the absence of 'update' support when saving to Mongo. I'm using the Mongo-Hadoop connector for that instead, but I may just bite the bullet and implement it in spark-mongodb.
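
For context, a rough sketch of the Mongo-Hadoop workaround mentioned above, assuming the mongo-hadoop 1.x API (MongoOutputFormat / MongoUpdateWritable) and a spark-shell session; the URI and field names are illustrative, and the exact MongoUpdateWritable constructor may differ between versions:

import com.mongodb.BasicDBObject
import com.mongodb.hadoop.MongoOutputFormat
import com.mongodb.hadoop.io.MongoUpdateWritable
import org.apache.hadoop.conf.Configuration

val outputConfig = new Configuration()
outputConfig.set("mongo.output.uri", "mongodb://localhost:27017/highschool.students")

// One MongoUpdateWritable per record: match on "name", $set the new "age",
// upsert if no document matches, and update a single document (multi = false).
val updates = sc.parallelize(List(("Torcuato", 56))).map { case (name, age) =>
  (null: Object,
    new MongoUpdateWritable(
      new BasicDBObject("name", name),
      new BasicDBObject("$set", new BasicDBObject("age", age)),
      true,  // upsert
      false  // multi
    ))
}

updates.saveAsNewAPIHadoopFile(
  "file:///tmp/unused", // MongoOutputFormat ignores the path, but the API requires one
  classOf[Object],
  classOf[MongoUpdateWritable],
  classOf[MongoOutputFormat[Object, MongoUpdateWritable]],
  outputConfig)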

pmadrigal commented 9 years ago

We implement replaceOne: saving a document whose _idField value matches an existing document replaces that document.

An example of this would be:

import org.apache.spark.sql._

// If your collection already exists, use MongoDB's "_id" field for the
// "_idField" parameter; otherwise, choose whichever field you want.
val options = Map("host" -> "localhost:27017", "database" -> "highschool", "collection" -> "students", "_idField" -> "name")

case class Student(name: String, age: Int)

val dfw: DataFrame = sqlContext.createDataFrame(sc.parallelize(List(Student("Torcuato", 46))))
dfw.write.format("com.stratio.provider.mongodb").mode(SaveMode.Append).options(options).save()
val dfr = sqlContext.read.format("com.stratio.provider.mongodb").options(options).load()
dfr.show()

// Then, to update Torcuato's age to 56, write a document with the same "_idField" value:
val dfw2: DataFrame = sqlContext.createDataFrame(sc.parallelize(List(Student("Torcuato", 56))))
dfw2.write.format("com.stratio.provider.mongodb").mode(SaveMode.Append).options(options).save()
val dfr2 = sqlContext.read.format("com.stratio.provider.mongodb").options(options).load()
dfr2.show()

Is this what you need?

kanielc commented 9 years ago

No, not quite.

replaceOne replaces the entire document; I wanted a way to update specific fields via operators like $addToSet, $setOnInsert, etc.

I think I've found a way to do it (it works locally so far, but I haven't tried it in our code yet): if the field names are all operators (save for the _idField), use updateOne instead of replaceOne.
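
To make those semantics concrete, here is a small sketch using the plain MongoDB Java driver, outside spark-mongodb entirely (the collection and field names are illustrative), of what updateOne with operator documents does that replaceOne does not:

import com.mongodb.MongoClient
import com.mongodb.client.model.UpdateOptions
import org.bson.Document

val client = new MongoClient("localhost", 27017)
val students = client.getDatabase("highschool").getCollection("students")

// Every top-level key in the update document is an operator, so only the named
// fields are touched; the rest of the matched document is left intact.
students.updateOne(
  new Document("name", "Torcuato"), // filter on the "_idField"
  new Document("$addToSet", new Document("courses", "algebra")) // add to an array field
    .append("$setOnInsert", new Document("age", 46)), // applied only if the upsert inserts
  new UpdateOptions().upsert(true))

client.close()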

pmadrigal commented 9 years ago

First we are going to check updateOne, and then think about implementing update.

kanielc commented 9 years ago

OK, great. We can close this issue now; it was just to sound out the performance-testing idea.

pmadrigal commented 9 years ago

Thanks for your feedback! :)