kanielc closed this issue 9 years ago
We don't recommend using an embedded Mongo or a local Mongo to test performance, because your program would be sharing resources with Mongo.
A good environment would be a cluster (or a single external Mongo), for instance a replica set with three machines. Keep an eye on your bandwidth as well.
We tested scala-driver and java-driver performance, and in some parts Scala is faster.
Testing with a real-world application is always a good idea.
What do you want to compare this library with?
I see some performance issues in the code we're working with, and I'm not sure if it's in the library or elsewhere. It's not terribly important right now, but it may be something I can tackle later.
Right now I'm more blocked by the absence of 'update' for saving to Mongo. I'm instead using the Mongo-Hadoop connector for that, but I may just bite the bullet and implement it in spark-mongodb.
We implement replaceOne.
An example of this would be:
import org.apache.spark.sql.SQLContext._
import org.apache.spark.sql._
// If your collection already exists, use MongoDB's _id field for the "_idField" parameter; otherwise choose whichever field you want.
val options = Map("host" -> "localhost:27017", "database" -> "highschool", "collection" -> "students", "_idField" -> "name")
case class Student(name: String, age: Int)
val dfw: DataFrame = sqlContext.createDataFrame(sc.parallelize(List(Student("Torcuato", 46))))
dfw.write.format("com.stratio.provider.mongodb").mode(SaveMode.Append).options(options).save()
val dfr = sqlContext.read.format("com.stratio.provider.mongodb").options(options).load
dfr.show
// Then we update Torcuato's age to 56:
val dfw: DataFrame = sqlContext.createDataFrame(sc.parallelize(List(Student("Torcuato", 56))))
dfw.write.format("com.stratio.provider.mongodb").mode(SaveMode.Append).options(options).save()
val dfr = sqlContext.read.format("com.stratio.provider.mongodb").options(options).load
dfr.show
Is this what you need?
No, not quite.
replaceOne replaces the entire document; I wanted a way to update specific fields via operators like $addToSet, $setOnInsert, etc.
I think I've found a way to do it (only locally so far; I haven't tried it in our code yet). If the field names are all operators (save for the _idField), then use updateOne instead of replaceOne.
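To make the rule concrete, here is a minimal sketch of just the decision step, with no Spark or Mongo dependencies. The names `WriteModeSketch`, `WriteKind`, `ReplaceOne`, `UpdateOne` and `chooseWrite` are made up for illustration and are not part of spark-mongodb or the MongoDB driver; treating any field name that starts with `$` as an operator is also an assumption.

```scala
// Sketch only: every name here is hypothetical, not spark-mongodb or driver API.
object WriteModeSketch {
  sealed trait WriteKind
  case object ReplaceOne extends WriteKind
  case object UpdateOne extends WriteKind

  /** Picks the write operation for one row, given its field names. */
  def chooseWrite(fieldNames: Seq[String], idField: String): WriteKind = {
    // Everything except the id field is the payload of the write.
    val payload = fieldNames.filterNot(_ == idField)
    // Assumption: a field name starting with "$" is an update operator.
    // With no payload fields there is nothing to update, so fall back to replace.
    if (payload.nonEmpty && payload.forall(_.startsWith("$"))) UpdateOne
    else ReplaceOne
  }
}
```

With this, `WriteModeSketch.chooseWrite(Seq("name", "$addToSet", "$setOnInsert"), "name")` picks `UpdateOne`, while `Seq("name", "age")` picks `ReplaceOne`. A row that mixes operators and plain fields also falls through to `ReplaceOne` here; a real implementation would probably want to reject that case instead.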
First we are going to check updateOne, and then think about implementing update.
OK great. We can close this issue now. It was just to sound out the performance idea.
Thanks for your feedback! :)
Do you guys use anything specifically to test the performance of the library?
I'm thinking something along the lines of:
A real world application with a very deterministic flow would work as well. Scala is notorious for having fairly innocuous things that are detrimental to performance. With a performance test of sorts, some of these could be identified in the code and fixed.
Just an idea if something doesn't already exist.
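For the library's own code paths (as opposed to end-to-end runs against a cluster), a small timing harness might be enough to start with. This is only a sketch under my own assumptions: `PerfSketch`, `measure` and `median` are made-up names, and the workload in `main` is a stand-in, not anything taken from spark-mongodb. For numbers you intend to rely on, a dedicated tool such as JMH is the safer choice.

```scala
// Sketch only: a hand-rolled timing loop, not a substitute for a real benchmark tool.
object PerfSketch {
  /** Runs `body` `warmups` times untimed, then `runs` times timed.
    * Returns the elapsed nanoseconds of each timed run. */
  def measure[A](warmups: Int, runs: Int)(body: => A): Seq[Long] = {
    // Warm-up runs give the JIT a chance to compile the hot path first.
    (1 to warmups).foreach(_ => body)
    (1 to runs).map { _ =>
      val start = System.nanoTime()
      body
      System.nanoTime() - start
    }
  }

  /** Median of the timings (the upper middle element for even sizes). */
  def median(xs: Seq[Long]): Long = {
    require(xs.nonEmpty, "need at least one timing")
    val sorted = xs.sorted
    sorted(sorted.size / 2)
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workload; replace with the code path under test.
    val data = (1 to 1000000).toArray
    val timings = measure(warmups = 5, runs = 20) { data.map(_ * 2L).sum }
    println(s"median: ${median(timings) / 1e6} ms over ${timings.size} runs")
  }
}
```

Reporting the median rather than the mean keeps a single GC pause or scheduler hiccup from dominating the result.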