Closed lduarte1991 closed 9 years ago
@lduarte @jmiranda I did a little investigating of this issue and I believe the parent condition in the search query is causing the poor performance. Here's the line in question:
Looking at the SQL queries in the log when I do a search with a single contextId
parameter (i.e. get all annotations for a course with the given context ID), I'm seeing this pattern:
Example queries from log:
SELECT Count(*) AS y0_
FROM annotation this_
WHERE ( ( this_.deleted = ?
OR this_.deleted IS NULL )
AND ( this_.archived = ?
OR this_.archived IS NULL ) )
AND this_.context_id = ?
SELECT this_.id AS id3_0_,
this_.version AS version3_0_,
this_.archived AS archived3_0_,
this_.collection_id AS collection4_3_0_,
this_.context_id AS context5_3_0_,
this_.date_created AS date6_3_0_,
this_.d eleted as deleted3_0_,
this_.json AS json3_0_,
this_.last_updated AS last9_3_0_,
this_.media AS media3_0_,
this_.owner_id AS owner11_3_0_,
this_.parent_id AS parent12_3_0_,
this_.quote AS quote3_0_,
this_.source AS so urce3_0_,
this_.text AS text3_0_,
this_.uri AS uri3_0_,
this_.userid AS userid3_0_,
this_.username AS username3_0_
FROM annotation this_
WHERE ((
this_.deleted=?
OR this_.deleted IS NULL)
AND (
this_.archived=?
OR this_ .archived IS NULL))
AND this_.context_id=?
ORDER BY this_.date_created DESC
SELECT this_.id AS id3_0_,
this_.version AS version3_0_,
this_.archived AS archived3_0_,
this_.collection_id AS collection4_3_0_,
this_.context_id AS context5_3_0_,
this_.date_created AS date6_3_0_,
this_.d eleted as deleted3_0_,
this_.json AS json3_0_,
this_.last_updated AS last9_3_0_,
this_.media AS media3_0_,
this_.owner_id AS owner11_3_0_,
this_.parent_id AS parent12_3_0_,
this_.quote AS quote3_0_,
this_.source AS so urce3_0_,
this_.text AS text3_0_,
this_.uri AS uri3_0_,
this_.userid AS userid3_0_,
this_.username AS username3_0_
FROM annotation this_
WHERE ((
this_.deleted=?
OR this_.deleted IS NULL)
AND (
this_.archived=?
OR this_ .archived IS NULL))
AND this_.parent_id=?
ORDER BY this_.date_created DESC
So it's basically an N+1 query problem, which would explain the pattern of response times that Luis observed. I'm not familiar enough with grails/groovy to make the fix myself and submit it otherwise I would. Hope that helps.
Thanks Artie! Nice work. The queries themselves shouldn't be too bad, but hydrating (loading) 100s of annotations into memory might cause the slowness.
I think you nailed it with that particular line as the code is actually loading the entire parent annotation from the database when it should just be loading a reference (proxy) to the parent annotation.
Replacing those lines:
if (params.parentid) {
def parentAnnotation = Annotation.get(params.parentid)
parent == parentAnnotation
}
with the following lines:
if (params.parentid) {
parent == Annotation.load(params.parentid)
}
should improve the performance considerably.
NOTE: I removed the assignment of parentAnnotation
because it was unnecessary.
Should be an easy fix.
I finally had a chance to test this out and it appears that @arthurian was correct. The load time was improved dramatically by making the change I suggested in my previous comment. I'll commit shortly and build a new WAR file. Deploying to the test server might take a little longer, but definitely this week.
Thanks again for the debugging @arthurian.
Should be fixed in release 0.5.9 (https://github.com/annotationsatharvard/catcha/releases/tag/v0.5.9).
@lduarte1991 - I decided to deploy to the test server tonight since I had a few extra minutes so please test the search API and let me know if the load times are better.
@jmiranda @arthurian I just ran the same test when I made the issue and there seems to be a 33% decrease in response time which is pretty nice :D Thanks Artie and Justin!
@jmiranda Been working with this for about a week now. It is definitely faster, though obviously it still takes upwards of 5-6 seconds once we are reaching around 1000 annotations or so. A bit worried about how far this will actually scale.
In the meantime though, it is doing a lot better so feel free to push to production at any point.
Thanks for the feedback. I'll open a new ticket and try to figure out ways to tweak performance a bit more. My guess is that pulling 1000s of large JSON objects from the database is causing the problem. Not sure if there's much to do there except to switch to an object store as I've always wanted to do. But I'll look to see if I forgot an index or figure out a better way to cache the objects.
I'm just getting back to work after the Thanksgiving holiday and will plan to deploy at some point this week. Let me know if there are any classes going on that would keep me from deploying it at certain times.
There's a huge difference in response time when requesting larger limits and I'm unclear why that is. I made the following request via curl (https://catch.harvardx.harvard.edu/catch/annotator/search?limit=500&offset=0&uri=https%3A%2F%2Fcourses.edx.org%2Fcourses%2FVJx%2FVJx%2F3T2014%2Fcourseware%2Fa3b18702fd42405683755a8acf297f68%2F0df13b2c7af540bf988274e78c19277c%2F3&media=image&userid&username&tag&text) and tried to change the limit from 50 up to 500. I did it 5 times per limit and the average is below:
Limit -> response time 50 = 0.88s 200 = 1.68s 300 = 2.25s 350 = 2.33s 400 = 2.76s 500 = 3.03s