GMOD / Apollo

Genome annotation editor with a Java server backend and a JavaScript client that runs in a web browser as a JBrowse plugin.
http://genomearchitect.readthedocs.io/

(2d) stress test / bootstrap large data #137

Closed — nathandunn closed this 9 years ago

nathandunn commented 9 years ago

Moved to #507

- [ ] pre-fill database with annotations to observe differences
- [ ] do a network latency test (off of icebox) to observe differences

NAL / i5K can have:

- 100 organisms per server (not sure what that limit is)
- 5K annotations per organism
- up to 100K sequences per organism
- so maybe 1 million annotations per server

Typical numbers:

Potential important optimizations for write / delete:

selewis commented 9 years ago

I still am strongly against relational. It just isn't suited to genomic feature data. I think it will waste time going in this direction.


nathandunn commented 9 years ago

Can you provide some examples / reasoning? I think there are many examples of this working, but I might be missing something. I’m not wedded to one or the other; I just want good reasoning either way. Going non-relational won’t necessarily give us a speed-up if done for the wrong reasons.

Nathan


selewis commented 9 years ago

When you have a chance to Skype, or when you're back (no rush).


nathandunn commented 9 years ago

Let’s chat when I’m down next. I’ve got quite a bit of work to do before we have to make that decision. Most of the Grails code is pretty database-neutral thus far.

Nathan


cmdcolin commented 9 years ago

I was going to propose bulk importing an existing GFF with known protein sequences and then checking the output of our fasta export against the known protein sequences.

cmdcolin commented 9 years ago

I started doing the bulk import, but the memory gets out of hand. Need to see what's going on.

nathandunn commented 9 years ago

Try attaching YourKit to the running instance. How are you doing the imports?

cmdcolin commented 9 years ago

I was using the add_transcripts_from_gff3_to_annotations.pl script.

I think I recall another user had similar memory issues with this script, but I wanted to investigate it further if we are to continue supporting it.

Anyway, I got YourKit initialized, so I'll see what I can do.

nathandunn commented 9 years ago

So, is the issue the memory, though? Hibernate will use all available memory for its cache, but it is typically pretty smart about it. I would be more willing to bet that the I/O is the problem. If you look into the SQL section (especially when running in Tomcat) you can see the number of queries and the average query time for each. Very curious where inserts come into this.

cmdcolin commented 9 years ago

It has been a little tricky to figure out exactly what is taking up time, and indeed sometimes the garbage collector is quite good. YourKit is also very good, though, at tracing these garbage collections.

Maybe this is random, but I think one operation that gets pretty heavy is the annotator panel. It takes a very long time to find all annotations on all sequences and starts up a lot of different queries.

It ends up being pretty CPU intensive and doesn't really finish because of the large number of scaffolds that we have. It would be great if it were optimized.

[screenshot: activity log]

nathandunn commented 9 years ago

See #389. I had already done some of this work on the sequence panel page.

nathandunn commented 9 years ago

@cmdcolin Let me know what your test script was when you have a moment. I can also take a look at this.

nathandunn commented 9 years ago

@cmdcolin Not that this is an actual performance bottleneck, but I saw this:

select this_.id as id1_630, this_.version as version2_630, this_.chunk_number as chunk_nu3_630, this_.residue as residue4_630, this_.sequence_id as sequence5_630 from sequence_chunk this_ where this_.sequence_id=? and this_.chunk_number=? limit ? 82 0 90

That is 82 ms total time (I was updating an exon boundary), an average of 0 ms each, and 90 total invocations. Of course, this could be quicker than going to the filesystem as well.

nathandunn commented 9 years ago

Using JMeter for evaluation / script development.

cmdcolin commented 9 years ago

This is indeed called under many different circumstances. It boils down to getResiduesFromSequence doing that query.

I thought that I could optimize that with the filesystem, and that is actually available on this branch: https://github.com/GMOD/Apollo/tree/seq_chunk_remove

However, we really just need to better optimize the use of that function.

Here's an example of poor use: when we add a transcript, NonCanonicalSplitSiteService calls getResiduesWithAlterations, which in turn calls getResiduesFromSequence repeatedly for each tiny two-base splice site. It should ideally fetch the whole sequence region once and check for the splice sites in that larger sequence.
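Roughly something like the untested sketch below: fetch the residues for the whole transcript region once (one getResiduesFromSequence call), then check each splice site in memory. The method and its coordinate handling are hypothetical stand-ins, not the actual NonCanonicalSplitSiteService code; it assumes interbase coordinates, ignores strand, and skips the first/last-exon edge cases.

```java
// Hedged sketch: check canonical GT/AG splice sites against a region fetched once,
// instead of issuing a tiny residue query per splice site.
import java.util.ArrayList;
import java.util.List;

class SpliceSiteCheckSketch {

    /** exonCoords holds {fmin, fmax} pairs that all fall inside the fetched region. */
    static List<Integer> findNonCanonicalSites(String regionResidues, int regionFmin,
                                               List<int[]> exonCoords) {
        List<Integer> nonCanonical = new ArrayList<>();
        for (int[] exon : exonCoords) {
            int donorOffset = exon[1] - regionFmin;        // first two intron bases after the exon
            int acceptorOffset = exon[0] - regionFmin - 2; // last two intron bases before the exon
            if (donorOffset + 2 <= regionResidues.length()
                    && !regionResidues.regionMatches(donorOffset, "GT", 0, 2)) {
                nonCanonical.add(exon[1]);
            }
            if (acceptorOffset >= 0
                    && !regionResidues.regionMatches(acceptorOffset, "AG", 0, 2)) {
                nonCanonical.add(exon[0]);
            }
        }
        return nonCanonical;
    }
}
```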

nathandunn commented 9 years ago

I think there are probably quite a few things like that (see my fixes for FlankingRegion). We have the backing regression tests, so it wouldn’t hurt to try to fix these.

That being said, I think we need a good baseline for performance. The most expensive operation is going to continue to be transcript addition. Trying to set up a WA1 -> WA2 migration while running YourKit is likely the best way to do this.

Nathan


nathandunn commented 9 years ago

@cmdcolin FYI, I opened a new issue, #452. Feel free to look at this one or the other. I added your note and hope to catch a few other things of that type here. Also of note, we don't really take advantage of queries with fmin / fmax; instead we iterate through objects in order to check them.
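As an illustration (not Apollo's actual query layer), an overlap lookup pushed into SQL might look like the sketch below. It uses plain JDBC and the feature_location column names that show up in the profiler output later in this thread; the connection handling is assumed.

```java
// Hedged sketch: push the fmin/fmax overlap test into SQL instead of iterating over
// every feature object in memory.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

class OverlapQuerySketch {

    static List<Long> featureIdsOverlapping(Connection conn, long sequenceId,
                                            int fmin, int fmax) throws SQLException {
        String sql = "select feature_id from feature_location "
                   + "where sequence_id = ? and fmin < ? and fmax > ?";
        List<Long> ids = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, sequenceId);
            ps.setInt(2, fmax);   // overlap test: the location starts before the requested end...
            ps.setInt(3, fmin);   // ...and ends after the requested start
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getLong(1));
                }
            }
        }
        return ids;
    }
}
```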

nathandunn commented 9 years ago

@childers @mpoelchau . . . Sorry, I was up at 3:30 am; hopefully this will make more sense:

I wonder if it would be a good time to use our migration scripts on the i5K servers to evaluate pulling WA1 data into WA2 on a large scale. This has to be done anyway. We should be able to get a better feel for possible performance bottlenecks and server sizes.

cmdcolin commented 9 years ago

can you restate that please :)

nathandunn commented 9 years ago

@cmdcolin Thanks. Should be fixed. @childers @mpoelchau . . . maybe we can chat about it next week.

childers commented 9 years ago

@nathandunn Sure, let's talk more about it next week. The data and user migration is certainly something that has been on our minds over here.

nathandunn commented 9 years ago

Sounds good. I may or may not be available that week. We’ll see. If not, the following week then.

That being said, we do have scripts available for migrating Apollo 1 annotations to 2, as well as users and organisms. For the last two, you’ll need to generate the data out of your databases / instances.

Nathan


nathandunn commented 9 years ago

To test, will use add_transcripts_from_gff3_to_annotations.pl to load a ref seq onto the organism.

But I need to get only a few ref sequences first.

@cmdcolin reports really high CPU spiking not only for the load (which I might expect), but also for simply viewing them, possibly implicating #389 and exposing other inefficiencies, which is good.

nathandunn commented 9 years ago

Grepped groups 1.1 and 1.10 from this.

./add_transcripts_from_gff3_to_annotations.pl --url http://localhost:8080/apollo -u ndunn@me.com -p demo -i amel_OGSv3.2_small.gff3

nathandunn commented 9 years ago

@cmdcolin Just a note. I think you want to set up the agent directly on the Java binary. top is maxing out (as you experienced), but YourKit is showing a pretty low percentage.

nathandunn commented 9 years ago

@cmdcolin So, if we are bulk loading databases, it probably makes sense to set up the database for bulk loading (e.g., removing indexes):

http://www.postgresql.org/docs/9.2/static/populate.html

Of course, this isn't the get_features problem.
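For example, the usual pattern from that Postgres page, sketched with plain JDBC: drop secondary indexes, insert in large batches inside one transaction, then recreate the indexes. The index name and the reduced column list below are hypothetical, not the real schema.

```java
// Hedged sketch of the bulk-load setup recommended in the Postgres populate docs.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

class BulkLoadSketch {

    static void bulkInsertFeatures(Connection conn, List<String> uniqueNames) throws SQLException {
        conn.setAutoCommit(false);                        // one big transaction, per the docs
        try (Statement ddl = conn.createStatement()) {
            ddl.execute("drop index if exists feature_unique_name_idx");  // hypothetical index
        }
        String sql = "insert into feature (unique_name) values (?)";      // simplified columns
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int count = 0;
            for (String name : uniqueNames) {
                ps.setString(1, name);
                ps.addBatch();
                if (++count % 1000 == 0) {
                    ps.executeBatch();                    // flush every 1000 rows
                }
            }
            ps.executeBatch();                            // flush the final partial batch
        }
        try (Statement ddl = conn.createStatement()) {
            ddl.execute("create index feature_unique_name_idx on feature (unique_name)");
        }
        conn.commit();
    }
}
```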

As a timing:

./add_transcripts_from_gff3_to_annotations.pl --url -u ndunn@me.com -p demo 0.20s user 0.04s system 0% cpu 45.937 total

To add just group 1.10 (143 genes) it takes 45 seconds.

The bigger concern is that it takes 4 seconds to return all of these. It should be under 1 second (like the SequencePanel). Part of #389 is to fix that.

nathandunn commented 9 years ago

Either way, we'll need to explore doing bulk loads. 5K annotations is a bit rough.

cmdcolin commented 9 years ago

> The bigger concern is that it takes 4 seconds to return all of these.

That is my concern as well; the browsing should feel snappy.

cmdcolin commented 9 years ago

Will try to address some issues in #460, or some cherry-picked version of it.

nathandunn commented 9 years ago

I'm also wondering if that script is not releasing memory, either.

nathandunn commented 9 years ago

I know Hibernate maxes out, but it doesn't seem to scale linearly, which indicates an issue with memory, or a VERY large transaction.
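If it is one very large Hibernate session/transaction, the standard mitigation is to flush and clear the session periodically during the bulk insert so the first-level cache doesn't hold every persisted object. A rough, untested sketch of that pattern (the batch size and the way entities arrive are placeholders):

```java
// Hedged sketch: periodic flush/clear keeps the Hibernate session (first-level cache)
// from accumulating every persisted object during a long bulk load.
import java.util.List;
import org.hibernate.Session;
import org.hibernate.Transaction;

class BulkPersistSketch {

    static void saveInBatches(Session session, List<Object> annotations) {
        Transaction tx = session.beginTransaction();
        int count = 0;
        for (Object annotation : annotations) {
            session.save(annotation);
            if (++count % 50 == 0) {       // batch size is a tuning knob
                session.flush();           // push pending inserts to the database
                session.clear();           // evict persisted objects from the session cache
            }
        }
        tx.commit();
    }
}
```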

nathandunn commented 9 years ago

Going to have to go with not linear:

./add_transcripts_from_gff3_to_annotations.pl --url -u ndunn@me.com -p demo 0.44s user 0.03s system 0% cpu 6:12.35 total

7× more lines - 412 genes vs 143 (maybe it's slow with exons . . . ??).

Note that get_features takes only 425 ms (to display the genomic data), whereas loading annotations takes 34 seconds. Most of that is data that gets transferred but never shown, which is an obvious reason to have the database. May just need to roll my own pagination scheme. Shouldn't be too bad.
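A hand-rolled pagination scheme could be as simple as a LIMIT/OFFSET query so only the page being shown is transferred. A sketch with plain JDBC, using columns that exist on the feature table; the sort column and page size are illustrative, not a finished API.

```java
// Hedged sketch: page through annotations with LIMIT/OFFSET so the client only
// transfers the features it is about to show, instead of every annotation at once.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

class AnnotationPageSketch {

    static List<String> fetchPage(Connection conn, int page, int pageSize) throws SQLException {
        String sql = "select unique_name from feature order by name limit ? offset ?";
        List<String> names = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setInt(1, pageSize);
            ps.setInt(2, page * pageSize);   // page is zero-based here
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    names.add(rs.getString(1));
                }
            }
        }
        return names;
    }
}
```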

cmdcolin commented 9 years ago

Yeah, it's probably more linear with the number of child features, so not exactly exponential, but ideally we can find a solution for the "n+1 query problem" that I see mentioned a lot and minimize the number of queries instead.
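For reference, the n+1 pattern here is one query for the parent features followed by one feature_location query per feature; batching the children into a single IN-list query avoids it. A hedged sketch, again with plain JDBC and the feature_location columns from the profiler output:

```java
// Hedged sketch: replace per-feature child lookups (the n+1 pattern) with one query
// that fetches all feature_location rows for a batch of feature ids.
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class ChildBatchFetchSketch {

    static Map<Long, int[]> locationsByFeature(Connection conn, List<Long> featureIds)
            throws SQLException {
        Map<Long, int[]> locations = new HashMap<>();
        if (featureIds.isEmpty()) {
            return locations;                 // nothing to fetch, and "in ()" is invalid SQL
        }
        // Build "?, ?, ?" for the IN list, one placeholder per id.
        String placeholders = featureIds.stream().map(id -> "?").collect(Collectors.joining(", "));
        String sql = "select feature_id, fmin, fmax from feature_location "
                   + "where feature_id in (" + placeholders + ")";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (int i = 0; i < featureIds.size(); i++) {
                ps.setLong(i + 1, featureIds.get(i));
            }
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    locations.put(rs.getLong(1), new int[] { rs.getInt(2), rs.getInt(3) });
                }
            }
        }
        return locations;
    }
}
```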

nathandunn commented 9 years ago

From talk with @childers during dev meeting:

The stress test should run for over 10 hours, with 5-10 users per organism, across multiple organisms, adding and removing genes.

childers commented 9 years ago

@nathandunn: @jwlin built our stress test script so that those options were configurable parameters, so we could vary the number of users, the number of instances, and the time / number of operations. We shared that with you, right?

nathandunn commented 9 years ago

Yep .. we have it here: https://github.com/GMOD/Apollo/tree/master/docs/apollo-stress-template

I should be able to get it to work with only minor changes.

nathandunn commented 9 years ago

@deepakunni3 Here you go. @jwlin wrote the original JMX / JMeter scripts and docs, and I am sure he would be happy to help if that route makes the most sense.

deepakunni3 commented 9 years ago

@nathandunn I updated the JMX script for Apollo 2.0 and tried running it with the following thread properties:

Number of Threads (users): 3
Ramp-Up Period (in seconds): 5
Loop Count: 8

A couple of issues were observed:

  1. StaleObjectStateException while processing the addTranscript request
  2. StaleObjectStateException while processing the deleteFeature request
  3. deadlock detected (when the number of users is increased to 10)
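StaleObjectStateException is Hibernate's optimistic-locking failure (the row's version changed between read and write), so when two simulated users hit the same feature, one write loses. One common mitigation is to retry the conflicting operation a few times with a short backoff; the sketch below is a hypothetical illustration of that idea, not existing Apollo code, and the Callable stands in for the real request handler.

```java
// Hedged sketch: retry an operation that can fail with Hibernate's optimistic-locking
// exception, backing off briefly so the competing transaction can finish first.
import java.util.concurrent.Callable;
import org.hibernate.StaleObjectStateException;

class OptimisticRetrySketch {

    static <T> T withRetry(Callable<T> operation, int maxAttempts) throws Exception {
        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (StaleObjectStateException e) {
                if (attempt >= maxAttempts) {
                    throw e;                 // give up after the last attempt
                }
                Thread.sleep(50L * attempt); // simple linear backoff
            }
        }
    }
}
```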
nathandunn commented 9 years ago

@deepakunni3 Thanks. Do you have the updated script (or a template for the script) checked in? Also, is this reading / writing to the same gene / isoform? I'm more concerned with these exceptions occurring when users are working on different genes, which is a more normal workflow . . . not that these errors shouldn't be attended to.

nathandunn commented 9 years ago

@deepakunni3 So, I guess what I am saying is: can we run the test with 10 users working on 10 different genes (and preferably on different organisms)? Just 10 users and 10 genes would be helpful. I think that the 10 organisms would not matter (knock on wood). They could all be on the same sequence.

deepakunni3 commented 9 years ago

@nathandunn The updated script is on master branch.

Yes, you are right. Multiple users working on the same transcript is rare. I can definitely improve the script to test with different users working on different genes.

nathandunn commented 9 years ago

Yeah, if you could do that separation that would be great. I know that Moni said that is rare, but also I think we might end up auto-locking these at some point.

Nathan


nathandunn commented 9 years ago

Per-method profiling summary:

RequestHandlingService  addTranscript   30.93%  280 320 0.88    2.07    0.24    0.36
FeatureService  generateTranscript  26.24%  238 320 0.74    1.83    0.17    0.32
FeatureService  convertFeatureToJSON    13.28%  120 13865   0.01    0.06    0.00    0.01
RequestHandlingService  deleteFeature   12% 109 320 0.34    8.15    0.04    0.50
NonCanonicalSplitSiteService    findNonCanonicalAcceptorDonorSpliceSites    8.37%   76  595 0.13    0.47    0.01    0.09
FeatureService  convertJSONToFeature    5.41%   49  16136   0.00    0.03    0.00    0.01
FeatureService  calculateCDS    1.65%   15  275 0.05    0.24    0.02    0.03
FeatureService  setLongestORF   1.63%   15  275 0.05    0.20    0.02    0.03
RequestHandlingService  getFeatures 0.22%   2   4   0.51    1.49    0.06    0.67
FeatureService  addFeature  0.14%   1   57  0.02    0.07    0.00    0.03
FeatureService  convertJSONToFeatureLocation    0.04%   0   16456   0.00    0.00    0.00    0.00
FeatureService  convertFeatureLocationToJSON    0.04%   0   13865   0.00    0.00    0.00    0.00
FeatureService  getTopLevelFeature  0.03%   0   135 0.00    0.01    0.00    0.00
AnnotationEditorController  getUserPermission   0%  0   4   0.01    0.02    0.00    0.01
RequestHandlingService  createJSONFeatureContainer  0%  0   960 0.00    0.00    0.00    0.00
nathandunn commented 9 years ago

I think that these error rates are far too high.

[screenshot: 2015-07-22 5:21 pm]

nathandunn commented 9 years ago

Reran for about an hour . . . and now I am not getting any errors:

[screenshot: 2015-07-22 6:37 pm]

Though the server is lagging badly to keep up, and the database is floored:

[screenshot: 2015-07-22 6:38 pm]

nathandunn commented 9 years ago

Clearly, over time, the throughput of adds / deletes fired off to the database is getting worse.

[screenshot: 2015-07-22 7:04 pm]

nathandunn commented 9 years ago

Last snapshot:

[screenshot: 2015-07-22 7:34 pm]

nathandunn commented 9 years ago

The YourKit snapshot of the SQL is as follows (each query ends with total ms, average ms, and invocation count):

[screenshot: 2015-07-22 8:36 pm]

select feature0_.id as id1_19_0_, feature0_.version as version2_19_0_, feature0_.date_created as date_cre3_19_0_, feature0_.dbxref_id as dbxref_i4_19_0_, feature0_.description as descript5_19_0_, feature0_.is_analysis as is_analy6_19_0_, feature0_.is_obsolete as is_obsol7_19_0_, feature0_.last_updated as last_upd8_19_0_, feature0_.md5checksum as md9_19_0_, feature0_.name as name10_19_0_, feature0_.sequence_length as sequenc11_19_0_, feature0_.status_id as status_12_19_0_, feature0_.symbol as symbol13_19_0_, feature0_.unique_name as unique_14_19_0_, feature0_.alteration_residue as alterat16_19_0_, feature0_.deletion_length as deletio17_19_0_, feature0_.analysis_feature_id as analysi18_19_0_, feature0_.alternate_cv_term as alterna19_19_0_, feature0_.class_name as class_n20_19_0_, feature0_.custom_alternate_cv_term as custom_21_19_0_, feature0_.custom_class_name as custom_22_19_0_, feature0_.custom_cv_term as custom_23_19_0_, feature0_.custom_ontology_id as custom_24_19_0_, feature0_.cv_term as cv_term25_19_0_, feature0_.meta_data as meta_da26_19_0_, feature0_.ontology_id as ontolog27_19_0_, feature0_.class as class15_19_0_ from feature feature0_ where feature0_.id=? 5149 0 6864
select featureloc0_.feature_id as feature_3_19_0_, featureloc0_.id as id1_25_0_, featureloc0_.id as id1_25_1_, featureloc0_.version as version2_25_1_, featureloc0_.feature_id as feature_3_25_1_, featureloc0_.fmax as fmax4_25_1_, featureloc0_.fmin as fmin5_25_1_, featureloc0_.is_fmax_partial as is_fmax_6_25_1_, featureloc0_.is_fmin_partial as is_fmin_7_25_1_, featureloc0_.locgroup as locgroup8_25_1_, featureloc0_.phase as phase9_25_1_, featureloc0_.rank as rank10_25_1_, featureloc0_.residue_info as residue11_25_1_, featureloc0_.sequence_id as sequenc12_25_1_, featureloc0_.strand as strand13_25_1_ from feature_location featureloc0_ where featureloc0_.feature_id=? 4076 1 3786
update feature set version=?, date_created=?, dbxref_id=?, description=?, is_analysis=?, is_obsolete=?, last_updated=?, md5checksum=?, name=?, sequence_length=?, status_id=?, symbol=?, unique_name=? where id=? and version=? 3291 0 4222
nathandunn commented 9 years ago

[screenshot: 2015-07-22 8:41 pm]

nathandunn commented 9 years ago

Database results:

Summary . . . worst offenders:

[screenshot: 2015-07-22 8:43 pm]

== X 50% ==

== X 25% ==

== X 24% ==