Closed nathandunn closed 9 years ago
I still am strongly against relational. It just isn't suited to genomic feature data. I think it will waste time going in this direction.
On Sun, Jan 11, 2015 at 5:20 PM, Nathan Dunn notifications@github.com wrote:
NAL / i5K can have:
- 100 organisms per server (not sure what that limit is)
- 5K annotations per organism
- up to 100K sequences per organism

so maybe 1 Mil annotations per server
Can you provide some examples / reasoning? I think there are many examples of this working, but I might be missing something. I'm not wedded to one or the other; I just want good reasoning either way. Going non-relational won't necessarily give us a speed-up if done for the wrong reasons.
Nathan
When you have a chance to Skype, or when you're back (no rush).
Let's chat when I'm down next. I've got quite a bit of work to do before we have to make that decision. Most of the Grails code is pretty database-neutral thus far.
Nathan
I was going to propose bulk importing an existing GFF with known protein sequences and then checking the output of our fasta export against the known protein sequences.
I started doing the bulk import but the memory gets out of hand. Need to see what's going on.
Try attaching yourkit to the running instance. How are you doing the imports?
I was using the add_transcripts_from_gff3_to_annotations.pl script
I think I recall another user had similar memory issues with this script, but I wanted to investigate it further if we are to continue supporting it.
Anyway, I got YourKit initialized, so I'll see what I can do.
So, is the issue the memory, though? Hibernate will use all available memory for cache, but it is typically pretty smart about it. I would be more willing to bet that the I/O is the problem. If you look at the SQL section (especially when running in Tomcat) you can see the number of queries and the average query time for each. Very curious where inserts come into this.
It has been a little tricky to figure out exactly what is taking up time, and indeed sometimes the garbage collector is quite good. The YourKit software is also very good at tracing these garbage collections.
Maybe this is random, but I think one operation that gets pretty heavy is the annotator panel. It takes a very long time to find all annotations on all sequences, and it kicks off a lot of different queries.
It ends up being pretty CPU-intensive and doesn't really finish because of the large number of scaffolds that we have. It would be great if it were optimized.
See #389. I had already done some of this work on the sequence panel page.
@cmdcolin Let me know what your test script was when you have moment. I can also take a look at this.
@cmdcolin Not that this is an actual performance bottleneck, but I saw this:
select this_.id as id1_630, this_.version as version2_630, this_.chunk_number as chunk_nu3_630, this_.residue as residue4_630, this_.sequence_id as sequence5_630 from sequence_chunk this_ where this_.sequence_id=? and this_.chunk_number=? limit ? (82 ms total / 0 ms avg / 90 invocations)
That is 82 ms total time (I was updating an exon boundary), an average of 0 ms each across 90 invocations. Of course, this could be quicker than going to the filesystem as well.
Using JMeter for evaluation / script development.
This is indeed called under many different circumstances. It boils down to getResiduesFromSequence for doing that query.
I thought that I could optimize that with the filesystem, and that is actually available on this branch: https://github.com/GMOD/Apollo/tree/seq_chunk_remove
However, we really just need to optimize how that function is used.
Here's an example of poor use: when we add a transcript, the NonCanonicalSplitSiteService calls getResiduesWithAlterations, which in turn calls getResiduesFromSequence repeatedly for each tiny two-base splice site. Ideally it should fetch the whole sequence region once and check for the splice sites in that larger sequence.
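The proposed fix can be sketched as follows: fetch the region once, then check each two-base splice site by indexing into the fetched string rather than issuing a residue query per site. This is a minimal sketch in Python with hypothetical function and variable names, not the actual Grails code; GT/AG is the standard canonical donor/acceptor convention.

```python
# Hedged sketch (not Apollo's actual code): one region fetch, then
# string indexing per splice site, instead of one residue query per site.

def non_canonical_splice_sites(region, region_start, introns):
    """region: residues fetched once, starting at genomic coord region_start.
    introns: list of (start, end) coords, end exclusive.
    Canonical introns begin with GT and end with AG."""
    bad = []
    for start, end in introns:
        donor = region[start - region_start : start - region_start + 2]
        acceptor = region[end - region_start - 2 : end - region_start]
        if donor != "GT" or acceptor != "AG":
            bad.append((start, end))
    return bad

# Example: one canonical intron (GT...AG) and one non-canonical one.
region = "AAGTCCAGTTGACCTTAA"  # hypothetical residues starting at coord 100
print(non_canonical_splice_sites(region, 100, [(102, 108), (110, 116)]))
# [(110, 116)]
```

The point is only the access pattern: one larger fetch amortized over all the splice-site checks.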
I think that there are probably quite a few things like that (see my fixes for FlankingRegion). We have the backing regression tests, so it wouldn’t hurt to try and fix these.
That being said, I think we need a good baseline for performance. The most expensive operations are going to continue to be transcript addition. Trying to setup a WA1 -> WA2 migration while running YourKit is likely the best way to do this.
Nathan
@cmdcolin FYI, I opened a new issue, #452. Feel free to look at this one or the other. I added your note and hope to catch a few other things of that type here. Also of note: we don't really take advantage of queries with fmin / fmax, and instead iterate through objects in order to check them.
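The fmin / fmax point can be illustrated by pushing the range test into the query instead of iterating over feature objects in memory. A hedged sketch: the overlap condition assumes half-open [fmin, fmax) coordinates, and the SQL (using the feature and feature_location tables seen elsewhere in this thread) is illustrative only.

```python
# Hedged sketch: build a single range query so the database filters by
# fmin/fmax, instead of loading every feature and checking coords in code.

def overlap_query(sequence_id, start, end):
    # Two half-open intervals [a, b) and [c, d) overlap iff a < d and b > c.
    return ("select f.* from feature f "
            "join feature_location fl on fl.feature_id = f.id "
            f"where fl.sequence_id = {sequence_id} "
            f"and fl.fmin < {end} and fl.fmax > {start}")

print(overlap_query(7, 1000, 2000))
```

With an index on (sequence_id, fmin, fmax), this keeps the filtering on the database side where it belongs.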
@childers @mpoelchau . . . Sorry, was up at 3.30 am, hopefully this will make more sense:
I wonder if it would be a good time to use our migration scripts on the i5K servers to evaluate pulling WA1 data into WA2 on a large scale. This has to be done anyway. We should be able to get a better feel for possible performance bottlenecks and server sizes.
can you restate that please :)
@cmdcolin Thanks. Should be fixed. @childers @mpoelchau . . . maybe we can chat about it next week.
@nathandunn Sure, let's talk more about it next week. The data and user migration is certainly something that has been on our minds over here.
Sounds good. I may or may not be available that week. We’ll see. If not, the following week then.
That being said, we do have scripts available for migrating Apollo 1 to 2: annotations, users, and organisms. For the last two, you'll need to generate the data out of your databases / instances.
Nathan
To test, will use add_transcripts_from_gff3_to_annotations.pl to load a ref seq onto the organism. But I need to get only a few ref sequences first.
@cmdcolin reports really high CPU spiking not only for the load (which I might expect), but also for simply viewing them, possibly implicating #389 and exposing other inefficiencies, which is good.
Grepped 1.1 and 1.10 from this.
./add_transcripts_from_gff3_to_annotations.pl --url http://localhost:8080/apollo -u ndunn@me.com -p demo -i amel_OGSv3.2_small.gff3
@cmdcolin Just a note. I think you want to set up the agent directly on the java binary. Top is maxing out (as you experienced), but YourKit is showing a pretty low percentage.
@cmdcolin So, if we are bulk loading databases, it probably makes sense to set up the database for bulk loading (e.g., removing indexes):
http://www.postgresql.org/docs/9.2/static/populate.html
Of course, this isn't the get_features problem.
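The linked guide's recipe (drop secondary indexes, load with COPY, recreate the indexes, then ANALYZE) can be sketched as an ordered statement plan. This is a hedged illustration rather than a ready-to-run loader: the table and index names are hypothetical, and the CREATE INDEX column list is a placeholder.

```python
# Hedged sketch of the Postgres bulk-load recipe: build the ordered list of
# statements (dropping indexes before COPY, recreating them after) without
# executing anything against a live database.

def bulk_load_plan(table, indexes, data_file):
    plan = ["BEGIN;"]
    for idx in indexes:
        plan.append(f"DROP INDEX IF EXISTS {idx};")
    plan.append(f"COPY {table} FROM '{data_file}';")
    for idx in indexes:
        # Column list is a placeholder; real index DDL would name real columns.
        plan.append(f"CREATE INDEX {idx} ON {table} (id);")
    plan.append("COMMIT;")
    plan.append(f"ANALYZE {table};")  # refresh planner statistics after the load
    return plan

for stmt in bulk_load_plan("feature", ["feature_name_idx"], "/tmp/features.copy"):
    print(stmt)
```

The ordering is the whole trick: maintaining indexes row-by-row during a 5K-annotation load is far more expensive than rebuilding them once at the end.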
As a timing:
./add_transcripts_from_gff3_to_annotations.pl --url -u ndunn@me.com -p demo 0.20s user 0.04s system 0% cpu 45.937 total
To add just group 1.10 (143 genes) it takes 45 seconds.
It takes 4 seconds to return all of these, which is the bigger concern. Should be under 1 second (like SequencePanel). Part of #389 is to fix that.
Either way, we'll need to explore doing bulk loads. 5K annotations is a bit rough.
It takes 4 seconds to return all of these, is the bigger concern.
That is my concern as well; the browsing should feel snappy.
Will try to address some issues in #460, or some cherry-picked version of it.
I'm also wondering if that script is not releasing memory, either.
I know Hibernate maxes out its cache, but the load doesn't seem to scale linearly, which indicates a memory issue or a VERY large transaction.
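If the problem is one ever-growing transaction, the usual remedy is to flush and clear the session every N inserts so already-persisted entities become collectable. A minimal sketch of that batching pattern, using a stand-in session object rather than the real Hibernate/GORM API:

```python
# Hedged sketch of the periodic flush/clear pattern for bulk inserts.
# FakeSession stands in for a Hibernate session; only the batching logic
# is the point.

BATCH_SIZE = 100

class FakeSession:
    def __init__(self):
        self.pending = []   # stands in for the session's first-level cache
        self.flushed = 0
    def save(self, entity):
        self.pending.append(entity)
    def flush_and_clear(self):
        self.flushed += len(self.pending)
        self.pending.clear()  # entities can now be garbage-collected

def bulk_insert(session, entities):
    for i, entity in enumerate(entities, 1):
        session.save(entity)
        if i % BATCH_SIZE == 0:
            session.flush_and_clear()
    session.flush_and_clear()  # flush the final partial batch

session = FakeSession()
bulk_insert(session, range(250))
print(session.flushed, len(session.pending))  # 250 0
```

Without the periodic clear, every annotation loaded by the script stays referenced by the session until commit, which would explain memory growing with input size.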
Going to have to go with not linear:
./add_transcripts_from_gff3_to_annotations.pl --url -u ndunn@me.com -p demo 0.44s user 0.03s system 0% cpu 6:12.35 total
7× more lines: 412 genes vs. 143 (maybe it's slow with exons?)
Note that get_features is only 425 ms (to display genomics), whereas loading annotations is 34 seconds. Most of that is data transferred but never shown, which is an obvious reason to have the database. May just need to roll my own pagination scheme. Shouldn't be too bad.
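A hand-rolled pagination scheme mostly amounts to turning a page number into LIMIT/OFFSET so only the annotations actually shown get transferred. A minimal sketch, with an illustrative query string and page size (not Apollo's actual API):

```python
# Hedged sketch: map a page number to LIMIT/OFFSET so the annotation list
# is transferred a page at a time instead of all at once.

PAGE_SIZE = 20  # illustrative

def page_query(page):
    offset = page * PAGE_SIZE
    return f"select * from feature order by id limit {PAGE_SIZE} offset {offset}"

print(page_query(0))  # select * from feature order by id limit 20 offset 0
print(page_query(3))  # select * from feature order by id limit 20 offset 60
```

A stable ORDER BY is required for OFFSET paging to be deterministic; for very deep pages, keyset pagination (where id > last_seen_id) avoids the cost of skipping rows.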
Yeah, it's probably more linear with the number of child features, so not exactly exponential, but I think ideally we can make a solution for the "n+1 query problem" that I see mentioned a lot, and instead minimize the number of queries.
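The N+1 pattern and its fix can be sketched in a few lines: rather than one child-feature query per parent, batch the parent ids into a single IN query. The SQL strings are illustrative only.

```python
# Hedged illustration of collapsing the "N+1 query problem": N parents
# should cost one batched query, not N round trips.

def one_query_per_parent(parent_ids):
    # The slow pattern: one feature_location query per parent feature.
    return [f"select * from feature_location where feature_id = {pid}"
            for pid in parent_ids]

def single_batched_query(parent_ids):
    # The fix: one query covering all parents at once.
    ids = ", ".join(str(pid) for pid in parent_ids)
    return f"select * from feature_location where feature_id in ({ids})"

parents = [11, 12, 13]
print(len(one_query_per_parent(parents)))  # 3 round trips
print(single_batched_query(parents))       # 1 round trip
```

In Hibernate/GORM terms this is what join fetching or a batched `in` criteria buys you; the rows then get regrouped by feature_id on the application side.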
From talk with @childers during dev meeting:
The stress test should run over 10 hours, with 5-10 users per organism across multiple organisms, adding and removing genes.
@nathandunn : @jwlin built our stress test script so that those options were configurable parameters, so we could vary number of users, number of instances and time/number of operations. We shared that with you, right?
Yep .. we have it here: https://github.com/GMOD/Apollo/tree/master/docs/apollo-stress-template
I should be able to get it to work with only minor changes.
@deepakunni3 Here you go. @jwlin wrote the original JMX / Jmeter scripts and doc and am sure would be happy to help if that route makes the most sense.
@nathandunn I updated the JMX script for Apollo 2.0 and tried running it with the following thread properties:
Number of Threads (users): 3
Ramp-Up Period (in seconds): 5
Loop Count: 8
There were a couple of issues observed:
@deepakunni3 Thanks. Do you have the updated script (or a template for the script) checked in? Also, is this reading / writing to the same gene / isoform? I'm more concerned with these exceptions occurring when users are working on different genes, which is a more normal workflow . . . not that these errors shouldn't be attended to.
@deepakunni3 So, I guess what I am saying is: can we run the test with 10 users working on 10 different genes (and preferably on different organisms)? Just 10 users and 10 genes would be helpful. I think that the 10 organisms would not matter (knock on wood). They could all be on the same sequence.
@nathandunn The updated script is on master branch.
Yes, you are right. Multiple users working on the same transcript is rare. I can definitely improve the script to test with different users working on different genes.
Yeah, if you could do that separation, that would be great. I know that Moni said that is rare, but I also think we might end up auto-locking these at some point.
Nathan
Per-method profile from the run (after the percentage, the columns appear to be total time, invocation count, and per-call avg / max / min / stddev; labels inferred, since avg matches total ÷ count):

| Service | Method | % | Total | Count | Avg | Max | Min | Std |
|---|---|---|---|---|---|---|---|---|
| RequestHandlingService | addTranscript | 30.93% | 280 | 320 | 0.88 | 2.07 | 0.24 | 0.36 |
| FeatureService | generateTranscript | 26.24% | 238 | 320 | 0.74 | 1.83 | 0.17 | 0.32 |
| FeatureService | convertFeatureToJSON | 13.28% | 120 | 13865 | 0.01 | 0.06 | 0.00 | 0.01 |
| RequestHandlingService | deleteFeature | 12% | 109 | 320 | 0.34 | 8.15 | 0.04 | 0.50 |
| NonCanonicalSplitSiteService | findNonCanonicalAcceptorDonorSpliceSites | 8.37% | 76 | 595 | 0.13 | 0.47 | 0.01 | 0.09 |
| FeatureService | convertJSONToFeature | 5.41% | 49 | 16136 | 0.00 | 0.03 | 0.00 | 0.01 |
| FeatureService | calculateCDS | 1.65% | 15 | 275 | 0.05 | 0.24 | 0.02 | 0.03 |
| FeatureService | setLongestORF | 1.63% | 15 | 275 | 0.05 | 0.20 | 0.02 | 0.03 |
| RequestHandlingService | getFeatures | 0.22% | 2 | 4 | 0.51 | 1.49 | 0.06 | 0.67 |
| FeatureService | addFeature | 0.14% | 1 | 57 | 0.02 | 0.07 | 0.00 | 0.03 |
| FeatureService | convertJSONToFeatureLocation | 0.04% | 0 | 16456 | 0.00 | 0.00 | 0.00 | 0.00 |
| FeatureService | convertFeatureLocationToJSON | 0.04% | 0 | 13865 | 0.00 | 0.00 | 0.00 | 0.00 |
| FeatureService | getTopLevelFeature | 0.03% | 0 | 135 | 0.00 | 0.01 | 0.00 | 0.00 |
| AnnotationEditorController | getUserPermission | 0% | 0 | 4 | 0.01 | 0.02 | 0.00 | 0.01 |
| RequestHandlingService | createJSONFeatureContainer | 0% | 0 | 960 | 0.00 | 0.00 | 0.00 | 0.00 |
I think that these error rates are far too high.
Rerun for about an hour . . . and now I am not getting any errors:
Though the server is lagging badly to keep up and the database is floored:
Clearly it is getting worse over time with the throughput of adds / deletes fired off to the database.
Last snapshot:
The SQL snapshot from YourKit is as follows (total ms / avg ms / invocations):

select feature0_.id as id1_19_0_, feature0_.version as version2_19_0_, feature0_.date_created as date_cre3_19_0_, feature0_.dbxref_id as dbxref_i4_19_0_, feature0_.description as descript5_19_0_, feature0_.is_analysis as is_analy6_19_0_, feature0_.is_obsolete as is_obsol7_19_0_, feature0_.last_updated as last_upd8_19_0_, feature0_.md5checksum as md9_19_0_, feature0_.name as name10_19_0_, feature0_.sequence_length as sequenc11_19_0_, feature0_.status_id as status_12_19_0_, feature0_.symbol as symbol13_19_0_, feature0_.unique_name as unique_14_19_0_, feature0_.alteration_residue as alterat16_19_0_, feature0_.deletion_length as deletio17_19_0_, feature0_.analysis_feature_id as analysi18_19_0_, feature0_.alternate_cv_term as alterna19_19_0_, feature0_.class_name as class_n20_19_0_, feature0_.custom_alternate_cv_term as custom_21_19_0_, feature0_.custom_class_name as custom_22_19_0_, feature0_.custom_cv_term as custom_23_19_0_, feature0_.custom_ontology_id as custom_24_19_0_, feature0_.cv_term as cv_term25_19_0_, feature0_.meta_data as meta_da26_19_0_, feature0_.ontology_id as ontolog27_19_0_, feature0_.class as class15_19_0_ from feature feature0_ where feature0_.id=? (5149 ms total / 0 ms avg / 6864 invocations)

select featureloc0_.feature_id as feature_3_19_0_, featureloc0_.id as id1_25_0_, featureloc0_.id as id1_25_1_, featureloc0_.version as version2_25_1_, featureloc0_.feature_id as feature_3_25_1_, featureloc0_.fmax as fmax4_25_1_, featureloc0_.fmin as fmin5_25_1_, featureloc0_.is_fmax_partial as is_fmax_6_25_1_, featureloc0_.is_fmin_partial as is_fmin_7_25_1_, featureloc0_.locgroup as locgroup8_25_1_, featureloc0_.phase as phase9_25_1_, featureloc0_.rank as rank10_25_1_, featureloc0_.residue_info as residue11_25_1_, featureloc0_.sequence_id as sequenc12_25_1_, featureloc0_.strand as strand13_25_1_ from feature_location featureloc0_ where featureloc0_.feature_id=? (4076 ms total / 1 ms avg / 3786 invocations)

update feature set version=?, date_created=?, dbxref_id=?, description=?, is_analysis=?, is_obsolete=?, last_updated=?, md5checksum=?, name=?, sequence_length=?, status_id=?, symbol=?, unique_name=? where id=? and version=? (3291 ms total / 0 ms avg / 4222 invocations)
Database results:
Summary of the worst offenders:
(screenshots of the worst offenders, at roughly 50%, 25%, and 24%)
Moved to #507
- [ ] pre-fill database with annotations to observe differences
- [ ] do a network latency test (off of icebox) to observe differences

NAL / i5K can have:
Typical numbers:
Potentially important optimizations for write / delete: