Performance of JBrowse in Solr-ized P3

rwill1 commented 9 years ago

Capturing the discussion so it won't be lost. We've noticed that JBrowse performance after the December release is VERY variable (from slow, to veeeerrrrrryyyyy slow - as in minutes) and doesn't seem to correlate with restarts of Solr or restarts of the website. The following is some of the analysis that has been done and sent around in emails, captured here for posterity. It is my belief that we should do some work on this.

On 1/12/15 6:18 PM, Andrew Warren wrote: Hi all,

Becky asked me to look into why jbrowse seems to be loading slowly. It does not seem very likely that this is related to the update to the jbrowse code itself. The following requests seem to be taking a long time [43 seconds], even on reload (this particular example happens to be for the genome browser page http://patricbrc.org/portal/portal/patric/GenomeBrowser?cType=genome&cId=1425338.3):

The page makes a request to the PATRIC server:

http://patricbrc.org/portal/portal/patric/GenomeBrowser/GBWindow?action=b&cacheability=PAGE&mode=getTrackInfo&accession=AZIV01000001&annotation=PATRIC

which issues the following SOLR query:

http://macleod:8080/solr/genome_feature/select?facet.range=start&q=accession:AZIV01000001&f.start.facet.range.end=10000000&f.start.facet.range.start=0&fq=annotation:PATRIC+AND+!(feature_type:source)&facet.mincount=1&rows=0&f.start.facet.range.gap=10000&facet=true&wt=json

This takes (now) 46 seconds to answer, which is the bulk of the time required to get the track going. So SOLR being hammered, or having some other problem, is the likely reason jbrowse is slow.

-Andrew
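For reference, the slow query quoted above is an ordinary Solr range facet over the `start` field. A minimal Python sketch of how that URL is assembled follows; the host, core, field names, and parameter values are copied from the email, while the function name and its defaults are made up for illustration and are not taken from the PATRIC code base.

```python
from urllib.parse import urlencode

# Host and core copied from the query quoted in the email above.
SOLR_SELECT = "http://macleod:8080/solr/genome_feature/select"


def histogram_query_url(accession, annotation, end=10_000_000, gap=10_000):
    """Build the range-facet (histogram) query: no rows, only per-bin counts.

    Hypothetical helper for illustration only.
    """
    params = [
        ("q", f"accession:{accession}"),
        ("fq", f"annotation:{annotation} AND !(feature_type:source)"),
        ("rows", 0),                          # no documents, facet counts only
        ("facet", "true"),
        ("facet.range", "start"),             # bin on the feature start coordinate
        ("f.start.facet.range.start", 0),
        ("f.start.facet.range.end", end),
        ("f.start.facet.range.gap", gap),     # end / gap bins in total
        ("facet.mincount", 1),                # omit empty bins from the response
        ("wt", "json"),
    ]
    return SOLR_SELECT + "?" + urlencode(params)


url = histogram_query_url("AZIV01000001", "PATRIC")
```

With the defaults shown, Solr is asked to count features in 10,000,000 / 10,000 = 1,000 bins for a single accession, regardless of how long that accession actually is.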

On 1/12/15 11:33 PM, Harry Yoo wrote: Yes, you're right. Solr is excellent at searching, but not optimized for sorting & fetching several thousand rows. Maulik and I discussed this issue, and one option would be reading from a file instead of fetching from Solr. The downloadable file itself is organized by genome name and pre-sorted by accession and coordinate, but this would require re-implementing the data feeding part. I am open to any other suggestions as well.

Harry

On 1/13/15 12:04 AM, Andrew Warren wrote: Dustin and I have been doing some speculation.

I tried to do a solr interval facet instead of a range facet. https://wiki.apache.org/solr/SimpleFacetParameters#Interval_Faceting

This doesn't work because docValues is not set for start. If it were, we could do something like this:

Alternative 1:

http://macleod:8080/solr/genome_feature/select?facet.interval=start&f.start.facet.interval.set=%5B0,10000%5D&f.start.facet.interval.set=%5B10000,20000%5D&q=accession:AZIV01000001&fq=annotation:PATRIC+AND+!(feature_type:source)&facet.mincount=1&rows=0&facet=true&wt=json

But that would require specifying some 1,000 f.start.facet.interval.set parameters as part of the query (one per 10k interval from 0 to 10M).
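To make the size of such a query concrete, here is a small Python sketch that generates the interval parameters. The parameter name comes from the URL above; the helper name is hypothetical, and the half-open `[lo,hi)` form is a choice made here so that adjacent bins do not share a boundary value (the example URL above uses closed intervals).

```python
from urllib.parse import urlencode


def interval_set_params(end=10_000_000, gap=10_000):
    """Return one f.start.facet.interval.set parameter per bin.

    Hypothetical helper; half-open intervals avoid counting a feature
    that starts exactly on a bin boundary twice.
    """
    return [
        ("f.start.facet.interval.set", f"[{lo},{lo + gap})")
        for lo in range(0, end, gap)
    ]


params = interval_set_params()
query = urlencode([("facet", "true"), ("facet.interval", "start")] + params)
print(len(params), "interval sets,", len(query), "characters of query string")
```

A query string of this size may be easier to send as a POST body than as a GET URL.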

Alternative 2:

Calculate the histogram as part of the indexing and store it as a field alongside the sequence information.
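A rough Python sketch of what the precomputation could look like at indexing time. The bin width matches the 10k gap used in the queries above; the function name and the idea of keying bins by their start coordinate are assumptions for illustration, and how the result would be stored in Solr is left open, as it is in the email.

```python
from collections import Counter


def feature_histogram(starts, gap=10_000):
    """Count features per bin of `gap` bases, keyed by bin start coordinate.

    Hypothetical helper: `starts` is any iterable of feature start positions
    for one accession and annotation source.
    """
    counts = Counter((start // gap) * gap for start in starts)
    return dict(sorted(counts.items()))


# Five features: two in the first bin, one in the second, two in the third.
hist = feature_histogram([12, 9_999, 10_000, 25_300, 25_301])
```

Bins with no features are simply absent, which mirrors what `facet.mincount=1` does in the range-facet query.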

Alternative 3:

Simply enable docValues on the "start" field and see if that improves performance. The English in the documentation is fuzzy, but there are some indications that it might improve things.

http://wiki.apache.org/solr/DocValues

So I would try these as:

Alt 3, 1, 2

On 1/13/15 12:21 AM, Harry Yoo wrote: Oh, I thought you were talking about fetching features, but it was actually about the histogram count.

Two thoughts,

As far as I remember, the histogram query is called only when the number of features in the viewing window is greater than a certain limit. Depending on the sequence (or on the request), it may show the histogram first, but most likely it will display the first 10,000 bp. The alternatives you suggested may address some cases, but I think the fetching part is more significant.
Interval Faceting is new to me. This type of approach can also be done with the JSON Facet API provided by Heliosearch, with two possible benefits: 1) it may not require docValues, 2) JSON facets utilize native code.

http://heliosearch.org/json-facet-api/

http://heliosearch.org/native-code-faceting/

On 1/13/15 11:33 PM, Andrew Warren wrote: So this won't be too obscure I will just jump right to the diagnosis/speculation:

Click this to see how fast Initial Jbrowse load should be

http://anwarren.vbi.vt.edu/portal/portal/patric/GenomeBrowser?cType=genome&cId=588858.6&loc=CP001363%3A1..10003&tracks=DNA%2CPATRICGenes%2CRefSeqGenes&highlight=

Every default Genome Browser page load is slow for the following reason:

Loading the Genome Browser page from the Organism landing page issues slow SOLR commands.

Specifically the server side java/jsp called by:

http://patricbrc.org/portal/portal/patric/GenomeBrowser/GBWindow?action=b&cacheability=PAGE&mode=getTrackInfo&accession=AZIV01000001&annotation=RefSeq (48.72 seconds)

http://patricbrc.org/portal/portal/patric/GenomeBrowser/GBWindow?action=b&cacheability=PAGE&mode=getTrackInfo&accession=AZIV01000001&annotation=PATRIC (48.68 seconds)

Each of these is creating two SOLR queries (4 queries total) [only refseq version shown]:

http://macleod:8080/solr/genome_feature/select?facet.range=start&q=accession:AZIV01000001&f.start.facet.range.end=10000000&f.start.facet.range.start=0&fq=annotation:RefSeq+AND+!(feature_type:source)&facet.mincount=1&rows=0&f.start.facet.range.gap=10000&facet=true (34 seconds)

http://macleod:8080/solr/genome_feature/select?q=accession:AZIV01000001+AND+annotation:RefSeq+AND+!(feature_type:source)&sort=start+asc&rows=10000 (108 ms)

The only difference in these two queries is the facet.range part. So that is the slow part. This begs the questions:

Q1: Do we need the histogram to start the genome browser?

Q2: Is it necessary to put these two queries in the same request to the server?

Details:

patric3_website/portal/patric-jbrowse/WebContent/data/trackList.jsp

is set up for two tracks and uses the URLs from above as the urlTemplate:

http://patricbrc.org/portal/portal/patric/GenomeBrowser/GBWindow?action=b&cacheability=PAGE&mode=getTrackInfo&accession=AZIV01000001&annotation=RefSeq (48.72 seconds)

http://patricbrc.org/portal/portal/patric/GenomeBrowser/GBWindow?action=b&cacheability=PAGE&mode=getTrackInfo&accession=AZIV01000001&annotation=PATRIC (48.68 seconds)

*****This in turn calls:

patric3_website/portal/patric-jbrowse/src/edu/vt/vbi/patric/portlets/GenomeBrowser.java

public void serveResource(ResourceRequest request, ResourceResponse response) throws PortletException, IOException

            case "getTrackInfo":

                    printTrackInfo(request, response);

                    break;

*****printTrackInfo calls the following

getFeatureCountHistogram

*****Which causes the histogram SOLR query.

http://macleod:8080/solr/genome_feature/select?facet.range=start&q=accession:AZIV01000001&f.start.facet.range.end=10000000&f.start.facet.range.start=0&fq=annotation:RefSeq+AND+!(feature_type:source)&facet.mincount=1&rows=0&f.start.facet.range.gap=10000&facet=true (34 seconds)

**Once you zoom out far enough, the following finally gets issued, but that data is not used!

http://patricbrc.org/portal/portal/patric/GenomeBrowser/GBWindow?action=b&cacheability=PAGE&mode=getHistogram&accession=CP001363&annotation=PATRIC&chunk=0

**_Some additional points_**

A solution requires changes to the track setup, and GenomeBrowser.java

SOLUTION: Load the histogram separately through one of two defined jbrowse mechanisms so that the initial page load isn't so slow.

Use either the regionFeatureDensities with HTMLFeatures http://gmod.org/wiki/JBrowse_Configuration_Guide#GET_.28base.29.2Fstats.2FregionFeatureDensities.2F.28refseq_name.29.3Fstart.3D123.26end.3D456.26basesPerBin.3D20000

OR

histograms.urlTemplate and histograms.storageClass with CanvasFeatures.

http://gmod.org/wiki/JBrowse_Configuration_Guide#Configuring_Summary_Histograms

Either way this allows us to separate the histogram load from the initial load (which we should do regardless of how we speed up the histogram request).
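If the regionFeatureDensities route were taken, the server side would need to turn Solr's range-facet counts into the response that endpoint returns. The sketch below is only an illustration under two assumptions that should be checked against the real systems: that Solr returns the counts as a flat `[bin_start, count, ...]` list (its default JSON layout for facets, as far as I know), and that JBrowse expects an object with a dense `bins` array plus `stats.basesPerBin` and `stats.max`, which is how I read the configuration guide linked above. The function name is hypothetical.

```python
def to_region_feature_densities(solr_counts, start, end, bases_per_bin):
    """Convert sparse Solr range-facet counts into a dense bins array.

    Assumes `start` is aligned to the facet bins and `bases_per_bin`
    equals the facet gap; empty bins (dropped by facet.mincount=1)
    are filled back in with zeros.
    """
    by_bin = {
        int(solr_counts[i]): solr_counts[i + 1]
        for i in range(0, len(solr_counts), 2)
    }
    bins = [by_bin.get(lo, 0) for lo in range(start, end, bases_per_bin)]
    return {
        "bins": bins,
        "stats": {"basesPerBin": bases_per_bin, "max": max(bins, default=0)},
    }


resp = to_region_feature_densities(
    ["0", 5, "20000", 3], start=0, end=40_000, bases_per_bin=10_000
)
```

Because this conversion is independent of the feature fetch, it could be served from its own request, which is the separation argued for above.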

I was working on using regionFeatureDensities but ran into some problems with overloading baseURL competing with the parameters defined in patric-jbrowse/WebContent/WEB-INF/index.jsp

I have attached a diff (use Xcode MergeTool to view) for what I was trying relative to the https://github.com/cidvbi/patric3_website.git repo

On 1/14/15 3:56 PM, Rebecca Will wrote:

OK, being a TOTAL dweeb and doing this because I don't want to work on what I guess I SHOULD be working on ...

1) The genome browser is loading amazingly fast now - don't know if this is bc Dustin restarted Solr this morning (but he'd restarted Solr before and I haven't seen this genome browser load speed since the holidays) I tried it on several genomes and finally decided to try it on one that I didn't think would be cached.

ok, I kept trying examples LONG after I should have stopped.

4) For Listeriaceae bacterium TTU M1-001, it doesn't EVER shrink to histograms ... why? Probably because it's not long enough - although in bps it looks plenty long, but not on the genome browser - does this have something to do w/ contigs and it only does 1 contig at a time on the genome browser?

5) And I just keep finding things - Mycobacterium tuberculosis H37RvHA apparently doesn't have a RefSeq annotation - but we put up the marker for the RefSeq track anyway when we hit the Genome Browser page. OK, I'll enter this one on GitHub

On 1/14/15 5:36 PM, Andrew Warren wrote: If in the future we wish to separate the initial feature track load from the histogram information, it will require solving the chicken-and-egg problem of making the JSON histogram information present in the track info.

Look at getRegionFeatureDensities in the NCList.js JBrowse code and at patric-jbrowse/src/edu/vt/vbi/patric/portlets/GenomeBrowser.java

-Andrew

mshukla1 commented 8 years ago

I am not sure if this is still considered a problem, and/or if there is a specific action item here, and if it falls into solr / data API / UI bucket.

Andrew?

aswarren commented 8 years ago

Closing this for now since it is no longer a performance hit.

PATRIC3 / patric3_website