genenetwork / genenetwork2

GeneNetwork (2nd generation)
http://gn2.genenetwork.org/
GNU Affero General Public License v3.0

502 errors #278

Closed pjotrp closed 6 years ago

pjotrp commented 6 years ago

GN2 gives 502 errors when a request times out. This can be reproduced today with a long-running request. Open http://gn2.genenetwork.org/show_trait?trait_id=1433387_at&dataset=HC_M2_0606_P

and run correlations with the default values. The log then shows

[2018-02-11 10:10:08 +0000] [4974] [CRITICAL] WORKER TIMEOUT (pid:13769)

after about a minute. I tried replacing the sync workers with gevent and eventlet, and it makes no difference. It appears the problem is that we are running external processes which gunicorn cannot track.
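For reference, the worker model and the timeout discussed here are set in gunicorn's configuration file, which is itself Python. A minimal sketch follows; every value is an assumption chosen for illustration and is not the actual GN2 production setting. `worker_class`, `workers`, `timeout` and `graceful_timeout` are standard gunicorn settings.

```python
# gunicorn.conf.py: illustrative values only, not the GN2 production config

# Worker model mentioned in this thread; needs the eventlet package installed.
worker_class = "eventlet"

# Number of worker processes (assumed value).
workers = 4

# Seconds a worker may stay silent before the master logs
# "WORKER TIMEOUT" and restarts it. The gunicorn default is 30.
timeout = 1200

# Seconds workers get to finish in-flight requests on restart (assumed value).
graceful_timeout = 60
```

According to the gunicorn documentation, for non-sync workers the timeout only checks that the worker process is still communicating with the master, so a call that blocks the whole worker (such as waiting on an external process) can still trigger it. nginx in front of gunicorn applies its own limits as well, so raising only the gunicorn value may not be enough.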

pjotrp commented 6 years ago

The eventlet worker model is now running on production. It appears to improve things.

pjotrp commented 6 years ago

When nginx does not get a response from upstream it returns a 502. In the current setup we should get fewer of those. I just ran a 6-minute global search that did not bomb out: http://gn2.genenetwork.org/search?species=human&group=GTEx_v5&type=Cervix+mRNA&dataset=GTEXv5_CerEct_0915&search_terms_or=&search_terms_and=*&FormID=searchResult

Please try things like this yourself.

robwwilliams commented 6 years ago

Checking using the monster large exon array data set, and using probe set 4684870 (index 1 return). This is a great data set to stress test any correlation method.

[image: Inline image 1]

RUNNING TEST
7:51 AM: start time; top 500 correlations.
7:54: all is still apparently good.
8:00: all good. Started independent GN2 queries and they are running fine.
8:04: process still running. There is no progress bar, and at this point 90% of users will have assumed we crashed a process.
8:08: coming up on 18 minutes. This is what the user sees.

[image: Inline image 2]

Hmm, can a process like this get a "title" in a browser window, like "GN Correlation in Progress: IDXXXX"?

8:12 AM: Perfect: I got the error message right on time: 502 Bad Gateway

nginx/1.4.1

I suspect in this case the calculation would have completed, but it is just too damn slow on GN2.
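One common way to avoid holding the HTTP request open for the whole calculation, and to give the user a "Correlation in Progress" style status like the one asked about above, is to start the job in the background, return a job ID at once, and let the page poll for the result. The sketch below uses only the Python standard library and is written for Python 3. All names (`submit_job`, `job_status`, `wait_for`, `slow_correlation`) are hypothetical and not part of GN2.

```python
import threading
import time
import uuid

# In-memory job table. A real deployment would need shared storage
# (for example a database), because gunicorn runs several workers.
_jobs = {}
_jobs_lock = threading.Lock()


def submit_job(func, *args):
    """Start func(*args) in a background thread and return a job ID at once."""
    job_id = uuid.uuid4().hex
    with _jobs_lock:
        _jobs[job_id] = {"state": "running", "result": None, "error": None}

    def runner():
        try:
            result = func(*args)
            update = {"state": "done", "result": result, "error": None}
        except Exception as exc:  # report the failure instead of hanging
            update = {"state": "failed", "result": None, "error": str(exc)}
        with _jobs_lock:
            _jobs[job_id] = update

    threading.Thread(target=runner, daemon=True).start()
    return job_id


def job_status(job_id):
    """Return a copy of the job record, or None for an unknown ID."""
    with _jobs_lock:
        record = _jobs.get(job_id)
        return dict(record) if record else None


def wait_for(job_id, poll=0.01, limit=5.0):
    """Poll until the job leaves the 'running' state, as a browser would."""
    deadline = time.monotonic() + limit
    while time.monotonic() < deadline:
        status = job_status(job_id)
        if status["state"] != "running":
            return status
        time.sleep(poll)
    return job_status(job_id)


def slow_correlation(n):
    """Stand-in for a long-running correlation computation."""
    time.sleep(0.1)
    return n * 2
```

In a web application, `submit_job` would be called from the request that starts the correlation and `job_status` from a cheap status endpoint, so no single request stays open long enough for nginx to give up on it.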

NOTE: Running the same request on GN1 (EC2 instance) returns results in 140.2 seconds. That is at least 10X faster than GN2.

[image: Inline image 3]

The reason is that GN1 code is optimized to handle massive arrays of data (case-by-expression) using a text file dump of the "ProbeSetFreeze" rather than direct use of MySQL tables. The correlation calculation was also rewritten (as I recall) in C by "David Kroll" (if you want to grep for the name).

I don't think GN2 knows about our text file dumps. We did "break" this system when we moved GN1 from Lilly to EC2 about 3 months ago, but Lei then fixed this pretty quickly by moving all of the files into EC2. The code can now find the files and works 10X faster. This is probably not hard to implement in GN2. Just speeding up the compute won't help, because MySQL or any RDB will be too damn slow to fetch the data.
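To make the idea concrete, here is a rough sketch of computing top correlations straight from a text dump instead of querying MySQL row by row. The file layout (a header row, then one trait per row with tab-separated values) is an assumption for illustration; the real GN1 dump format and the C implementation are not shown in this thread. The function names are hypothetical, and the example uses only the Python 3 standard library.

```python
import csv
import io
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for constant data
    return cov / math.sqrt(var_x * var_y)


def top_correlations(dump, target_id, top_n=500):
    """Read a tab-separated dump and rank traits by |r| against target_id.

    Assumed layout: header row, then one trait per row as
    trait_id <TAB> value_1 <TAB> value_2 ...
    """
    rows = {}
    reader = csv.reader(dump, delimiter="\t")
    next(reader)  # skip the header row of sample names
    for row in reader:
        rows[row[0]] = [float(v) for v in row[1:]]
    target = rows.pop(target_id)
    scored = [(tid, pearson(target, vals)) for tid, vals in rows.items()]
    scored = [(tid, r) for tid, r in scored if not math.isnan(r)]
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:top_n]


# Tiny made-up dump; the values are invented for the example.
example = io.StringIO(
    "trait\tBXD1\tBXD2\tBXD3\tBXD4\n"
    "4684870\t1.0\t2.0\t3.0\t4.0\n"
    "probe_a\t2.0\t4.0\t6.0\t8.0\n"
    "probe_b\t4.0\t3.0\t2.0\t1.5\n"
    "probe_c\t1.0\t3.0\t2.0\t2.5\n"
)
ranked = top_correlations(example, "4684870", top_n=2)
```

The point of the sketch is the data path: one sequential read of a flat file followed by in-memory arithmetic. A pure Python loop like this would itself be slow on a large exon array, so a real implementation would do the arithmetic in compiled code, as GN1 reportedly does.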


-- Rob


robwwilliams commented 6 years ago

On Sun, Feb 11, 2018 at 08:30:53AM -0600, Rob Williams wrote:

Checking using monster large exon array data set: and using probe set 4684870 (index 1 return). This is a great data set to stress test any correlation method.

Absolutely. Interestingly the server has no load and the error log looks like:

ERROR:wqflask.views:.show_trait_page: 13:51:49 UTC 20180211: u'http://gn2.genenetwork.org/show_trait?trait_id=4684870&dataset=UMUTAffyExon_0209_RMA'
ERROR:wqflask.views:.index_page: 13:51:53 UTC 20180211: u'http://gn2.genenetwork.org/'
ERROR:wqflask.views:.corr_compute_page: 13:51:58 UTC 20180211: u'http://gn2.genenetwork.org/corr_compute'
INFO:utility.tools:Found: file /home/zas1024/genotype_files/genotype/BXD.geno

ERROR:wqflask.views:.submit_trait_form: 13:52:27 UTC 20180211: u'http://gn2.genenetwork.org/submit_trait'
ERROR:wqflask.views:.help: 13:52:55 UTC 20180211: u'http://gn2.genenetwork.org/help'
ERROR:wqflask.views:.index_page: 13:53:49 UTC 20180211: u'http://gn2.genenetwork.org/'
ERROR:wqflask.views:.index_page: 13:55:44 UTC 20180211: u'http://gn2.genenetwork.org/'
ERROR:wqflask.views:.index_page: 13:55:49 UTC 20180211: u'http://gn2.genenetwork.org/'
ERROR:wqflask.views:.index_page: 13:57:49 UTC 20180211: u'http://gn2.genenetwork.org/'
ERROR:wqflask.views:.submit_trait_form: 13:57:52 UTC 20180211: u'http://gn2.genenetwork.org/submit_trait'
ERROR:wqflask.views:.show_temp_trait_page: 13:58:18 UTC 20180211: u'http://gn2.genenetwork.org/show_temp_trait'
ERROR:wqflask.views:.handle_bad_request: 13:58:18 UTC 20180211: could not convert string to float: ZNF77
ERROR:wqflask.views:.handle_bad_request: 13:58:18 UTC 20180211: u'http://gn2.genenetwork.org/show_temp_trait'
ERROR:wqflask.views:.handle_bad_request: 13:58:18 UTC 20180211: Traceback (most recent call last):
    File "/usr/local/guix-profiles/gn2-2.11rc2/lib/python2.7/site-packages/flask/app.py", line 1639, in full_dispatch_request
        rv = self.dispatch_request()
    File "/usr/local/guix-profiles/gn2-2.11rc2/lib/python2.7/site-packages/flask/app.py", line 1625, in dispatch_request
        return self.view_functions[rule.endpoint](**req.view_args)
    File "/home/production/gene/wqflask/wqflask/views.py", line 416, in show_temp_trait_page
        template_vars = show_trait.ShowTrait(request.form)
    File "/home/production/gene/wqflask/wqflask/show_trait/show_trait.py", line 151, in __init__
        self.make_sample_lists()
    File "/home/production/gene/wqflask/wqflask/show_trait/show_trait.py", line 317, in make_sample_lists
        header="%s Only" % (self.dataset.group.name))
    File "/home/production/gene/wqflask/wqflask/show_trait/SampleList.py", line 44, in __init__
        sample = webqtlCaseData.webqtlCaseData(name=sample_name, value=float(self.this_trait[counter-1]))
ValueError: could not convert string to float: ZNF77

So it bombs out but never returns!!

Pj.
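The traceback above ends in `float()` being called on the string `ZNF77` while building the sample list. A defensive conversion along the following lines would turn that into a reportable validation problem instead of an unhandled `ValueError`. This is a sketch only: `parse_sample_value` and `parse_sample_values` are hypothetical helpers, not functions from the GN2 code base, and the example targets Python 3 while the log shows Python 2.7.

```python
import math


def parse_sample_value(raw):
    """Convert a submitted sample value to float, or None if it is not numeric."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None
    # float() also accepts "nan" and "inf"; treat those as missing too
    return value if math.isfinite(value) else None


def parse_sample_values(raw_values):
    """Return (parsed, rejected): floats or None per input, plus (index, raw) rejects."""
    parsed, rejected = [], []
    for index, raw in enumerate(raw_values):
        value = parse_sample_value(raw)
        if value is None:
            rejected.append((index, raw))
        parsed.append(value)
    return parsed, rejected
```

The caller can then decide whether to show the rejected entries to the user or to treat them as missing values, and in either case the request gets a response.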

pjotrp commented 6 years ago

Added issue #284, so both problems above now have their own issue. This one is for the recurring 502s.