PavlidisLab / Gemma

Genomics data re-analysis
Apache License 2.0
21 stars 6 forks source link

/datasets/analyses/differential/results/genes/{gene} fails on staging #1179

Closed oganm closed 3 weeks ago

oganm commented 1 month ago

The endpoint currently fails on staging. putting it here in case it's a regression

Staging

curl -I --header "Accept: text/tab-separated-values" https://staging-gemma.msl.ubc.ca/rest/v2/datasets/analyses/differential/results/genes/13807?&threshold=1
HTTP/1.1 502 Proxy Error
Date: Fri, 05 Jul 2024 03:36:00 GMT
Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips mod_qos/11.70 PHP/5.4.16 mod_perl/2.0.11 Perl/v5.16.3
Strict-Transport-Security: max-age=63072000
Content-Type: text/html; charset=iso-8859-1

Production

curl -I --header "Accept: text/tab-separated-values" https://gemma.msl.ubc.ca/rest/v2/datasets/analyses/differential/results/genes/13807?&threshold=1
HTTP/1.1 200 
Date: Fri, 05 Jul 2024 03:32:45 GMT
Server: Apache/2.4.6 (CentOS) OpenSSL/1.0.2k-fips mod_qos/11.70 PHP/5.4.16
Strict-Transport-Security: max-age=63072000
Content-Encoding: gzip
Content-Type: text/tab-separated-values;charset=utf-8
Set-Cookie: JSESSIONID=05D757D53A836C6F11793A8CC24CEEE2; Path=/; Secure; HttpOnly
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
oganm commented 1 month ago

Now fails on dev too. It appears that the failure happens somewhat sporadically with large queries not using any filters

arteymix commented 1 month ago

Yes, it's due to too many results for that specific gene. I will investigate if we can speed up the query.

arteymix commented 3 weeks ago

It looks like I can save quite a lot of time by not initializing factor values. I'll make those lazy in ContrastResult and will add a special case for the TSV output that does not render their measurement/characteristics.

arteymix commented 3 weeks ago

I've also sped up queries by separating contrasts and probes initialization. That allows Hibernate to perform batch initialization more efficiently since it does not interleave queries.

arteymix commented 3 weeks ago

More optimization on the way by only fetching IDs for baselines and factor values, we only display IDs in the TSV output so we can skip that altogether. I've made ContrastResult's factorValue and secondFactorValue lazily-initialized.

arteymix commented 3 weeks ago

The only remaining bottleneck is a SQL query that fetches the baselines:

select expression0_.ID as col_0_0_, factorvalu3_.EXPERIMENTAL_FACTOR_FK as col_1_0_, factorvalu4_.EXPERIMENTAL_FACTOR_FK as col_2_0_ from ANALYSIS_RESULT_SET expression0_ inner join DIFFERENTIAL_EXPRESSION_ANALYSIS_RESULT results1_ on expression0_.ID=results1_.RESULT_SET_FK inner join CONTRAST_RESULT contrasts2_ on results1_.ID=contrasts2_.DIFFERENTIAL_EXPRESSION_ANALYSIS_RESULT_FK inner join FACTOR_VALUE factorvalu3_ on contrasts2_.FACTOR_VALUE_FK=factorvalu3_.ID inner join FACTOR_VALUE factorvalu4_ on contrasts2_.SECOND_FACTOR_VALUE_FK=factorvalu4_.ID where expression0_.class='ExpressionAnalysisResultSet'

I'm a bit puzzled as to why it is so slow.

arteymix commented 3 weeks ago

I'm going to try to create an index on both first and second FVs. I think it should allow MySQL to quickly find contrasts that have both FKs set.

arteymix commented 3 weeks ago

Yay! Just got it under 40 seconds! It should be much faster once deployed.

arteymix commented 3 weeks ago

Should be completely addressed by 70393b740530c0948a7b289dcc18c4aad2897ea2.

arteymix commented 3 weeks ago

Now, with a573852edeb1aecd346c524268585293e1d56833 it's even faster! I managed to bypass the inefficient treatment of jointures with the results and contrasts.