clingen-data-model / genegraph

Presents an RDF triplestore of gene information using GraphQL APIs

Long duplication/deletion variations fail with timeout #698

Closed theferrit32 closed 1 year ago

theferrit32 commented 2 years ago

This expression times out with status 504: NC_000003.12:g.177772523_185716872dup

http://variation-normalization-dev-eb.us-east-2.elasticbeanstalk.com/variation/to_canonical_variation?q=NC_000003.12:g.177772523_185716872dup&fmt=hgvs&do_liftover=false&hgvs_dup_del_mode=default&untranslatable_returns_text=true

theferrit32 commented 2 years ago

Strings intended to fail HGVS/SPDI detection and be returned as Text variation are also timing out.

Example: clinvar:478285

http://variation-normalization-dev-eb.us-east-2.elasticbeanstalk.com/variation/to_canonical_variation?q=clinvar:478285&fmt=hgvs&do_liftover=false&hgvs_dup_del_mode=default&untranslatable_returns_text=true

korikuzma commented 2 years ago

The large duplication is causing timeout errors due to the vrs-python normalize method.

So clinvar:478285 was likely not working due to the timeout above. It takes a while for a healthy instance to start up after the previous one is deemed unhealthy. It is working this morning.

theferrit32 commented 2 years ago

@korikuzma do you know how long the service takes to recover after hitting a timeout exception on a request? Is there a way to have other backend threads respond even if one response thread crashed? Or to shorten the recovery window? Is that time just how long it takes a new instance to start from scratch?

korikuzma commented 2 years ago

@theferrit32 No, I'm not sure. Let me look into this issue.

korikuzma commented 2 years ago

@theferrit32 so I've updated the ASG triggers. After 5 minutes in an unhealthy state, a new instance will start up (it took around 6 minutes to start up and deploy the app), so around 11 minutes total. I went with the default period of 5 minutes; if you need it sooner, I can look into costs and see if we're good to decrease it.

If possible, I think you should add checks to see if a variant has a large position change, like NC_000003.12:g.177772523_185716872dup (especially duplications), and skip requests for them. I'm unfamiliar with the dataset, but it would probably be useful to also log these variants. At the next metakb meeting, we should discuss this further.
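As a rough illustration of that kind of pre-check, a minimal Clojure sketch (the helper names and the 100 kb threshold are hypothetical, not values from this thread):

;; Hypothetical pre-check: estimate the span of a genomic dup/del HGVS
;; expression and skip the normalization request when it is very large.
(def max-variant-length 100000)   ; example threshold only

(defn hgvs-span-length
  "Approximate span of an expression like NC_000003.12:g.177772523_185716872dup."
  [expr]
  (when-let [[_ start end] (re-find #"g\.(\d+)_(\d+)(?:dup|del)" expr)]
    (inc (- (Long/parseLong end) (Long/parseLong start)))))

(defn normalize-or-skip
  "Call normalize-fn on expr unless it spans more than max-variant-length."
  [normalize-fn expr]
  (let [len (hgvs-span-length expr)]
    (if (and len (> len max-variant-length))
      (do (println "Skipping large variant:" expr "length:" len)  ; real code would log this
          nil)
      (normalize-fn expr))))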

theferrit32 commented 2 years ago

@korikuzma thanks for looking into that and tweaking the configuration.

Until we get a better solution in the vrs-python code, I will skip these variants for now. I don't think it's a large percentage, so we will still be able to get most of the way there.

korikuzma commented 2 years ago

@theferrit32 Sounds good. The dev instance is ready for you to use whenever. Let me know if you encounter any other issues.

theferrit32 commented 2 years ago

Long deletions fail as well, though this one may be improperly treated as an absolute copy number when it should be relative.

{:fn :normalize-canonical, :variation-expression "NC_000023.10:g.(?_60701)_(127527686_?)del", :expression-type :hgvs, :line 75}
{:fn :normalize-absolute-copy-number, :msg "Error in VRS normalization request", :status 500, :response {:cached nil, :request-time 184, :repeatable? false, :protocol-version {:name "HTTP", :major 1, :minor 1}, :streaming? true, :http-client #object[org.apache.http.impl.client.InternalHttpClient 0x695eec9d "org.apache.http.impl.client.InternalHttpClient@695eec9d"], :chunked? false, :reason-phrase "Internal Server Error", :headers {"Date" "Tue, 08 Nov 2022 19:17:43 GMT", "Content-Type" "text/plain; charset=utf-8", "Content-Length" "21", "Connection" "close", "Server" "nginx/1.20.0"}, :orig-content-encoding nil, :status 500, :length 21, :body "Internal Server Error", :trace-redirects []}, :line 111}

theferrit32 commented 1 year ago

This may be "fixed" on our side by changes in https://github.com/clingen-data-model/genegraph/pull/747 that should convert dups longer than 50bp into RelativeCopyNumber instead of turning into Allele dups. I just need to check and verify that it works.
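For reference, a rough sketch of the length-based dispatch being described (not the actual code from that PR; the names are illustrative):

;; Illustrative only: route large duplications to copy-number handling
;; instead of building an Allele for them.
(def dup-length-threshold 50)   ; bp cutoff mentioned above

(defn variation-kind [{:keys [variation-type variant-length]}]
  (if (and (= variation-type "Duplication")
           (some? variant-length)
           (> variant-length dup-length-threshold))
    :relative-copy-number   ; normalize as a relative copy number change
    :allele))               ; small dups can still be represented as Allele dups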

theferrit32 commented 1 year ago

Making a note of how to generate a test set of variation ids to filter, using a location extract table @larrybabb created. '%)x%' is used to identify which variants have a copy count specified in the name; it might be more reliable to parse this out of the content field as well. In release 2023_02_26 there are 6 Duplication/Deletion variations with a variant_length >= 50 whose name also matched '%)x%'.

select vl.id, v.name, vl.seq_location.variant_length as variant_length
from `clingen-stage.clinvar_2023_02_26_v1_6_57.var_loc_extract` vl
join `clingen-stage.clinvar_2023_02_26_v1_6_57.variation` v on v.id = vl.id
where
  vl.seq_location.variant_length >= 50
  and v.variation_type in ("Duplication" , "Deletion")
  and v.name not like "%)x%"
;

theferrit32 commented 1 year ago

Some long dups are not getting interpreted as copy number in genegraph, and a warning is logged for them. Resolving this might be done alongside a refactor of the variation ingest/normalization steps, which would make it easier to see what is happening and to attach information to variants describing why they were interpreted the way they were, or what errors happened.

theferrit32 commented 1 year ago

Just re-ran the expression NC_000003.12:g.177772523_185716872dup (length 7,944,350 bp) against the hgvs_to_copy_number_change endpoint using this configuration: https://github.com/theferrit32/variation-normalization/blob/bioutils-large-seqs/Dockerfile, and it finished in about 53 seconds, which is a bit long but a significant improvement. In our snapshot pipeline we do parallelize the normalization, and it does an eager process-ahead, so if one variant takes 50 seconds to run, other worker threads will continue to work ahead in the list of variants. But the pipeline consumes the list of processed variants in the same order they started in, so all of those variants are kept in memory until the slow variant finishes and the consumer can proceed.

For example, with a threadpool of size two:

downstream consumer reads and drops from memory in order:
<variant1> (slow) <- thread 1 will sit here waiting for normalizer return
<variant2> (fast) <- thread 2 finishes task and proceeds
<variant3> (fast) <- thread 2 finishes task and proceeds (1-3 are in memory)
<variant4> (fast) <- thread 2 finishes task and proceeds (1-4 are in memory)
<variant5> (fast) <- thread 2 finishes task and proceeds (1-5 are in memory)
<variant6> (fast) <- thread 2 finishes task and proceeds (1-6 are in memory)
<variant7> (fast) <- thread 2 finishes task and proceeds (1-7 are in memory)
<variant8> (fast) <- thread 1 finishes variant 1, and variants 1-7 (or 1-8 if variant 8 has finished) are cleared from memory

In this example, variant 1 takes as long to process as the subsequent 6-7 variants combined. But if, for example, a "fast" variant takes 10 milliseconds and a slow one takes 60 seconds, then the non-stuck thread will proceed 6,000 variants ahead, and those will all be held in memory until the slow one finishes. If there are more threads in the threadpool and slow variants are rare, this number gets higher. There are also some dup expressions longer than this one, into the tens of millions of bp.
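To make the ordering effect concrete, here is an illustrative Clojure sketch (not genegraph's actual pipeline code) of eager, order-preserving consumption on a bounded pool:

(import '(java.util.concurrent Callable Executors ExecutorService Future))

(defn normalize [variant]
  ;; stand-in for the HTTP call to the variation normalizer
  (Thread/sleep (if (:slow? variant) 60000 10))
  (assoc variant :normalized true))

(let [^ExecutorService pool (Executors/newFixedThreadPool 2)
      variants (cons {:id 1 :slow? true}
                     (map #(hash-map :id % :slow? false) (range 2 1000)))
      futures  (mapv (fn [v] (.submit pool ^Callable (fn [] (normalize v))))
                     variants)]
  (doseq [^Future fut futures]   ; consumer reads strictly in submission order,
    (println (.get fut)))        ; so it blocks on variant 1 while 2..999 pile up behind it
  (.shutdown pool))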

theferrit32 commented 1 year ago

The updated bioutils is deployed in our cluster in this PR: https://github.com/clingen-data-model/architecture/pull/559

We had discussed adding another change to not normalize long copy number variants above a certain threshold. There are some in ClinVar in the 150 million bp range, and they do take several minutes to normalize in vrs-python/bioutils; I'm not sure there's a way to really speed that up.

SELECT id, name, sequence_location,
  CAST(JSON_VALUE(sequence_location, '$."@variantLength"') as INT64) as variant_length
FROM 
(SELECT id, name, content_json,
 JSON_QUERY_ARRAY(content_json.Location.SequenceLocation) as sequence_locations
 FROM 
 (SELECT id, name, SAFE.PARSE_JSON(content) AS content_json
  FROM `clingen-stage.clinvar_2023_04_30_v1_6_58.variation`
  --LIMIT 10
  ))
CROSS JOIN UNNEST(sequence_locations) as sequence_location
ORDER BY variant_length DESC

top 5 output:

id,name,variant_length
146075,GRCh38/hg38 7p22.3-q36.3(chr7:54185-159282390)x1,159228206
146075,GRCh38/hg38 7p22.3-q36.3(chr7:54185-159282390)x1,159020895
146075,GRCh38/hg38 7p22.3-q36.3(chr7:54185-159282390)x1,158618573
146764,GRCh38/hg38 Xp22.33-q28(chrX:10001-156030895)x1,156020895
144172,GRCh38/hg38 Xp22.33-q28(chrX:10679-156022826)x4,156012148

theferrit32 commented 1 year ago

Actually the ones listed above have a count specified and are treated as CopyNumberCount.

Filtering those out, the top 10 lengths are:

id,name,variant_length
144309,GRCh38/hg38 9p24.3-q34.3(chr9:204193-138179445),140869705
146857,GRCh38/hg38 Xp22.33-q25(chrX:10701-128393708),128383008
153303,GRCh38/hg38 Xp11.22-q24(chrX:50289384-119297604),69008221
146389,GRCh38/hg38 Yp11.32-q12(chrY:10701-57189762),59325213
151755,GRCh38/hg38 9p22.2-q21.11(chr9:18344605-68257015),50650619
560050,Single allele,39283068
155344,GRCh38/hg38 9p24.3-13.1(chr9:203861-38381642),38177782
155093,GRCh38/hg38 21q11.2-22.3(chr21:7749532-46677460),33711360
151749,GRCh38/hg38 18p11.32-11.21(chr18:136226-15175006),15038781
127268,NC_000021.7:g.13636378_28138533dup,14502156

query:

SELECT id, name, MAX(variant_length) as variant_length
FROM
(SELECT id, name,
  CAST(JSON_VALUE(sequence_location, '$."@variantLength"') as INT64) as variant_length
FROM 
(SELECT id, name, content_json,
 JSON_QUERY_ARRAY(content_json.Location.SequenceLocation) as sequence_locations
 FROM 
 (SELECT id, name, SAFE.PARSE_JSON(content) AS content_json
  FROM `clingen-stage.clinvar_2023_04_30_v1_6_58.variation`))
CROSS JOIN UNNEST(sequence_locations) as sequence_location)
WHERE name not like '%)x%'
GROUP BY id, name
ORDER BY variant_length DESC
LIMIT 10

theferrit32 commented 1 year ago

I may have underestimated how significant of an issue this would be: https://github.com/clingen-data-model/genegraph/issues/698#issuecomment-1536921234

This is the genegraph Java process and the one normalizer worker process still doing work (out of 6 workers started). Genegraph is saturating all the cores of my laptop and is taking over 4 GB of memory.

[Screenshot, 2023-05-08 at 2:04:36 PM: process monitor showing the genegraph and normalizer worker CPU/memory usage]

I'm not really sure what in the Clojure code is taking all those CPU resources; maybe garbage collection of a very long list of objects. VisualVM did not respond after opening, and jmap has been hanging for several minutes.

jmap -dump:format=b,file=heap.jmap 3563

I may need to implement a different pmap function that does not retain completed items and instead yields them immediately to downstream code.

theferrit32 commented 1 year ago

Never mind, claypoole already provides an unordered pmap; I will try this first: https://cljdoc.org/d/org.clj-commons/claypoole/1.2.2/api/com.climate.claypoole#upmap
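A minimal usage sketch, assuming org.clj-commons/claypoole is on the classpath (the input data and the println consumer are stand-ins, not genegraph code):

(require '[com.climate.claypoole :as cp])

(def variants (map #(hash-map :id %) (range 1000)))   ; stand-in input

(defn normalize [variant]
  ;; stand-in for the HTTP call to the variation normalizer
  (assoc variant :normalized true))

(cp/with-shutdown! [pool (cp/threadpool 12)]
  ;; cp/upmap yields each result as soon as its task completes (unordered),
  ;; so finished results are handed downstream immediately instead of
  ;; piling up behind a slow variant the way they do with the ordered cp/pmap.
  (doseq [result (cp/upmap pool normalize variants)]
    (println (:id result))))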

theferrit32 commented 1 year ago

The JVM's response to SIGTERM is lagging significantly: it took about 20 seconds for the process to terminate, and it remained at 1000%+ core usage until the end. This reinforces the idea that it is stuck in garbage collection, not application code.

theferrit32 commented 1 year ago

Switching to claypoole/upmap seems to be working. With 12 threads in the pool, if any stall on a large variant, the others can proceed, and JVM memory is stable because finished task outputs are passed to the consumer immediately instead of in order. I don't have a heap limit set on the JVM right now, but it's staying stable between 550 MiB and 750 MiB on its own. There are still some short pauses every so often, which are either Java garbage-collection pauses delaying the sending of requests, or all 6 worker processes in the local variation-normalizer on my laptop getting stuck on large variants at the same time, causing the 12 client requests to block and the threadpool to stop receiving new tasks. Either way it's not really an issue. I would consider this resolved.