Strings that are intended to fail HGVS/SPDI detection and be returned as Text variation are also timing out.
Example: clinvar:478285
The large duplication is causing timeout errors due to the vrs-python normalize method.
So clinvar:478285 was likely not working due to the timeout described above. It takes a while for a new healthy instance to start up after the existing one is deemed unhealthy. It is working this morning.
@korikuzma do you know how long the service takes to recover after hitting a timeout exception on a request? Is there a way to have other backend threads respond even if one response thread has crashed? Or to shorten the recovery window? Is that time just how long it takes a new instance to start from scratch?
@theferrit32 No, I'm not sure. Let me look into this issue.
@theferrit32 so I've updated the ASG triggers. After 5 minutes in an unhealthy state, a new instance will start up (it took around 6 minutes to start up and deploy the app), so around 11 minutes total. I went with the default period of 5 minutes. If you need it sooner, I can look into costs and see if we're good to decrease it.
If possible, I think you should add checks to see if a variant has a large position change (e.g. NC_000003.12:g.177772523_185716872dup), especially duplications, and skip requests for them (a rough sketch of such a check is below). I'm unfamiliar with the dataset, but it would probably be useful to also log these variants. At the next metakb meeting, we should discuss this further.
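A minimal sketch of what such a pre-filter could look like, assuming the check is done on the HGVS string before calling the normalizer; the regex, function name, and the 1 Mbp cutoff are illustrative only, not genegraph or variation-normalizer code:

;; Hypothetical helper: skip normalization requests for very long genomic spans.
(def ^:private max-span 1000000) ; example cutoff of 1 Mbp, not an agreed threshold

(defn large-genomic-span?
  "True when an HGVS genomic dup/del expression such as
   NC_000003.12:g.177772523_185716872dup covers more than max-span bases."
  [hgvs]
  (if-let [[_ start end] (re-find #"g\.(\d+)_(\d+)(?:dup|del)" hgvs)]
    (> (- (Long/parseLong end) (Long/parseLong start)) max-span)
    false))

(comment
  (large-genomic-span? "NC_000003.12:g.177772523_185716872dup") ;; => true
  (large-genomic-span? "NC_000001.11:g.100_150del"))            ;; => false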
@korikuzma thanks for looking into that and tweaking the configuration.
Until we get a better solution in the vrs-python code, I will skip these variants for now. I don't think they are a large percentage, so we will still be able to get most of the way there.
@theferrit32 Sounds good. The dev instance is ready for you to use whenever. Let me know if you encounter any other issues.
Long deletions time out as well, though this one may be improperly treated as absolute copy number when it should be relative.
{:fn :normalize-canonical, :variation-expression "NC_000023.10:g.(?_60701)_(127527686_?)del", :expression-type :hgvs, :line 75}
{:fn :normalize-absolute-copy-number, :msg "Error in VRS normalization request", :status 500, :response {:cached nil, :request-time 184, :repeatable? false, :protocol-version {:name "HTTP", :major 1, :minor 1}, :streaming? true, :http-client #object[org.apache.http.impl.client.InternalHttpClient 0x695eec9d "org.apache.http.impl.client.InternalHttpClient@695eec9d"], :chunked? false, :reason-phrase "Internal Server Error", :headers {"Date" "Tue, 08 Nov 2022 19:17:43 GMT", "Content-Type" "text/plain; charset=utf-8", "Content-Length" "21", "Connection" "close", "Server" "nginx/1.20.0"}, :orig-content-encoding nil, :status 500, :length 21, :body "Internal Server Error", :trace-redirects []}, :line 111}
This may be "fixed" on our side by changes in https://github.com/clingen-data-model/genegraph/pull/747, which should convert dups longer than 50bp into RelativeCopyNumber instead of turning them into Allele dups. I just need to check and verify that it works.
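For illustration only (the real change lives in that PR, and these names and the shape of the data are assumptions), the kind of length-based dispatch being described might look roughly like:

;; Illustrative sketch only; the actual logic is in genegraph PR 747.
(defn dup-representation
  "Pick a representation for a duplication based on its span length:
   dups longer than 50 bp become relative copy number rather than Allele dups."
  [{:keys [start stop]}]
  (if (> (- stop start) 50)
    :relative-copy-number
    :allele))

(comment
  (dup-representation {:start 177772523 :stop 185716872}) ;; => :relative-copy-number
  (dup-representation {:start 100 :stop 120}))            ;; => :allele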
Making a note of how to generate a test set of variation ids to filter to, using a location extract table @larrybabb created. The pattern '%)x%' is used to identify which variants have a copy count specified in the name. It might be more reliable to parse this out of the content field as well. In release 2023_02_26 there are 6 Duplication/Deletion variations with a variant_length >= 50 whose name also matched '%)x%'.
select vl.id, v.name, vl.seq_location.variant_length as variant_length
from `clingen-stage.clinvar_2023_02_26_v1_6_57.var_loc_extract` vl
join `clingen-stage.clinvar_2023_02_26_v1_6_57.variation` v on v.id = vl.id
where
vl.seq_location.variant_length >= 50
and v.variation_type in ("Duplication" , "Deletion")
and v.name not like "%)x%"
;
Some long dups are not getting interpreted as copy number in genegraph; a warning is logged for these. Resolving this might be done along with a refactor of the variation ingest/normalization steps, which would make it easier to see what is happening and to attach information to each variant describing why it was interpreted the way it was, or what errors occurred.
Just re-ran the expression NC_000003.12:g.177772523_185716872dup (length 7,944,350 bp) against the hgvs_to_copy_number_change endpoint using this configuration: https://github.com/theferrit32/variation-normalization/blob/bioutils-large-seqs/Dockerfile, and it finished in about 53 seconds, which is a bit long but is a significant improvement. In our snapshot pipeline we do parallelize the normalization, and it does an eager process-ahead, so if one variant takes 50 seconds to run, other worker threads will continue to work ahead in the list of variants. But it consumes the list of processed variants in the same order they started in, so all of those variants are kept in memory until the slow variant finishes and the consumer can proceed.
For example, with a threadpool of size two:
downstream consumer reads and drops from memory in order:
<variant1> (slow) <- thread 1 will sit here waiting for normalizer return
<variant2> (fast) <- thread 2 finishes task and proceeds
<variant3> (fast) <- thread 2 finishes task and proceeds (1-3 are in memory)
<variant4> (fast) <- thread 2 finishes task and proceeds (1-4 are in memory)
<variant5> (fast) <- thread 2 finishes task and proceeds (1-5 are in memory)
<variant6> (fast) <- thread 2 finishes task and proceeds (1-6 are in memory)
<variant7> (fast) <- thread 2 finishes task and proceeds (1-7 are in memory)
<variant8> (fast) <- thread 1 finished variant 1 and 1-6 (or 7 if it has finished) are cleared from memory
In this example variant 1 takes about as long to process as the subsequent 6-7 variants combined, but if a "fast" variant takes 10 milliseconds and a slow one takes 60 seconds, the non-stuck thread will work 6,000 variants ahead, and all of those will be held in memory until the slow one finishes. If there are more threads in the threadpool and slow variants are rare, this number gets even higher. There are also some dup expressions longer than this one, in the tens of millions of bp.
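To make that ordering behavior concrete, here is a minimal standalone Clojure illustration (not the pipeline code): clojure.core/pmap yields results in input order, so a slow head element blocks the consumer even though later results have already been computed and are held in memory.

;; Minimal illustration of ordered parallel consumption; not pipeline code.
(defn fake-normalize [v]
  ;; Pretend the first variant is the pathological slow one.
  (Thread/sleep (if (= v :variant1) 60000 10))
  v)

;; pmap preserves input order, so the consumer below cannot see :variant2
;; onward until :variant1 returns; everything that finished in the meantime
;; is retained in memory behind the stuck head of the sequence.
(doseq [result (pmap fake-normalize [:variant1 :variant2 :variant3 :variant4])]
  (println "consumed" result))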
The updated bioutils is deployed in our cluster in this PR: https://github.com/clingen-data-model/architecture/pull/559
We had discussed adding another change to not normalize copy number variants above a certain length threshold. There are some in ClinVar in the 150 million bp range; they take several minutes to normalize in vrs-python/bioutils, and I'm not sure there's a way to really speed that up.
SELECT id, name, sequence_location,
       CAST(JSON_VALUE(sequence_location, '$."@variantLength"') AS INT64) AS variant_length
FROM (
  SELECT id, name, content_json,
         JSON_QUERY_ARRAY(content_json.Location.SequenceLocation) AS sequence_locations
  FROM (
    SELECT id, name, SAFE.PARSE_JSON(content) AS content_json
    FROM `clingen-stage.clinvar_2023_04_30_v1_6_58.variation`
    --LIMIT 10
  )
)
CROSS JOIN UNNEST(sequence_locations) AS sequence_location
ORDER BY variant_length DESC
top 5 output (id, name, variant_length):
146075,GRCh38/hg38 7p22.3-q36.3(chr7:54185-159282390)x1,159228206
146075,GRCh38/hg38 7p22.3-q36.3(chr7:54185-159282390)x1,159020895
146075,GRCh38/hg38 7p22.3-q36.3(chr7:54185-159282390)x1,158618573
146764,GRCh38/hg38 Xp22.33-q28(chrX:10001-156030895)x1,156020895
144172,GRCh38/hg38 Xp22.33-q28(chrX:10679-156022826)x4,156012148
Actually, the ones listed above have a count specified and are treated as CopyNumberCount. Filtering those out, the top 10 lengths are:
id,name,variant_length
144309,GRCh38/hg38 9p24.3-q34.3(chr9:204193-138179445),140869705
146857,GRCh38/hg38 Xp22.33-q25(chrX:10701-128393708),128383008
153303,GRCh38/hg38 Xp11.22-q24(chrX:50289384-119297604),69008221
146389,GRCh38/hg38 Yp11.32-q12(chrY:10701-57189762),59325213
151755,GRCh38/hg38 9p22.2-q21.11(chr9:18344605-68257015),50650619
560050,Single allele,39283068
155344,GRCh38/hg38 9p24.3-13.1(chr9:203861-38381642),38177782
155093,GRCh38/hg38 21q11.2-22.3(chr21:7749532-46677460),33711360
151749,GRCh38/hg38 18p11.32-11.21(chr18:136226-15175006),15038781
127268,NC_000021.7:g.13636378_28138533dup,14502156
query:
SELECT id, name, MAX(variant_length) AS variant_length
FROM (
  SELECT id, name,
         CAST(JSON_VALUE(sequence_location, '$."@variantLength"') AS INT64) AS variant_length
  FROM (
    SELECT id, name, content_json,
           JSON_QUERY_ARRAY(content_json.Location.SequenceLocation) AS sequence_locations
    FROM (
      SELECT id, name, SAFE.PARSE_JSON(content) AS content_json
      FROM `clingen-stage.clinvar_2023_04_30_v1_6_58.variation`
    )
  )
  CROSS JOIN UNNEST(sequence_locations) AS sequence_location
)
WHERE name NOT LIKE '%)x%'
GROUP BY id, name
ORDER BY variant_length DESC
LIMIT 10
I may have underestimated how significant an issue this would be: https://github.com/clingen-data-model/genegraph/issues/698#issuecomment-1536921234
This is the genegraph java process and the one normalizer worker process still doing work (out of the 6 workers started). Genegraph is saturating all the cores of my laptop and is using over 4GB of memory.
I'm not really sure what in the Clojure code is taking all those CPU resources; maybe garbage collection of a very long list of objects. VisualVM did not respond after opening, and jmap has been hanging for several minutes.
jmap -dump:format=b,file=heap.jmap 3563
I may need to implement a different pmap function that does not retain completed items and instead yields them immediately to downstream code.
jk claypoole provides an unordered pmap, I will try this first: https://cljdoc.org/d/org.clj-commons/claypoole/1.2.2/api/com.climate.claypoole#upmap
The JVM's response to SIGTERM lagged significantly; it took about 20 seconds for the process to terminate, and it remained at 1000%+ core usage until the end. This reinforces the idea that it is stuck in garbage collection, not application code.
Switching to claypoole/upmap seems to be working. With 12 threads in the pool, if any stall on a large variant, the others can proceed, and JVM memory stays stable because finished task outputs are passed to the consumer immediately instead of in input order. I don't have a heap limit set on the JVM right now, but it's staying stable between 550MiB and 750MiB on its own. There are still occasional short pauses that are either garbage collection pauses delaying the sending of requests, or all 6 worker processes in the local variation-normalizer on my laptop being stuck on large variants at the same time, causing the 12 client requests to block and the threadpool to stop receiving new tasks. Either way it's not really an issue. I would consider this resolved.
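For reference, a self-contained sketch of the claypoole approach; normalize-variant and write-result! are placeholder stand-ins rather than genegraph's actual functions, and the pool size of 12 just mirrors the comment above.

;; Requires the org.clj-commons/claypoole dependency.
(require '[com.climate.claypoole :as cp])

;; Placeholder stand-ins for the real normalizer call and downstream consumer.
(defn normalize-variant [v]
  (Thread/sleep (rand-int 100))
  {:variant v :normalized true})

(defn write-result! [result]
  (println "consumed" (:variant result)))

(defn normalize-all [variants]
  (cp/with-shutdown! [pool (cp/threadpool 12)]
    ;; cp/upmap yields each result as soon as its task completes, regardless of
    ;; input order, so one slow variant no longer blocks the consumer or pins
    ;; completed-but-unconsumed results in memory behind it.
    (doseq [result (cp/upmap pool normalize-variant variants)]
      (write-result! result))))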
This expression times out with status 504:
NC_000003.12:g.177772523_185716872dup
http://variation-normalization-dev-eb.us-east-2.elasticbeanstalk.com/variation/to_canonical_variation?q=NC_000003.12:g.177772523_185716872dup&fmt=hgvs&do_liftover=false&hgvs_dup_del_mode=default&untranslatable_returns_text=true