slamander opened 1 year ago
On another set of input data, I received the following stacktrace:
Stacktrace:
[1] wait
@ ./task.jl:345 [inlined]
[2] threading_run(fun::Omniscape.var"#161#threadsfor_fun#12"{Omniscape.var"#161#threadsfor_fun#11#13"{Int64, ProgressMeter.Progress, Int64, Dict{String, String}, Omniscape.ConditionLayers{Float64, 2}, Omniscape.Conditions, Omniscape.OmniscapeFlags, DataType, Dict{String, Int64}, UnitRange{Int64}}}, static::Bool)
@ Base.Threads ./threadingconstructs.jl:38
[3] macro expansion
@ ./threadingconstructs.jl:89 [inlined]
[4] run_omniscape(cfg::Dict{String, String}, resistance::Matrix{Union{Missing, Float64}}; reclass_table::Matrix{Union{Missing, Float64}}, source_strength::Matrix{Union{Missing, Float64}}, condition1::Matrix{Union{Missing, Float64}}, condition2::Matrix{Union{Missing, Float64}}, condition1_future::Matrix{Union{Missing, Float64}}, condition2_future::Matrix{Union{Missing, Float64}}, wkt::String, geotransform::Vector{Float64}, write_outputs::Bool)
@ Omniscape ~/.julia/packages/Omniscape/9gHf2/src/main.jl:257
[5] run_omniscape(path::String)
@ Omniscape ~/.julia/packages/Omniscape/9gHf2/src/main.jl:536
[6] top-level scope
@ /blue/scheffers/jbaecher/global_connectivity/julia_scripts/hpg_australia.jl:5
nested task error:
Progress: 26%|█████████████■| ETA: 0:56:30
signal (11): Segmentation fault
in expression starting at none:0
dgemv_kernel_4x4 at /apps/julia/1.8.2/bin/../lib/julia/libopenblas64_.so (unknown line)
dgemv_t_ZEN at /apps/julia/1.8.2/bin/../lib/julia/libopenblas64_.so (unknown line)
dgemv_64_ at /apps/julia/1.8.2/bin/../lib/julia/libopenblas64_.so (unknown line)
/tmp/slurmd/job61477046/slurm_script: line 24: 23882 Segmentation fault (core dumped) julia -p ${SLURM_CPUS_ON_NODE} julia_scripts/hpg_australia.jl
Mon Apr 10 15:58:11 EDT 2023
Sorry for the incredibly late reply. Oof, this one might be beyond me. If it's producing a core dump, the only way to get to the bottom of it may be to inspect the dump itself, but that's something I'm not well versed in. Did you ever get things working?
One thing you might test (even though it will of course take longer) would be to run it in serial. At least this way we could determine if the issue lies with multithreading.
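For later readers: the nested task above dies inside OpenBLAS (`dgemv_kernel_4x4` / `dgemv_t_ZEN` in `libopenblas64_.so`), so another thing worth testing alongside a serial run is pinning BLAS to a single thread, since Julia-level threads and OpenBLAS's own thread pool can interact badly. A minimal sketch, assuming the crash is a BLAS threading interaction; `BLAS.set_num_threads` is the standard `LinearAlgebra` API, but the config path below is hypothetical, not from this thread:

```julia
# Sketch: limit OpenBLAS to one thread before Omniscape's threaded solve.
# Julia-level threading (set via `julia -t N` or JULIA_NUM_THREADS) is
# unchanged by this; launch with `-t 1` as well to test fully serial.
using LinearAlgebra
BLAS.set_num_threads(1)

using Omniscape
run_omniscape("config.ini")  # hypothetical config path, not from this thread
```

If the run completes with BLAS pinned to one thread but still segfaults otherwise, that would point at thread oversubscription rather than Omniscape itself.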
Vincent,
No worries! I've not been able to make progress, and as a result decided to put this project on the back burner.
I will try running the job in serial and see if there are any changes. Will come back to this shortly.
From: Vincent Landau. Sent: Monday, September 4, 2023 4:30 PM. To: Circuitscape/Omniscape.jl. Subject: Re: [Circuitscape/Omniscape.jl] Segmentation fault in expression starting at none:0 (Issue #129)
Hi OS community,
I've been troubleshooting my analyses for some time now, running into errors similar to those described elsewhere in this issue tracker (e.g., "task failed on specific row and column" and "AssertionError: norm(G * v .- curr) / norm(curr) < 1.0e-6").
Now that I've found a solution to those errors, my analyses are finally running longer than before, in some cases up to 90% completion. But now half of them are failing with the message below and creating a core dump. I've found some mention of this on the general Julia boards, with talk of rather complicated memory allocation issues (I couldn't make any sense of it). I'm running these analyses on a computing cluster (4 CPUs, 32 GB RAM per CPU).
Here's the full stack trace: