Closed: wcwitt closed this issue 1 year ago
@CheukHinHoJerry, I'm happy to help figure this out. But if it's only a problem for large lsq, are you certain it's not running out of memory? That would be my first guess.
Edit: just searched your error message and found this, which confirms memory could be the issue: https://stackoverflow.com/questions/46515249/meaning-of-julia-error-error-unhandled-task-failure-eoferror-read-end-of-fi.
Thank you for your prompt reply. The reason I guess it is not an OOM issue is that I checked the free memory with:
julia> Sys.free_memory() / 2^20 / 1024
240.16326141357422
which shows ~240 GB of free RAM on the server, whereas my storage space has around 30 GB available.
I am assembling a matrix of size (446365, 2773), which I expect to be around 9.90 GB.
If there is a better way to check this, could you please let me know? I would be thankful.
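For reference, a quick back-of-the-envelope check of the expected matrix size (using the shape quoted above and assuming dense Float64 entries):

```julia
# Rough memory estimate for a dense Float64 matrix of the stated shape
nrows, ncols = 446365, 2773
bytes = nrows * ncols * sizeof(Float64)
println(bytes / 1e9)   # ≈ 9.9 GB
```

Note that peak memory during assembly can be much higher than the final matrix size if temporaries from each subblock are not collected promptly.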
Update: I kept printing Sys.free_memory() during assembly, and it does in fact run out of memory quite soon. Is this expected, or is it a problem with the GC?
Sorry - I didn't see your update before. I have been seeing this recently as well on large datasets, and I don't know what could have changed. For me, it helps to add GC.gc() after each subblock of the matrix is assembled - can you try that?
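To illustrate, here is a minimal sketch of that pattern; assemble_subblock! and the block layout are made up for illustration, not the actual ACEfit code:

```julia
# Illustrative assembly loop: collect garbage after each subblock so that
# temporaries allocated while building the block are freed promptly.
A = zeros(10, 4)
blocks = [1:5, 6:10]                        # hypothetical row ranges
assemble_subblock!(A, rows) = (A[rows, :] .= 1.0; nothing)
for rows in blocks
    assemble_subblock!(A, rows)
    GC.gc()                                 # free temporaries before the next block
end
```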
I added that to the mt assembly already and it seemed to solve my problem.
Yes. I tried that too and it works fine.
Do you mean it works with the distributed now, if you garbage collect? Or just with the mt?
(I do plan to merge some form of mt, just trying to get to the bottom.)
Sorry for the confusion. I tried that with mt only but didn't try with the multiprocess version.
Since you confirmed this was an OOM problem, and I remembered that I recently deleted a GC.gc() call in 54b7b2ed0a5ccb66820a799e43353f5979d68cfc, I think we have the answer. I've now restored the garbage collection in 65d5bc0.
Okay to close? We'll handle the mt elsewhere.
Sure - Thank you very much for your help!
I added that to the mt assembly already and this seemed to solve my problem.
@cortner when you said this, did you mean you added GC.gc() to the mt assembly? If so, where? I'm finding it has a huge effect on performance in the mt case.
weird, I guess I must have this locally somewhere and forgot to push it.
I think what I did is similar to what you did for distributed. Does this cause performance issues for distributed or for mt? (It didn't for mt, I think...)
I don't really understand how it can slow down performance, maybe another question for @tjjarvinen ??
One possible solution might be to GC only every 10 iterations or so?
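That could look something like the following sketch, where the loop body is just a stand-in for the real assembly step:

```julia
# Trigger a full collection only every `gc_every` iterations instead of every one,
# trading a slightly higher peak memory for fewer expensive GC pauses.
gc_every = 10
results = Float64[]
for i in 1:25
    push!(results, i^2)              # stand-in for assembling one subblock
    i % gc_every == 0 && GC.gc()     # collect on iterations 10 and 20 only
end
```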
I presume we are now talking about this code block, which reads:
A = SharedArray(zeros(rows[end][end], length(basis)))
Y = SharedArray(zeros(size(A, 1)))
W = SharedArray(zeros(size(A, 1)))
@info " - Beginning assembly with processor count: $(nprocs())."
@showprogress pmap(packets) do p
    A[p.rows, :] .= feature_matrix(p.data, basis)
    Y[p.rows] .= target_vector(p.data)
    W[p.rows] .= weight_vector(p.data)
    GC.gc()
end
This is a parallel reduction. The idea of the code seems to be to define SharedArrays A, Y, and W in which the reduction is performed. The issue is probably in how SharedArray handles random access from different processes. It probably creates multiple copies of itself, which leads to the OOM error.
Also, SharedArray is most likely not designed for reductions, so you could try something like this and see whether it has the same problem:
A = SharedArray(zeros(rows[end][end], length(basis)))
Y = SharedArray(zeros(size(A, 1)))
W = SharedArray(zeros(size(A, 1)))
assembly_parts = pmap(packets) do p
    Ap = feature_matrix(p.data, basis)
    Yp = target_vector(p.data)
    Wp = weight_vector(p.data)
    (rows = p.rows, A = Ap, Y = Yp, W = Wp)   # return a NamedTuple per packet
end
for part in assembly_parts
    A[part.rows, :] .= part.A
    Y[part.rows] .= part.Y
    W[part.rows] .= part.W
end
If this solves the issue, then the garbage-collection call just caused sync to be called with different instances of the SharedArrays on different processes, and the same should be achievable with a @sync call.
We had the OOM also with multithreading, so I doubt this is the issue. The question was not about the OOM but about why calling GC after each assembly step might lead to a significant slowdown. This surprises me very much.
why calling GC after each assembly step might lead to significant slowdown.
If it leads to a @sync call, then it has to sync the array to all processes. This causes a lot of communication and thus a slowdown.
But GC on different processes would hardly require that?
A, Y, and W are SharedArrays that need to be synced between processes. So, when one process updates one of them, a sync is needed. Julia probably buffers these updates, which causes the OOM error (I am not sure of this). GC will then force a sync, which needs to be carried out on all processes.
I see - not completely, but somewhat. It's also consistent with my observation that on a single node I don't seem to get a slowdown.
Chuck - worth a look I think?
Thanks - I generally agree with @tjjarvinen that a sync might be the culprit, but I don't experience the problem with distributed, only with threading. One difference is that in distributed each process gets its own garbage collector, whereas in multithreading there is just one.
I have an example for someone to try, but I'm realizing I shouldn't have hijacked this dead issue. I will post it here: https://github.com/ACEsuit/ACEfit.jl/pull/54.
Reported by @CheukHinHoJerry in https://github.com/ACEsuit/ACE1x.jl/issues/7.
I got this error multiple times with the ACEfit.assemble function with multiple workers on a large lsq system, and I remember there was an issue about this, so I think it's better to post it here. It happens while I am in the middle of assembling the design matrix. This is the full error log:
Additionally:
It happens every time, so it stops me from assembling a large lsq system.