PaulHancock / Aegean

The Aegean source finding program and associated tools
http://aegeantools.rtfd.io/

Version 2.3.0 is not working #198

Open mangla-sarvesh opened 1 year ago

mangla-sarvesh commented 1 year ago

I did "pip install AegeonTools" in ubuntu 18.4 to download the stable version to the tools and checked that the version was 2.3 When I was running the task "BANE filename.fits", it shows it is using 48 cores and when I checked, it is not using a single core. After 18 hours also it was showing the same things and not a single core/or a core.

I have now uninstalled this version and installed version 2.2, which is working fine.

astrokatross commented 1 year ago

I've noticed the same issue. After some digging around I believe it's in the multiprocessing of BANE. I've set cores=1 and found it runs just fine in version 2.3.0.

tjgalvin commented 1 year ago

I was playing around with this a little, and indeed there are some weird situations that cause BANE to hang. Throwing in some more logging, I can see that the Barrier.wait() is never satisfied:

(857, 1714):  Interpolating bkg to sharemem
(0, 857): barrier.parties=8 barrier.n_waiting=0 False
(4285, 5142):  Interpolating bkg to sharemem
(1714, 2571):  Interpolating bkg to sharemem
(5142, 5999): barrier.parties=8 barrier.n_waiting=1 False
(3428, 4285):  Interpolating bkg to sharemem
(2571, 3428):  Interpolating bkg to sharemem
(857, 1714): barrier.parties=8 barrier.n_waiting=2 False
(4285, 5142): barrier.parties=8 barrier.n_waiting=3 False
(1714, 2571): barrier.parties=8 barrier.n_waiting=4 False
(3428, 4285): barrier.parties=8 barrier.n_waiting=5 False
(2571, 3428): barrier.parties=8 barrier.n_waiting=6 False

In this case BANE will hang here, waiting for some set of threads (presumably 2) to complete. There seems to be some relation to how the slicing happens:

81972:DEBUG ymins [0, 666, 1332, 1998, 2664, 3330, 3996, 4662, 5328, 5994]
81972:DEBUG ymaxs [666, 1332, 1998, 2664, 3330, 3996, 4662, 5328, 5994, 6000]
Number of slices to cores len(ymaxs)=10 nslice=9 cores=9

In the above there are more ymaxs than requested slices, and in this case BANE will hang.

82064:DEBUG ymins [0, 750, 1500, 2250, 3000, 3750, 4500, 5250]
82064:DEBUG ymaxs [750, 1500, 2250, 3000, 3750, 4500, 5250, 6000]
Number of slices to cores len(ymaxs)=8 nslice=8 cores=8

In this case BANE will run fine.

I can seemingly get this pattern to continue -- whenever len(ymaxs) is larger than cores it will hang. I am not sure exactly why, but I think it is related to the number of rows available in the last slice, which comes down to how width_y is calculated. When I force the ymins and ymaxs to keep at least step_size[1] rows in the last slice, at the cost of reducing nslice, I can get things to work perfectly fine. A standalone sketch of the arithmetic is below, followed by the change I made.
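First, a minimal sketch of the slicing arithmetic with made-up values (img_y, nslice and step_size_y here are hypothetical, not taken from a real image), showing how the flooring in width_y can produce one more slice than requested:

# Standalone sketch with hypothetical values, chosen only to show the flooring effect.
img_y = 100        # number of image rows
nslice = 3         # requested number of slices (equal to cores)
step_size_y = 11   # grid step in y, i.e. step_size[1]

# box widths should be multiples of the step size
width_y = int(max(img_y / nslice / step_size_y, 1) * step_size_y)  # -> 33

ymins = list(range(0, img_y, width_y))        # [0, 33, 66, 99]
ymaxs = list(range(width_y, img_y, width_y))  # [33, 66, 99]
ymaxs.append(img_y)                           # [33, 66, 99, 100]

# Four slices for three requested: flooring width_y leaves an
# img_y - nslice * width_y = 1 row sliver as an extra, tiny final slice.
print(len(ymaxs), nslice)  # 4 3

And the change I made to catch these cases: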

    if nslice > 1:
        redo_slice = True
        while redo_slice:
            # box widths should be multiples of the step_size, and not zero
            width_y = int(max(img_y/nslice/step_size[1], 1) * step_size[1])
            # width_y = int(max(img_y/(nslice-1)/step_size[1], 1) * step_size[1])

            # locations of the box edges
            ymins = list(range(0, img_y, width_y))
            # ymaxs = [ym + width_y for ym in ymins]
            ymaxs = list(range(width_y, img_y, width_y))
            ymaxs.append(img_y)

            if ymaxs[-1] - ymins[-1] < step_size[1]:
                logger.debug(f"Reducing {nslice=} {ymaxs[-1]=} {ymins[-1]=} {step_size[1]=}")
                nslice -= 1
                if nslice == 0:
                    raise ValueError("Stupod Tim")
            else:
                redo_slice = False 

I hate that I wrote this, but it seems to catch the cases that are prompting slices to hang. It is not clear to me whether there is something wrong in the sigma_filter function that is not being handled properly (overlapping slices trying to write to the same bit of the SharedArray), or whether the barrier does not like that there are more parties than cores involved in the pool (I would hope not).

That is all I have in me for tonight -- any thoughts @PaulHancock ?

tjgalvin commented 1 year ago

Coming back to this with a fresh pair of eyes, I think the problem lies in how the Barrier object is used and how it interacts with the number of cores made available to the pool. It is almost as if, when there are more threads than cores, some threads are never started. I am running this on my OSX system, so whether it is the same issue everyone else is seeing, who knows. For me though, running with

ctx = multiprocessing.get_context(method)
barrier = ctx.Barrier(parties=len(ymaxs))
pool = ctx.Pool(processes=len(ymaxs), maxtasksperchild=1,
                initializer=init, initargs=(barrier, memory_id))

works perfectly fine (after removing all the code in my previous post). I ran a test with

for i in {1..20}
do
    BANE --debug --cores $i beam00_averaged_cal-MFS-I-image.fits
done

When using processes=len(ymaxs) instead of processes=cores everything works perfectly. Running the same test with the code as it is in master, the hanging behaviour bites me on the 6th iteration.

tjgalvin commented 1 year ago

I had the thought at lunch that this hanging behaviour has a simple explanation. Each sub-process spawned by the multiprocessing pool executes the sigma_filter function. Inside this function there is a barrier.wait(), which holds the process from any further execution until all participating parties reach the .wait().

In a process-based multiprocessing pool, a worker will only pick up a new set of input arguments from the .map_async() once it has finished its assigned work. In the case above, where there are more pairs of ymins and ymaxs than cores (likely a bug in the derivation of width_y), we get a deadlock: the running processes are all waiting at the barrier, and the pool will not start any more processes to service the yet-to-be-executed tasks.
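To illustrate the mechanism, here is a standalone sketch (not BANE code; the party and worker counts are made up) of a pool whose tasks all call barrier.wait() while the barrier expects more parties than the pool has processes:

import multiprocessing

_barrier = None

def init(barrier):
    # give every worker a handle on the shared barrier
    global _barrier
    _barrier = barrier

def task(i):
    # every task parks here until all parties have arrived
    _barrier.wait()
    return i

if __name__ == "__main__":
    ctx = multiprocessing.get_context("spawn")
    # 4 parties expected at the barrier, but only 2 worker processes:
    # the 2 running tasks block at wait(), the other 2 tasks are never
    # started because no worker ever finishes, so map() hangs forever.
    barrier = ctx.Barrier(parties=4)
    with ctx.Pool(processes=2, initializer=init, initargs=(barrier,)) as pool:
        print(pool.map(task, range(4)))

Run as-is this script hangs forever, which is the same behaviour BANE shows when len(ymaxs) exceeds the number of pool processes.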

tjgalvin commented 1 year ago

Hi @PaulHancock - just checking whether the above makes sense. Does it seem reasonable to you?

PaulHancock commented 1 year ago

Can confirm that this occurs whenever the number of cores requested is larger than the number of cores available. I think @tjgalvin is right about the cause.

I used to have a set of barriers (one per image stripe), and a particular process only needed to wait for its neighbours to finish before it proceeded. When I changed this so that there was a single global barrier, we got the behaviour that you are all seeing now.

An immediate workaround is to explicitly set the number of cores to be equal to or less than the number of cores available on your system. (This should be the default behaviour.)
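For anyone who wants to guard against this in their own scripts until a fix lands, a minimal sketch of that kind of clamping (the clamp_cores helper below is illustrative only, not part of the AegeanTools API):

import multiprocessing

def clamp_cores(requested_cores, nslice=None):
    # Never ask for more workers than the machine has, and never create
    # more slices (barrier parties) than workers.
    available = multiprocessing.cpu_count()
    cores = min(requested_cores or available, available)
    nslice = min(nslice or cores, cores)
    return cores, nslice

# e.g. clamp_cores(48) on a 16-core machine returns (16, 16)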