Jammy2211 / PyAutoLens

PyAutoLens: Open Source Strong Gravitational Lensing
https://pyautolens.readthedocs.io/
MIT License

Multicore task in a cluster #199

Closed: joaofrancafisica closed this issue 2 years ago

joaofrancafisica commented 2 years ago

Hello again!

I am trying to model a strong lensing system in a cluster using a Slurm job submission script, and it seems the number_of_cores flag in DynestyStatic is not working properly. When I look at the processes with htop, PyAutoLens is running on only a single core even though number_of_cores > 1. Just for completeness, I ran this example script of emcee which makes use of parallelization and everything looks fine (the tasks seem to be distributed across the cores). I am sorry if I am missing something.
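For reference, the emcee parallelization example I followed uses roughly this pattern (a minimal sketch with the docs' toy likelihood, not my lensing model):

import numpy as np
import emcee
from multiprocessing import Pool

# Toy log-probability from the emcee docs; stands in for the real likelihood.
def log_prob(theta):
    return -0.5 * np.sum(theta ** 2)

ndim, nwalkers = 5, 32
p0 = np.random.randn(nwalkers, ndim)

# Passing a multiprocessing Pool makes emcee evaluate walkers in parallel.
with Pool() as pool:
    sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, pool=pool)
    sampler.run_mcmc(p0, 500)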

The script I am using:

import sys
import autolens as al
import autofit as af
import numpy as np
import json
import os
import time

start_time = time.time()

np.random.seed(32)

pixel_scales = 10/120 #arcsec

zl = 0.43
zs = 0.92
#nc = 12
#exp_time = 200
#var_bkg = 2625735.0

# physical useful constants
# residual image
residual_image = al.Array2D.from_fits(file_path='./image_subtracted.fits', pixel_scales=pixel_scales, hdu=3)
#image = al.Array2D.from_fits(file_path='./image.fits', pixel_scales=pixel_scales, hdu=1)
#lens_brightness_model = al.Array2D.from_fits(file_path='./subtracted_image.fits', pixel_scales=pixel_scales, hdu=2)

# noise map
noise_map = al.Array2D.from_fits(file_path='./noisemap.fits', pixel_scales=pixel_scales, hdu=0)
#noise_map = np.sqrt(image*nc*exp_time + var_bkg)/(nc*exp_time)
# psf
psf = al.Kernel2D.from_fits(file_path='./psf.fits', pixel_scales=pixel_scales, hdu=0, normalize=True)
psf = al.preprocess.array_with_new_shape(array=psf, new_shape=(21, 21))
# imaging object
imaging = al.Imaging(image=residual_image, noise_map=noise_map, psf=psf)

# defining our mask
circ_mask = al.Mask2D.circular(shape_native=(120, 120), pixel_scales=pixel_scales, radius=2.7, centre=((69-60.5)*pixel_scales, (57.-60.5)*pixel_scales))
# reading an external mask
mask = al.Mask2D.from_fits(file_path='mask_model.fits', pixel_scales=pixel_scales)

combined_masks = mask + circ_mask
# applying our mask
imaging = imaging.apply_mask(mask=combined_masks)

positions = al.Grid2DIrregular(
    grid=[((74.385404-60.5)*pixel_scales, (73.8811-60.5)*pixel_scales),
        ((76.380381-60.5)*pixel_scales, (41.612878-60.5)*pixel_scales)]
)

positions_likelihood = al.PositionsLHPenalty(positions=positions, threshold=0.3)

# Step 0 - source parametric (source: Sersic, lens: SIE + shear)

# We define a light component called bulge and combine it with our galaxy model. Here we use an elliptical Sersic light profile. If you want to read more about it, please visit https://en.wikipedia.org/wiki/Sérsic_profile
bulge = af.Model(al.lmp.EllSersic)
#bulge.elliptical_comps.elliptical_comps_0 = af.GaussianPrior(mean=0, sigma=0.2)
#bulge.elliptical_comps.elliptical_comps_1 = af.GaussianPrior(mean=0, sigma=0.2)
#bulge.centre.centre_0 = af.GaussianPrior(mean=0, sigma=0.5, lower_limit=-3., upper_limit=3.)
#bulge.centre.centre_1 = af.GaussianPrior(mean=0, sigma=0.5, lower_limit=-3., upper_limit=3.)

source_galaxy_model_0 = af.Model(al.Galaxy,
                                redshift=zs,
                                bulge=bulge)

mass = af.Model(al.mp.EllIsothermal)
#mass.elliptical_comps.elliptical_comps_0 = af.GaussianPrior(mean=0, sigma=0.2)
#mass.elliptical_comps.elliptical_comps_1 = af.GaussianPrior(mean=0, sigma=0.2)
#mass.centre.centre_0 = af.GaussianPrior(mean=0, sigma=1., lower_limit=-4., upper_limit=4.)
#mass.centre.centre_1 = af.GaussianPrior(mean=0, sigma=1., lower_limit=-4., upper_limit=4.)

shear = af.Model(al.mp.ExternalShear)
lens_galaxy_model_0 = af.Model(al.Galaxy,
                            redshift=zl,
                            mass=mass,
                            shear=shear)  

# Combining our galaxies into a single model object
model_0 = af.Collection(galaxies=af.Collection(lens=lens_galaxy_model_0, source=source_galaxy_model_0))

# Let's start fitting the model to the data!
search_0 = af.DynestyStatic(path_prefix = './', # Prefix of the output path
                            name = 'source_parametric', # Name of this model-fit
                            unique_tag = '0839_[1]', # Unique identifier appended to the output path
                            nlive = 250,
                            number_of_cores=35) 

# analysis and results
analysis_0 = al.AnalysisImaging(dataset=imaging, positions_likelihood=positions_likelihood) # Analysis object that fits the imaging data
result_0 = search_0.fit(model=model_0, analysis=analysis_0) # Finally fit our model

model_image = result_0.unmasked_model_image
model_image.output_to_fits('./source_parametric_model.fits', overwrite=True)

print("--- %s seconds ---" % (time.time() - start_time))
joaofrancafisica commented 2 years ago

Job submission script:

#!/bin/bash
##
## ntasks = number of cores
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=35
#
## Job name. Shown in the output of the 'squeue' command.
## A descriptive job name is recommended, but not required.
#SBATCH --job-name=J1018

echo -e "\n## Job started on $(date +'%d-%m-%Y at %T') #####################\n"

## The input and output file names are based on the job name.
## Note that this way of naming the files is not mandatory.

## Job information printed to the output file.
echo -e "\n## Active jobs of $USER: \n"
squeue -a -u $USER
echo -e "\n## Job execution node: $(hostname -s) \n"
echo -e "\n## Number of tasks for this job: $SLURM_NTASKS \n"

## Run the software
module load python
python pipeline.py

echo -e "\n## Job finished on $(date +'%d-%m-%Y at %T') ###################"
Jammy2211 commented 2 years ago

Can you tell me what version of dynesty and autofit you are on?

pip show dynesty
pip show autofit

Jammy2211 commented 2 years ago

Also the Python version.

I honestly have no idea, parallelization is a nightmare.

For Emcee, did you follow the 'Multiprocessing' or the 'MPI' example?

joaofrancafisica commented 2 years ago

Thanks for your prompt answer!

Yeah, sure, here it is:

Python 3.9.12
Dynesty 1.0.1
AutoFit 2022.07.11.1

I tried the 'Multiprocessing' one. Do you think it could work using MPI?

Jammy2211 commented 2 years ago

Dynesty supports multiprocessing, I was checking if it only worked with MPI.

Can you check the behaviour of an autofit Emcee fit by changing:

search_0 = af.DynestyStatic(path_prefix = './', # Prefix of the output path
                            name = 'source_parametric', # Name of this model-fit
                            unique_tag = '0839_[1]', # Unique identifier appended to the output path
                            nlive = 250,
                            number_of_cores=35) 

to:

search_0 = af.Emcee(path_prefix = './', # Prefix of the output path
                    name = 'source_parametric', # Name of this model-fit
                    unique_tag = '0839_[1]', # Unique identifier appended to the output path
                    number_of_cores=35) 

This will tell me whether it is a dynesty-specific issue or an autofit-specific issue.

joaofrancafisica commented 2 years ago

It seems to work with Emcee, although the gain in performance was not what I was expecting:

My laptop, 1 core: ~1000 seconds
Cluster, 1 core: ~1000 seconds
Cluster, 35 cores: ~500 seconds

Also, during the sampling, the CPU usage was only about 20%.

Jammy2211 commented 2 years ago

I will have a think.

It may be worth profiling 4 and 8 cores (for both emcee and dynesty), along the lines of the loop sketched below. When there are too many cores, the overhead of passing information between processes can overwhelm the speed-up and actually slow things down.
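A rough way to do that profiling (an untested sketch reusing the model_0 and analysis_0 objects from the script above; the run names are hypothetical so each core count gets its own output folder):

import time

for n_cores in [1, 4, 8, 16, 35]:
    search = af.DynestyStatic(path_prefix='./',
                              name=f'profile_dynesty_{n_cores}_cores',
                              nlive=250,
                              number_of_cores=n_cores)
    start = time.time()
    search.fit(model=model_0, analysis=analysis_0)
    print(f"{n_cores} cores: {time.time() - start:.0f} s")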

joaofrancafisica commented 2 years ago

I think you are right. I tried to run with n_cores=4 and I was able to see the processes. It wasn't as fast as on my laptop (694/825 sec), but I will make sure I have the same package versions. Thank you so much!

Jammy2211 commented 2 years ago

I think parallelization is probably working OK then, but that it just is not giving a huge speed-up.

This is common for the runs we do, with a ~5x speed-up across 25 cores (and often diminishing returns with more cores). Packages like dynesty unfortunately don't parallelize particularly efficiently.

Your laptop being faster could be because its hardware runs PyAutoLens faster natively than the supercomputer -- this is not all that uncommon.
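As a back-of-the-envelope check (my own arithmetic, not a measurement), Amdahl's law says a ~5x speed-up on 25 cores is consistent with roughly 17% of the run time being serial:

def amdahl_speedup(serial_fraction, n_cores):
    # Amdahl's law: speed-up = 1 / (s + (1 - s) / N)
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

print(amdahl_speedup(0.17, 25))  # ~4.9x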

joaofrancafisica commented 2 years ago

(screenshot: 20221013_011148)

Yeah. I noticed that the numba version on the cluster was a bit different from the one on my laptop, so I decided to export the environment. For low core counts the parallelization is working fine, at roughly the same speed. I will try a higher number again, but it is unfortunate that the parallelization is not more efficient.