cbyrohl / scida

scida is an out-of-the-box analysis tool for large scientific datasets. It primarily supports the astrophysics community, focusing on cosmological and galaxy formation simulations using particles or unstructured meshes, as well as large observational datasets. This tool uses dask, allowing analysis to scale.
https://scida.io
MIT License
28 stars 4 forks source link

Question: Tips needed to make the code faster #67

Closed astronerdF closed 1 year ago

astronerdF commented 1 year ago

Hi! This is not a bug in the code. I have a task for which I have prepared a code to the best of my abilities. I want to know if there is any way to make it run faster.

I want to get the data of gas particles within a simulation snapshot that exists outside the R200_crit distance from the center of each halo. In other words, I want to separate the gas that exists outside the halo vs inside the halo. The code that I am using to do that for 1 halo is:

basePath = "/virgotng/universe/IllustrisTNG/TNG100-3/output/"
ds = load(basePath+"/snapdir_099",units=False)
gas99 = ds.data["gas"] 
group = ds["Group"]
groupPos = group["GroupPos"][0]
R200_crit = group["Group_R_Crit200"][0]
# Getting the halo gas data from the subset at 3 R_200crit
# Calculate the vector differences between each gas particle position and the groupPos
delta = gas99["Coordinates"] - groupPos

# Calculate the Euclidean distance for each particle from the groupPos (we use axis=1 as the coordinates are 3D)
distances = np.sqrt(np.sum(delta**2, axis=1))

# Identify the particles within 3 times R200_crit from the groupPos
mask = distances <= 3 * R200_crit

# Select the particles that meet the condition
#particles_within_R200 = gas99[mask.compute()]
particles_within_R200 = {key: gas99[key][mask] for key in gas99.keys()}

On the vera cluster, it took roughly 4.4 seconds for the first halo. For all the halos in the simulation, it could take a very long time.

This is my code to remove the halo gas:

from tqdm import tqdm

ds = load(basePath+"/snapdir_099",units=False)
gas99 = ds.data["gas"] 
group = ds["Group"]
groupPos = group["GroupPos"]
R200_crit = group["Group_R_Crit200"]

def calculate_mask(i, mask, pos, R200_crit, coords):
    # Calculate the vector differences between each gas particle position and the pos
    delta = coords - pos

    # Calculate the Euclidean distance for each particle from the pos (we use axis=1 as the coordinates are 3D)
    distances = np.sqrt(np.sum(delta**2, axis=1))

    # Identify the particles within R200_crit from the pos
    halo_mask = distances < R200_crit

    # Update the mask to exclude these particles
    updated_mask = mask & ~halo_mask

    return updated_mask

# Start with a mask that includes all particles
mask = np.ones(len(gas["Coordinates"]), dtype=bool)

# Calculate masks for each groupPos and R200_crit
for i in tqdm(range(len(groupPos)), desc='Calculating masks'):
    mask = calculate_mask(i, mask, groupPos[i], R200_crits[i], gas["Coordinates"])

# At this point, the mask should include only the particles that are outside all halos
particles_outside_halos = {key: gas[key][mask] for key in ["CenterOfMass","Coordinates","Masses","Density","GFM_Metallicity","GroupID","ParticleIDs","Velocities"]}

I tried parallelizing the code as well as turning some of the arrays into numpy arrays, but it did not work and somehow was slower.

cbyrohl commented 1 year ago

This is going to be hard without another spatial ordering, i.e. using scipy's KDTree. However that means you are leaving the dask framework.

Why specifically masking with Rvir200? If you would be okay with masking "non-halo" particles (i.e. those not associated with any halo), then this task would be done immediately as the particle ordering allows for that.

astronerdF commented 1 year ago

How to mask with the non-halo particles? Right now, I need the average metallicity of the IGM gas, but in the future I need the mass inflow rate of gas particles within the Rvir200.

At the same time, what defines whether a particle belongs to a halo or not?

cbyrohl commented 1 year ago

Could you update the repository and try data = ds.return_data(unbound=True) to obtain all unbound particles?

This should be the most efficient way because it only reads memory that holds unbound particles. I recommend using this method (if it works). In theory, you could also use masking with d["PartType0"]["GroupID"] == 9223372036854775807. (The groupID is set to max_int64_val for unbound gas).

cbyrohl commented 1 year ago

TBD further offline.

cbyrohl commented 1 year ago

This should be the most efficient way because it only reads memory that holds unbound particles. I recommend using this method (if it works). In theory, you could also use masking with d["PartType0"]["GroupID"] == 9223372036854775807. (The groupID is set to max_int64_val for unbound gas).

Note: max_int64_val is set for unbound gas right now due to an issue with dask, it might eventually change back to -1 in the future. Use "ds.misc['unboundID']" for the ID of unbound gas to stay consistent in the future.