Ergonomic improvements to concatenation scripts

mabruzzo commented 6 days ago

This introduces a few changes to the concatenation scripts. These are all surface-level changes (they don't affect the output format -- that is still something I want to address).

There are essentially 3 changes:

first, I introduced a new script called concat.py that can be used to concatenate more than one kind of dataset at a time (3D-cubes or multiple kinds of 2D datasets).
next, I replaced --concat-outputs with --snaps (this is a subjective improvement)
- The main differenece has to do with specifying ranges of snapshots with --snaps is of the form START:STOP[:STEP] (just like a python slice). For comparison, you use would specify a range with FIRST-LAST with --concat-outputs (LAST is included in the range)
- You can still use --concat-outputs, but you get a deprecation warning
- I definitely want to hear if people dislike this approach
finally, I made it possible to automatically detect the number of files that need to be concatenated. You can still specify --num-processes, but you get a warning about it.

I also haven't really touched the particle-concatentation script. Based on how code is shared, change number 2 will affect that script. But propagating change 3 would take some additional work.

I'm definitely open to any kind of feedback!

In a subsequent PR, it's my intention to modify things so that we can optionally use multiprocessing or mpi4py to speed up concatenation (I have an idea on how to do that in a minimally invasive manner that won't affect hypothetical dask-usage).

evaneschneider commented 5 days ago

This all seems fine to me, but I'd like @helenarichie or @bcaddy to take a look, since I think they are the main ones who have been using the scripts in their existing form.

helenarichie commented 5 days ago

I think this looks good! I like having the default behavior be that you don't have to specify the number of processes. It might be nice to also do that with the snapshot numbers at some point. I also don't have any issues with changing the way the snapshot numbers are specified.

I am having a little bit of trouble figuring out which arguments are required/what the default behavior is and was wondering if you could clarify. For example, if I run this and my input directory contains 2D and 3D snapshots will it just go ahead and concatenate all of them unless I tell it to omit certain types of output? Sorry if I'm missing something obvious!

It would also probably be good to update the documentation with these changes.

mabruzzo commented 5 days ago

I think this looks good! I like having the default behavior be that you don't have to specify the number of processes.

And it is less error-prone!

It might be nice to also do that with the snapshot numbers at some point.

I think that could definitely be nice! One of my goals here is achieving more sensible default behavior with fewer arguments (along those lines, it could be cool to do something along those lines with the -s or -o flag)

My next primary concern is introducing some level of parallelism.

I am having a little bit of trouble figuring out which arguments are required/what the default behavior is and was wondering if you could clarify. For example, if I run this and my input directory contains 2D and 3D snapshots will it just go ahead and concatenate all of them unless I tell it to omit certain types of output? Sorry if I'm missing something obvious!

The behavior of concat_3d_data.py and concat_2d_data.py are unchanged in this regard.

For concat.py you need to opt in for each type of output with the --kind parameter. Consequently,

--kind 3D just does 3D data
--kind slice just does slices (--kind slice-xy,xz will omit data from the yz datasets).
--kind proj just does axis-aligned projections
--kind rot_proj just does rotated projections
--kind 3D slice proj rot_proj would do all of the above

Once we add support for concatenating particles, I might support --all-kinds to handle all types of supported datasets)

It would also probably be good to update the documentation with these changes.

I'm happy to do that. I might hold off until I have your approval/this is merged.

bcaddy commented 5 days ago

This is great. I'm totally for it

helenarichie commented 5 days ago

Ah, I see. I had gotten myself a little confused before... I appreciate the clarification!

I think this is good to go!

evaneschneider commented 1 day ago

I'll go ahead and merge this, @mabruzzo feel free to update the wiki when you get a chance.

cholla-hydro / cholla

Ergonomic improvements to concatenation scripts #403