DedalusProject / dedalus

A flexible framework for solving PDEs with modern spectral methods.
http://dedalus-project.org/
GNU General Public License v3.0
513 stars 121 forks source link

Enables merging of virtual files #184

Closed evanhanders closed 2 years ago

evanhanders commented 2 years ago

This PR fixes #182 by adding two functions to tools/post.py (merge_virtual_analysis(), merge_virtual_file()) and updating merge_setup() to allow virtual files to be merged into single files.

This file also brings merge_setup() from d2 -> d3, merging together the scales (which are now distributed and hashed, unlike in d2).

The workflow of the virtual file merge is:

  1. Move virtual file from e.g., snapshots_s1.h5 -> tmp_snapshots_s1.h5
  2. Create 'full' file from data in merged file called snapshots_s1.h5
  3. Delete tmp_snapshots_s1.h5

The workflow of the full merge from partial files is identical to d2 (creates snapshots_s1.h5 from snapshots_s1/snapshots_s1_p*.h5).

@kburns and @jsoishi , it would be great to have both of you review this; I don't think I am allowed to assign reviewers.

kburns commented 2 years ago

Thanks! Hmm it's a good question about how to name things. Maybe what we should do is add something to the names of the virtual files when we initially create them, like snapshots/snapshots_s1.vh5 or snapshots/snapshots_s1v.h5 or snapshots/snapshots_s1_virtual.h5. And then the merged file could have the regular name snapshots/snapshots_s1.h5. I like this because in a world where parallel hdf5 worked perfectly, we would just be directly writing to the snapshots/snapshots_s1.h5 joint file without process files in the first place. Any thoughts @jsoishi?

jsoishi commented 2 years ago

I agree that we should reserve snapshots/snapshots_s1.h5 for the truly merged file because of the hope that someday we will have a functional parallel hDF5 mechanism, and I think that changing the file extension would be confusing, since .h5 is such an established standard.

What about snapshots/virt_snapshots_s1.h5? I'm afraid that snapshots_s1v.h5 might be too subtle...

On Sat, Apr 2, 2022 at 10:34 AM Keaton J. Burns @.***> wrote:

Thanks! Hmm it's a good question about how to name things. Maybe what we should do is add something to the names of the virtual files when we initially create them, like snapshots/snapshots_s1.vh5 or snapshots/snapshots_s1v.h5 or snapshots/snapshots_s1_virtual.h5. And then the merged file could have the regular name snapshots/snapshots_s1.h5. I like this because in a world where parallel hdf5 worked perfectly, we would just be directly writing to the snapshots/snapshots_s1.h5 joint file without process files in the first place. Any thoughts @jsoishi https://github.com/jsoishi?

— Reply to this email directly, view it on GitHub https://github.com/DedalusProject/dedalus/pull/184#issuecomment-1086653146, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA7SHYPHY7IQKLT2W2SB47TVDBLFVANCNFSM5Q7HZNRQ . You are receiving this because you were mentioned.Message ID: @.***>

kburns commented 2 years ago

Maybe the merge script should just overwrite the virtual file, using the same name? That way no plotting scripts, etc., downstream have to adjust to the change. I think this makes sense since the point of virtual files is to appear just like merged files, so in theory nothing in the workflow should care if the file is virtual or concrete. This would also avoid having both virtual and concrete merged files side-by-side, which could result in duplicated plots or similar if you're doing analysis like python3 plot_file.py snapshots/*.h5.

jsoishi commented 2 years ago

Even better idea. +1

On Mon, Apr 4, 2022 at 10:33 AM Keaton J. Burns @.***> wrote:

Maybe the merge script should just overwrite the virtual file, using the same name? That way no plotting scripts, etc., downstream have to adjust to the change. I think this makes sense since the point of virtual files is to appear just like merged files, so in theory nothing in the workflow should care if the file is virtual or concrete. This would also avoid having both virtual and concrete merged files side-by-side, which could result in duplicated plots or similar if you're doing analysis like python3 plot_file.py snapshots/*.h5.

— Reply to this email directly, view it on GitHub https://github.com/DedalusProject/dedalus/pull/184#issuecomment-1087635177, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA7SHYKXJIAHHGADUVUGAMTVDL4R3ANCNFSM5Q7HZNRQ . You are receiving this because you were mentioned.Message ID: @.***>

evanhanders commented 2 years ago

This was actually the first implementation I tried (just replacing the virtual file with a merged file of the same name), but I was afraid this was a bit risky because it basically sets cleanup=True by default. Is that OK though?

I think there's two (almost identical) ways to make this happen. What's your preference?

  1. move virtual file from snapshots_s1.h5 -> virt_snapshots_s1.h5. Merge it into snapshots_s1.h5. delete virt_snapshots_s1.h5 and snapshots_s1/.
  2. merge snapshots_s1.h5 into merged_snapshots_s1.h5. delete snapshots_s1.h5 and snapshots_s1/. Move merged_snapshots_s1.h5 -> snapshots_s1.h5
kburns commented 2 years ago

I think the first isn't too bad, and it doesn't have to (and probably shouldn't) delete the process files by default, just overwrite or move/delete the virtual merged file in place of the concrete merged file. We can also clean things up to make it easier to recreate the virtual merged file from the process files in post-processing, instead of just online.

lecoanet commented 2 years ago

It would also be useful to have a merge that works without a virtual file.

kburns commented 2 years ago

Marking as a draft since we're still WIP. I'm free the next few days/week if we want to push more on this.

evanhanders commented 2 years ago

I'm happy to make a push ~tomorrow and early next week to get this together. Just to check, here's what we want to do, right?

  1. Clean up the 'merged_' files that this does right now (as we discussed a few weeks ago
  2. Get a merge that creates the virtual file working

Anything else?

evanhanders commented 2 years ago

@kburns, @jsoishi, @lecoanet -- This PR has been updated with the requested changes:

  1. Virtual files now merge 'in-place', so the end product of merging a virtual file called 'snapshots_s1.h5' is a merge file of the same name.
  2. Full files can now be merged from partial files, and the distributed, hashed scales are properly merged.
kburns commented 2 years ago

Thanks!

jsoishi commented 2 years ago

BTW: some incoming changes to HDF5 files in d3 on my fork.

On Thu, Apr 28, 2022 at 3:15 PM Evan Anders @.***> wrote:

I'm happy to make a push ~tomorrow and early next week to get this together. Just to check, here's what we want to do, right?

  1. Clean up the 'merged_' files that this does right now (as we discussed a few weeks ago
  2. Get a merge that creates the virtual file working

Anything else?

— Reply to this email directly, view it on GitHub https://github.com/DedalusProject/dedalus/pull/184#issuecomment-1112567324, or unsubscribe https://github.com/notifications/unsubscribe-auth/AA7SHYN2Z7K3CUTRLDO3VYLVHLPT3ANCNFSM5Q7HZNRQ . You are receiving this because you were mentioned.Message ID: @.***>