LSSTDESC / DESC_DC2_imSim_Workflow

BSD 3-Clause "New" or "Revised" License

Add a step at the end of the processing to store AGN checkpoint files to HPSS #34

Open heather999 opened 5 years ago

heather999 commented 5 years ago

For Run2.2i processing, checkpoint files will be produced and saved to tape in case we later decide to use them for AGNs. It would be nice to add this to the simulation processing. As mentioned here: https://github.com/LSSTDESC/DC2-production/issues/368#issuecomment-531253873

I should note that the HPSS area for DESC at NERSC is only writable by the desc collaboration account, so I'm not certain we can use Globus.

villarrealas commented 5 years ago

If we can't use Globus, I don't see any way around this being an inordinately slow step. Can we ping NERSC on the subject?

heather999 commented 5 years ago

We can inquire, but my thinking wasn't to use Globus, but rather the xfer queue: https://docs.nersc.gov/jobs/examples/#xfer-queue. So that we can communicate our needs to NERSC, do you have some ideas about how the job needs to be set up? I am hoping we can use htar to bundle the checkpoint files as part of the transfer step, for example, so they land on HPSS as tarballs.
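For reference, a minimal sketch of what such an xfer-queue job could look like, going by the NERSC docs linked above. The job name, time limit, and all paths below are placeholders, not actual Run2.2i values:

```bash
#!/bin/bash
#SBATCH --qos=xfer
#SBATCH --time=12:00:00
#SBATCH --job-name=agn_ckpt_to_hpss
#SBATCH --licenses=SCRATCH

# Per the NERSC docs, xfer jobs run on a single core with HPSS access,
# so htar can be invoked directly (no srun needed).
# Source and destination paths here are hypothetical.
cd "$SCRATCH/Run2.2i" || exit 1
htar -cvf /home/projects/desc/checkpoints/agn_ckpt_batch01.tar checkpoints/batch01
```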

villarrealas commented 5 years ago

Hrm. I could conceivably set things up for an xfer queue job. My gut says to write a find script that executes an htar command on each individual file; the tricky part would be figuring out how to keep the directory structure in place.
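A rough sketch of that approach, assuming the checkpoints sit under a single top-level directory (the paths below are made up): htar stores member paths exactly as given on the command line, so archiving relative paths from a common parent preserves the directory layout on extraction.

```bash
#!/bin/bash
# One htar archive per checkpoint subdirectory; relative member paths
# keep the directory structure inside each archive.
# All paths here are hypothetical placeholders.
cd "$SCRATCH/Run2.2i/checkpoints" || exit 1
for d in */; do
    name=${d%/}
    htar -cvf "/home/projects/desc/checkpoints/${name}.tar" "$d"
done
```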

An alternative would be a Python script (which would make it easier to sort out the paths), but that has the downside of srun being fairly slow on average, which could cost us significant time.

heather999 commented 5 years ago

We would want to run htar such that we end up with files 100-500 GB in size, so I'm imagining each htar file could contain many checkpoint files, depending on how large they are individually. It may be easier to handle the transfer outside of the simulation pipeline. I've found my xfer queue jobs handling 500 GB complete in less than an hour.
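One way to get archives in that size range, sketched under the assumption that individual checkpoint files are much smaller than the target: accumulate files until a size threshold is reached, then htar the batch. The threshold, glob pattern, and HPSS destination below are illustrative, not agreed-upon values.

```bash
#!/bin/bash
# Group checkpoint files into roughly 500 GB batches and write one
# htar archive per batch. Paths and sizes are placeholders.
cd "$SCRATCH/Run2.2i/checkpoints" || exit 1
target=$((500 * 1024**3))   # ~500 GB in bytes
batch=(); size=0; n=0
while IFS= read -r -d '' f; do
    batch+=("$f")
    size=$((size + $(stat -c %s "$f")))
    if (( size >= target )); then
        # Very large batches may exceed the shell's command-line length
        # limit; htar can also read member names from a list file.
        htar -cvf "/home/projects/desc/checkpoints/ckpt_$(printf '%03d' "$n").tar" "${batch[@]}"
        batch=(); size=0; n=$((n + 1))
    fi
done < <(find . -type f -name '*.ckpt' -print0)
# Flush the final partial batch, if any.
if (( ${#batch[@]} > 0 )); then
    htar -cvf "/home/projects/desc/checkpoints/ckpt_$(printf '%03d' "$n").tar" "${batch[@]}"
fi
```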

villarrealas commented 5 years ago

This would definitely be a separate script done in xfer queue as opposed to something during the main pipeline.

I had forgotten about the need to keep file sizes fairly large. In that case, I don't see an obvious way to preserve the directory structure in the archived files without waiting for the job to finish and eating that initial storage hit on $SCRATCH. Then there's the question of whether checkpoints should be stored separately from the outputs.

katrinheitmann commented 5 years ago

So is it not possible for Antonio to move the files under his own name using Globus, and then have somebody at NERSC do some "magic" to move them into DESC's HPSS storage?


heather999 commented 5 years ago

Yes, if Antonio can move the checkpoint files into some directory with read access granted to the lsst group, then I think it still falls to us to make that transfer into DESC HPSS. I can do that, I just need a coherent directory structure and some indication that a set of checkpoint files is ready for transfer. I was just hoping there was a way we could automate this step better. Perhaps it's a matter of coming up with a better system to handle writing to HPSS overall.
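One possible convention for the "ready for transfer" signal, purely as a sketch: the producing job writes a manifest plus a sentinel file once a batch is complete, and the transfer job only picks up batches carrying the sentinel. The file names here are made up for illustration.

```bash
# Producer side: mark a batch as complete once all files are written.
# MANIFEST.txt and READY_FOR_HPSS are hypothetical names.
cd "$SCRATCH/Run2.2i/checkpoints/batch01" || exit 1
find . -type f -name '*.ckpt' | sort > MANIFEST.txt
touch READY_FOR_HPSS

# Transfer side: only pick up batches carrying the sentinel file.
for d in "$SCRATCH"/Run2.2i/checkpoints/*/; do
    [ -e "${d}READY_FOR_HPSS" ] && echo "ready for htar: $d"
done
```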

katrinheitmann commented 5 years ago

Actually, I was wondering if it could be done once the data is already on HPSS.


heather999 commented 5 years ago

OK, now I understand. You want Antonio to create the htar archives of the checkpoint files and write them to his own area in HPSS using Globus, and then we ask NERSC to reassign those files to DESC's HPSS area? We can inquire; I'll open a NERSC ticket and see what they say.

heather999 commented 5 years ago

I confirmed with NERSC that we can actually adjust the permissions so Antonio can write into the HPSS area for DESC. This will allow the use of Globus. These files must be htar'd into 100-500 GB chunks before transfer onto tape. I will be granting permissions in a specific area of the shared HPSS directory tree (path TBD). We'll have to test this out.
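Once the permissions are sorted, the archives can be sanity-checked directly against HPSS. Both of these are standard HPSS client commands, though the paths below are placeholders until the shared area is decided:

```bash
# List what has landed in the DESC HPSS area (placeholder path).
hsi ls -l /home/projects/desc/checkpoints

# Check an archive's contents and index without retrieving the data.
htar -tvf /home/projects/desc/checkpoints/agn_ckpt_batch01.tar
```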