For mksurfdata_map: use maps with no source masking, applying mask separately

billsacks commented 6 years ago

@swensosc suggested this 2015-10-19, and it seems like a good idea to me: When mapping files from their raw data grid to CLM resolutions, we could use maps with no source masking, and then apply the mask in a separate step.

We currently have a LOT of mapping files from the mksurfdata_map raw data files to the CLM grids. Much of the reason we have so many is that we have a separate set of mapping files for each raw data mask - e.g., even if many of the raw data files are at the same 3' resolution, we need different mapping files for the different masks.

Sean pointed out that we should be able to use mapping files without masks, and then tweak the mapping algorithms to apply the source masks separately. I think we do things like that in other parts of CESM (e.g., in the coupler?). This would greatly reduce the number of mapping files we need to maintain. Furthermore, if a raw dataset is updated, and this update involves changing the mask, you wouldn't need to remake mapping files. (This was Sean's original motivation, as he is updating the lake dataset in this way.) Instead, mksurfdata_map would simply read the mask off of the (updated) raw data file.

ekluzek commented 6 years ago

My understanding of how ESMF regridding works is that this isn't something you can do, but I could be wrong. The mapping files that get created have the masks inherently embedded in them, and I don't know of an easy way to extract them. You could create the maps without a mask, but then when you run the mapping you will be averaging in data that lies outside the mask. That's what I think we want to ensure doesn't happen.

I don't really think the burden is that high for carrying around files at the same resolution but with multiple masks. Right now there are three half-degree, five 3x3 minute, and two 10x10 minute grids. So you'd get some speedup from this, but the 1km-merge-10min_HYDRO1K-merge-nomask grid is still the one that takes by far the most time.

billsacks commented 6 years ago

@ekluzek Unless I'm overlooking something: You can have masks embedded in the grid files when creating the mapping files, but you don't have to. If you don't, then you need to do a bit more work in the mapping routine, but we actually already have code in place to do this: gridmap_areaave_srcmask in mksurfdata_map. This is used when the source mask isn't known ahead of time.

You may be right that the burden isn't that high for the different grid files, but the burden is higher for the combinatoric mapping files. I know we've talked about moving away from storing all of them eventually, though.
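To make the combinatoric point concrete, here is a back-of-the-envelope count using the source grid numbers quoted earlier in this thread (three half-degree, five 3x3 minute, and two 10x10 minute grids). The number of destination grids is a made-up placeholder, and the 1km-merge-10min_HYDRO1K-merge-nomask grid is left out because it already has no mask.

```python
# Number of distinct source masks per raw-data resolution, as quoted above.
masked_src_grids = {"0.5 degree": 3, "3x3 minute": 5, "10x10 minute": 2}

# Hypothetical number of destination (CLM) grids; not a figure from this thread.
n_dst = 20

# One mapping file per (source grid, destination grid) pair.
with_masks = sum(masked_src_grids.values()) * n_dst
without_masks = len(masked_src_grids) * n_dst

print(with_masks, without_masks)
```

Under these assumptions, dropping the source masks cuts the mapping file count from 10 per destination grid to 3 per destination grid, regardless of how many destination grids there are.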

In the end, I don't have strong feelings about whether this should be done. I think it's a good idea, but I'm not sure if it gains us enough to be worth the development time.

mvertens commented 6 years ago

I think the right approach is to create these mapping files on the fly - and not store them. I had demonstrated that this was feasible and showed acceptable performance by simply translating the input format of the input grid files. Only one file did not work well with this. I think that this is the approach that should be pursued rather than still working with storing mapping files.

billsacks commented 6 years ago

@mvertens I don't disagree. However, I'll point out that implementing this suggestion could help at least as much if we're generating mapping files on the fly, because we'd only need to generate, say, 1/2 or 1/3 as many mapping files.

ekluzek commented 6 years ago

@mvertens as @billsacks says, yes, if that line of development (creating mapping files for mksurfdata_map on the fly) is taken up again, this change should happen along with it. It will both shorten the time needed to make the mapping files and, as @billsacks pointed out, minimize how many are required. If we move to a paradigm of creating them on the fly, we want to create as few as possible, as fast as possible. The time to create them needs to be sufficiently short, though, for that to become the standard mechanism.

slevis-lmwg commented 5 years ago

Tasks that I see mentioned above...

Sean's suggestion:

- Separate the masks from the mapping files to reduce the number of mapping files needed
- Keep the masks in new mask files and use them when needed

Mariana's suggestion:

- Create the mapping files on the fly, while running mksurfdata_map. I expect that this is within the scope of #644

slevis-lmwg commented 5 years ago

Corrected previous post to say #644

billsacks commented 5 years ago

@slevisconsulting - yes, Mariana's suggestion is in the scope of #644, so this issue relates to Sean's suggestions, which we thought were a good idea for the sake of both dataset management and efficiency. Note that this will require changes to mksurfdata_map as well as to the scripts / xml related to our dataset management.

billsacks commented 5 years ago

Once this issue is resolved, we can more easily resolve #8.

Blocks #8.

slevis-lmwg commented 5 years ago

An update: To avoid unnecessary work in #815 I have switched my attention to the present issue (#286).

qsub regridbatch.sh is running right now with changes in mkmapdata.sh, namelist_defaults_ctsm.xml, namelist_defaults_ctsm_tools.xml, and namelist_definition_ctsm.xml. These changes replace the numerous SRC files containing various masks with one SRC file per SRC resolution that contains grid_imask = 1.

slevis-lmwg commented 5 years ago

For now, regridbatch.sh seems to be working for all the source (SRC) and destination (DST) resolutions except 1km-merge-10min_HYDRO1K-merge-nomask.

slevis-lmwg commented 5 years ago

Resubmitted with a couple of changes, and it now seems to work for the 1km-merge-10min_HYDRO1K-merge-nomask SRC resolution.

I will open a PR soon to share my code mods to-date.

ESCOMP / CTSM

For mksurfdata_map: use maps with no source masking, applying mask separately #286