Wetland area and areas of landunits in coastal grid cells should be determined more rigorously

Brief summary of bug

I believe that there are some problems with how we calculate wetland area, both in mksurfdata_map and at runtime. These problems lead to a situation where I believe there is currently no correct answer to the question, "Should PCT cover raw datasets specify the percent of the total grid cell or just the percent of the land area in which the given landunit appears?"

@olyson @lawrencepj1 @dlawrenncar @wwieder @ekluzek @slevisconsulting @mvertens

General bug information

CTSM version you are using: master

Does this bug cause significantly incorrect results in the model's science? Yes, at least to some degree – though errors are probably relatively small

Configurations affected: All

Background / motivation

@olyson recently asked:

Should the raw input dataset fields (e.g., PCT_URBAN in 0.05deg raw input urban dataset) represent the fields in terms of a percentage of total gricell area, or in terms of a percentage of the total land area in the gridcell?

After giving this some thought, I have come to the conclusion that either choice will do some things correctly and some things incorrectly. I feel that it would be helpful if we could provide a clear answer to this question and have the resulting landunit fractions in the model be consistent with the raw datasets.

I should note that I have not carefully examined all of the relevant code or tried running various scenarios: My thoughts below are based mainly on my recollections of how mksurfdata_map works. Also note that, although I refer to mksurfdata_map in my comments, really I don't imagine any changes being made to that deprecated code: all changes can be applied to the new mksurfdata_esmf.

Problems with the current formulation

Problems with specifying raw dataset PCT cover as percent of the entire grid cell

(A) If a raw dataset (e.g., for glacier) specifies PCT cover as percent of the entire grid cell, for a grid cell that is partially ocean, then we can end up with too much vegetated area and too little area of this special landunit in the final surface dataset.

For example, consider the idealized scenario where all raw data sets are at the same resolution, all raw data sets agree about the land/ocean mask/fraction, the model resolution matches this raw data set resolution, and the ocean model grid also agrees exactly with this land/ocean fraction. Let's consider a 0.05 deg grid cell in this scenario where the raw data sees 98% ocean, 1% glacier and 1% vegetated. If the glacier raw data set encoded this as 1% glacier, then I think that mksurfdata_map would create a model grid cell that is 1% glacier and 99% vegetated. When run in the model, with a landfrac of 2%, the final grid cell cover would be 98% ocean (i.e., owned by the ocean model), 0.02% glacier and 1.98% vegetated; this would be wrong. If, instead, the glacier raw data set encoded this as 50% glacier, then I think mksurfdata_map would create a model grid cell that is 50% glacier and 50% vegetated. When run in the model, with a landfrac of 2%, the final grid cell cover would be 98% ocean, 1% glacier and 1% vegetated, as desired. So for this idealized scenario where all data sets agree in their resolution and land fraction and no mapping is needed, I think we would want raw data sets to give percent cover in terms of percentage of just the land area in the grid cell, at least given the current treatments of wetland area.

Problems with specifying raw dataset PCT cover as percent of just the land area

However, there are also problems with doing the reverse – specifying raw dataset PCT cover as percent of just the land area. I can think of two off-hand:

(B) Averaging multiple source grid cells to form a destination cell: Consider a model (destination) grid cell that exactly contains two raw data (source) grid cells. One of the source grid cells is as above – with 98% ocean, 1% glacier and 1% vegetated; if we encode glacier area as percent of just the land area, the raw dataset would prescribe 50% glacier area. The other source grid cell is 100% land, but has 0% glacier. As above, let's assume that the ocean grid agrees exactly with the land fractions on the raw data sets. Then the resulting land grid cell would be 51% land and 49% ocean. The glacier landunit in this grid cell would cover 25% of the land area – since it is determined by averaging the 50% in grid cell 1 with 0% in grid cell 2. This is much more glacier area than there should really be: since grid cell 1 is mostly ocean, its percent glacier cover should have gotten much less weight in forming the average, and the actual glacier percent over the land portion of this grid cell should be a little less than 1%. (We could handle this by having a land fraction on each raw data set and including that in the mapping, but this could be hard to get right if different raw data sets have different land masks, and I believe that there is a simpler solution, as I explain below.)

(C) Filling missing areas with wetlands: Another scenario is one where the ocean model grid doesn't resolve the coast line exactly, so CTSM needs to fill in some missing areas with wetland. Our logic for how much wetland area to apply is currently inconsistent: see https://github.com/ESCOMP/CTSM/issues/1001#issuecomment-678518813 , where I state:

I think that the faulty logic is in what we do with wetland % in grid cells with pftdata_mask > 1e-6: I'm starting to think that our handling of wetlands with respect to lake and glacier is correct (in places with pftdata_mask = 0, fill the remainder of the grid cell with wetlands where it wasn't filled with glacier or lake), and that the real problem is that, anywhere where we have pftdata_mask > 1e-6, we fill the whole grid cell with vegetation.

(see that issue comment for more elaboration). However, I believe that comment is only correct if the glacier and lake raw data sets give percent cover in terms of percentage of the entire grid cell.

Consider a scenario like that in (A), but now the ocean model grid does not fully resolve the coast line, so the land model grid cell in question ends up with 100% landfrac. In this case, we would want a wetland area of 98%. I think there is currently no way to achieve this 98% wetland cover in this scenario, but we certainly cannot achieve it with a glacier raw data set that prescribes 50% glacier cover, as suggested in (A).

(This was my motivation for this comment last year: https://github.com/ESCOMP/CTSM/issues/1406#issuecomment-871555350 )

Suggested path forward

I believe that we can handle these scenarios and others correctly if we put in place the following conventions and changes:

(1) Raw data sets of PCT cover should represent the fields in terms of percentage of the total grid cell area. This keeps the averaging of raw data grid cells (scenario (B)) simple, without requiring use of the land fraction associated with each raw data set. This may already be how things are done; I'm not sure.

(2) Change the setting of wetland cover in mksurfdata_map so that it fills all area outside the current raw data set masks, rather than the current logic where wetland area is only added if pctlnd_pft(n) < 1.e-6_r8. More exactly, I think that it would make sense to consider two areas: (a) the mapped land fraction given by pctlnd_pft, and (b) the total special landunit area. It is important to consider (b) so that we don't overwrite some special landunits with wetland where these special landunits extend beyond the PFT data's land fraction. (This is particularly important for glaciers, where we can have floating ice shelves: see https://github.com/ESCOMP/CTSM/issues/545 .) We could probably do even better by considering each raw data set's notion of the land fraction, but thinking about these multiple, disagreeing land fractions makes my head hurt. So I propose treating the PFT data's land fraction as the definitive land fraction for the sake of setting wetland area, but imposing a limit that ensures that it doesn't overwrite any special landunits: something like:

pctwet(n) = min(100._r8 - pctlnd_pft(n), 100._r8 - pctgla(n) - pctlak(n) - pcturb(n))
pctwet(n) = max(pctwet(n), 0._r8)

In addition to setting wetland area more correctly, this formulation will also deal with scenario (A): The problem with the current code that leads to the issue in scenario (A) is that, for a grid cell with pctlnd_pft(n) > 1e-6, there is no wetland area and the natural vegetated landunit is given all of the remaining area in the grid cell. With the proposal here, most of the grid cell in scenario (A) would be filled with wetland, and natural vegetation would only fill the remainder: 1% of the grid cell, as desired.

Notes about the above:

I am assuming that pctlnd_pft is formed by mapping a field that specifies the land fraction – not just land mask – in each source grid cell. That is, I'm assuming that the source data for this field can be any value between 0 and 1 on the raw data grid. I'm not sure whether that is currently the case.
Note that all special land units are now included here, in contrast to the current code for setting wetland area, which ignores urban area. (Maybe that should have been changed in the context of https://github.com/ESCOMP/CTSM/pull/1587 but I didn't think of it then.) I am not including crop area... I think it's correct not to include crop area in this because I think it's effectively handled via pctlnd_pft in this respect (though I'm not positive about this).
- Update (2022-11-17): I think I will include crop area: see https://github.com/ESCOMP/CTSM/issues/1716#issuecomment-1318748012
This isn't completely correct in some cases – e.g., consider the case where pctlnd_pft(n) = 50 and pctgla(n) = 50, but all of the glacier cover is floating ice shelves that are complementary to the land fraction. Then we would really want pctwet(n) to be 0, but we would end up with it being 50. I think that handling this entirely correctly could be done with some extra mappings, but I think we can live with this error.

(See also https://github.com/ESCOMP/CTSM/issues/1001#issuecomment-678518813 for a similar suggestion, though I think I got the exact solution wrong there.)

Update (2022-04-29): If we wanted to accommodate inland wetlands (which are no longer used by default but may still be allowed), then we would need to add the above expression to any existing wetland areas read from the wetland raw dataset.

(3) Change CTSM's runtime code so that, in grid cells that have landfrac < 1, wetland area is preferentially decreased and then the remaining landunit areas (including any remaining wetland areas) are renormalized appropriately. This is needed because, within CTSM, we think of each landunit's weight as being the fractional area of the land-owned portion of the grid cell, not the fractional area of the entire grid cell. I'm not aware of this correction being done currently, but I'm pretty sure that it should be done – at least if we do the other things suggested above. This is particularly important with the change proposed in (2), since now we will more often have fractional wetland areas in a grid cell. (Without the change in (2), I think it has often – though not always – been the case that wetland area is either 0% or 100% of a grid cell, so this change (3) hasn't been as important.)

Care will need to be taken here to ensure that this same reweighting is also applied to transient land cover – including possibly the dynamic glacier areas that come from CISM (I want to give this glacier issue a little more thought).

Update (2022-04-25): see https://github.com/ESCOMP/CTSM/issues/1716#issuecomment-1108959121 for thoughts on the dynamic glacier areas from CISM. The summary is that I think these are already expressed as a fraction of the land-owned area, so they do NOT need to be reweighted – i.e., this coupling can stay as it currently is.
Update (2022-04-29): My initial thought here is that, in initialization, we would compute a scaling factor for each grid cell based on a combination of landfrac and the wetland area on the surface dataset. We would store a gridcell-level array with those scaling factors. Then, when reading landunit areas from the landuse timeseries file, we would apply the same saved scaling factors.

For example, in scenario (A), the surface dataset would specify 98% wetland, 1% glacier and 1% natural vegetation. The runtime land fraction is 2%. Wetland area would be preferentially decreased – to exactly 0% in this case – and then the remaining areas would be renormalized so that the resulting grid cell has 50% glacier and 50% natural vegetation over the small land area – as desired.

In scenario (C) we again have 98% wetland, 1% glacier and 1% natural vegetation on the surface dataset, but now the runtime land fraction is 100%. There would be no decrease of wetland area in this scenario, and the resulting grid cell would remain with 98% wetland, 1% glacier and 1% natural vegetation – again, as desired.

Also consider an intermediate scenario with the same specification on the surface dataset but where the runtime land fraction is 50%. Wetland area would be decreased to 48%, then the remaining landunit areas would be renormalized, resulting in a grid cell where the land-owned areas are covered by 96% wetland, 2% glacier and 2% natural vegetation. (So, in terms of the total grid cell area, we would have 50% ocean, 0.5*96% = 48% wetland 0.5*2% = 1% glacier, and 0.5*2% = 1% natural vegetation – once again, as desired.)

Finally, consider a scenario where the runtime land fraction is greater than the wetland area: again, the surface dataset specifies 98% wetland, 1% glacier and 1% natural vegetation, but now the runtime land fraction is 1%. This would work out the same as the first example: wetland area would be completely decreased, and other landunit areas would be renormalized. No explicit changes are needed to account for the land fraction being smaller than the specified wetland area.

(4) Update (2022-05-03): I realized that a fourth piece is needed, too: Change the remapping of PCT landunit areas in mksurfdata to NOT normalize by the mask. See https://github.com/escomp/ctsm/issues/1716#issuecomment-1116375444 for details.

Final thoughts / summary about how this changes the treatment of wetland areas

In the past, we have sort of had a rule that wetlands covered either 0% or 100% of a grid cell. Before the fix for https://github.com/ESCOMP/CTSM/issues/545 the exception was lakes, and after that fix the exception was lake and glacier areas. But I am coming to see that – in addition to the inconsistency in how different landunits are treated with respect to determining wetland area – this rule makes it hard/impossible to treat other aspects of landunit areas correctly.

I am therefore proposing a more rigorous determination of wetland areas, both in mksurfdata_map and at runtime, which I believe will allow us to have a more correct and more conceptually consistent treatment of landunit areas in coastal grid cells.

I would welcome critical feedback here. Every time I have thought about this wetland issue over the last few years, I have come to different conclusions about the appropriate solution. That said, I feel like I have given the issue more careful thought this time and am feeling pretty good about my suggested path forward... but this still definitely deserves some critical thought from others to make sure it makes sense to other people as well, because I could easily be making some logical errors here.

Care will need to be taken here to ensure that this same reweighting is also applied to transient land cover – including possibly the dynamic glacier areas that come from CISM (I want to give this glacier issue a little more thought).

I have gotten my head partly back into the CISM coupling, particularly some of the notes in https://docs.google.com/document/d/144PZ63gwpz875U4EYEI7NuxnF__V955QJOxV9LcEAL0/edit# as well as the code implementing the remapping of glacier cover from CISM to CTSM. The mapping of ice-covered fraction from CISM to CTSM is normalized by CISM's land fraction, implying that this ice-covered fraction is meant to represent the fraction of land area that should be covered by glaciers. This, in turn, suggests to me that we should not reweight the dynamic glacier areas from CISM in the same way as I suggest for areas coming from the surface dataset and landuse timeseries files.

The bad news here is that it means there would be an inconsistency in the treatment of areas coming from CISM vs. those coming from CTSM's input files: ice-covered area from CISM is given as fraction of land-owned area, whereas areas from the input files will be given as fraction of the entire grid cell. In addition, realizing this inconsistency makes me think that it's possible that we could be treating CISM's remapped ice-covered fraction in a different way that more accurately maintains the correct glacier cover when CISM's land mask differs from CTSM's. Finally, I'm not entirely sure here how to interpret the wetland-covered portion of a CTSM grid cell: should this be considered part of the land-owned area or not for the sake of this CISM-CTSM mapping? (This question is further complicated by the fact that CISM can sometimes intentionally dictate that there should be ice sheet covering the ocean / wetland portion of the grid cell – for floating ice shelves.)

The good news, though, is that I feel like it's probably reasonable to keep the CISM-CTSM remappings as they currently are even if we want to do renormalization of other areas: Because the ice-covered area from CISM is already in terms of the fraction of land-owned area, the fact that we are doing renormalization of the other areas (which are in terms of fraction of the total grid cell) does NOT imply that we should also be renormalizing the areas from CISM. The main reason I see this as good news is that @whlipscomb and I spent a long time working out how to treat various renormalizations to achieve conservation in the CISM-CTSM coupling, and figuring out how to revise this could take substantial time. In addition, trying to have CISM send ice-covered fraction in terms of total grid cell area and then renormalizing these fractions as is planned for the other areas could lead to inconsistencies: For example, if CISM says that 70% of the grid cell is ice-covered but CTSM's land fraction is only 50%, then we'd end up saying that 140% of CTSM's portion of the grid cell should be glacier. (We could, of course, reduce this to 100% – and that is effectively what we'll be doing when we renormalize fractions on the surface dataset. But I'm a little concerned that this potential inconsistency could make it harder to ensure that we conserve mass in the CISM-CTSM coupling.)

Update (2022-11-22): As noted in recent comments below, I am reworking things so that the areas on the surface dataset and landuse timeseries file remain as % of the land area. So there actually will be consistency between the areas on the surface dataset and those from CISM in this respect.

Some updates on this issue from conversations over the last week:

The new percent lake data set from Inne Vanderkelen specifies lake as percent of the grid cell area. This is also what we decided for the new percent glacier data set (#1406). Percent crop needs some more significant rework (see #1556). So I think that, for now, point (1) of the suggested path forward just entails changing the percent urban data set to be percent of the grid cell area rather than percent of the land area.

@mvertens and I looked at the code that maps the PCT areas from source to destination. This is currently done with FRACAREA normalization, which I believe means that the field on the destination (model) grid is normalized by the fraction of overlapping source areas that are unmasked. This normalization would be appropriate if we wanted the final PCT areas to be specified as PCT of the land area, and is also appropriate for most / all parameters on the surface dataset (i.e., variables other than landunit areas on the surface dataset), but it is not appropriate for final PCT areas specified as PCT of the total grid cell area. So this leads to another thing to do:

(4) Change the remapping of PCT landunit areas in mksurfdata to NOT normalize by the mask. Probably the best way to do this is to have an argument to the subroutine that creates the route handle that says whether or not to do FRACAREA normalization. For the PCT fields, we will not do FRACAREA normalization.

I have also given more thought to point (2) – how to set the wetland – i.e., "ocean" area. When I talked about this last week with @olyson @lawrencepj1 and others, the idea came up that maybe there would be a better way to determine the wetland / ocean area, using the residual after accounting for other land unit areas. But I actually think that the way I suggested above – based on pctlnd_pft – makes the most sense. The way I'm thinking about this is that we want to determine PCTOCEAN, and pctlnd_pftcan be thought of as a data set of the complement this desired PCTOCEAN. With this conceptualization in mind, 100._r8 - pctlnd_pft(n) gives the mapped PCTOCEAN in each grid cell. We could think of that as being similar to other special landunits. However, to correctly handle situations like floating ice shelves, we don't want to give PCTOCEAN the same priority as other special landunits, because there could be areas where pctlnd_pft sees ocean but one of the special landunits (most likely glacier or lake) sees coverage by itself, and we want to give priority to those other special landunits in this situation. The formulation in the original comment effectively does this:

pctwet(n) = min(100._r8 - pctlnd_pft(n), 100._r8 - pctgla(n) - pctlak(n) - pcturb(n))
pctwet(n) = max(pctwet(n), 0._r8)

– this is equivalent to setting pctwet = 100._r8 - pctlnd_pft, but then checking if pctwet + pctgla + pctlak + pcturb exceeds 100, and if so, first reducing pctwet to try to keep the total no more than 100 – i.e., giving priority to the other special landunits. (I'm not sure if pctcrop should also be included here, but I think it's okay for it not to be.)

As noted in my update to the original comment, in the general case where we allow inland wetlands, we will want to add this PCTOCEAN to any existing inland wetland.

From discussion with @ekluzek - he says that, in discussions with @lawrencepj1 , it sounded like the current pft landmask is unreliable, specifying too much land. To go with the fixes suggested above, we would want a more reliable land mask.

We have been discussing possibly doing away with wetlands and replacing them with bare ground. The original plan for that entailed changing mksurfdata_map in ways that would invalidate many of the thoughts in this issue. However, after giving this some more thought and reviewing the thoughts in this issue, I think it will be best to maintain wetlands (or something equivalent) on the surface dataset, with the changes laid out above, and instead convert these areas to natural vegetation at runtime (if desired). The advantages of this approach are:

It avoids the need to recreate surface datasets
It lets us control this behavior at runtime
It will let us treat landunit areas more rigorously in grid cells with landfrac < 1 (as discussed in this issue)

From discussion with @ekluzek - he says that, in discussions with @lawrencepj1 , it sounded like the current pft landmask is unreliable, specifying too much land. To go with the fixes suggested above, we would want a more reliable land mask.

For reference, the discussion with @lawrencepj1 and others appears in #1703

Following on from @slevisconsulting the problem is not that the land mask is not reliable. It is that the land mask is at the resolution of the data and can only be either 0 or 1, where 1 has any amount of land in it and 0 if it is completely free of land. The land fraction provides the amount of land down from the MODIS 1km land cover data.

Returning to the question of how the pft data's land mask is defined: @lawrencepj1 said:

the land mask is at the resolution of the data and can only be either 0 or 1, where 1 has any amount of land in it and 0 if it is completely free of land. The land fraction provides the amount of land down from the MODIS 1km land cover data.

Am I understanding this correctly that the original 1km raw data classifies each 1km cell into land vs. ocean, but then on the somewhat coarser raw datasets used as input to mksurfdata, thisis aggregated so that any cell that has at least one 1 km land cell gets a value of 1?

If so, I think it would be ideal to change this long-term so that we have a field on these raw datasets that gives the true fractional land / ocean cover – e.g., if you have a raw data grid cell composed of 25 1-km cells, 10 of which are land and 15 of which are ocean, then this raw data grid cell would have a landfrac of 0.4 (as opposed to the current situation, where it is given a value of 1).

This would make the dataset consistent with my original comment:

I am assuming that pctlnd_pft is formed by mapping a field that specifies the land fraction – not just land mask – in each source grid cell. That is, I'm assuming that the source data for this field can be any value between 0 and 1 on the raw data grid. I'm not sure whether that is currently the case.

However, this is probably a second-order issue, so doesn't necessarily need to be changed for ctsm5.2 if changing it would be a pain.

Hi Bill

Yes you are 100% correct that the 1km data is from a climatology of MODIS land cover data which is either water body or another land cover class. The landfrac field does have this information included in it for any resolution. Agree this is not a high priority but the landmask variable is redundant given that landfrac > 0.0 is the same as landmask = 1.0.

Best Peter -- Dr Peter Lawrence Terrestrial Science Section National Center for Atmospheric Research 1850 Table Mesa Drive Boulder Colorado 80305

Work: 1-303-497-1727 Cell: 1-303-956-6932

On Thu, Nov 3, 2022 at 5:26 PM Bill Sacks @.***> wrote:

Returning to the question of how the pft data's land mask is defined: @lawrencepj1 https://github.com/lawrencepj1 said:

the land mask is at the resolution of the data and can only be either 0 or 1, where 1 has any amount of land in it and 0 if it is completely free of land. The land fraction provides the amount of land down from the MODIS 1km land cover data.

Am I understanding this correctly that the original 1km raw data classifies each 1km cell into land vs. ocean, but then on the somewhat coarser raw datasets used as input to mksurfdata, thisis aggregated so that any cell that has at least one 1 km land cell gets a value of 1?

If so, I think it would be ideal to change this long-term so that we have a field on these raw datasets that gives the true fractional land / ocean cover – e.g., if you have a raw data grid cell composed of 25 1-km cells, 10 of which are land and 15 of which are ocean, then this raw data grid cell would have a landfrac of 0.4 (as opposed to the current situation, where it is given a value of 1).

This would make the dataset consistent with my original comment:

I am assuming that pctlnd_pft is formed by mapping a field that specifies the land fraction – not just land mask – in each source grid cell. That is, I'm assuming that the source data for this field can be any value between 0 and 1 on the raw data grid. I'm not sure whether that is currently the case.

However, this is probably a second-order issue, so doesn't necessarily need to be changed for ctsm5.2 if changing it would be a pain.

— Reply to this email directly, view it on GitHub https://github.com/ESCOMP/CTSM/issues/1716#issuecomment-1302786402, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC3OJOKYYF6ZJPCSJAJNPSLWGRCZ7ANCNFSM5UCM6ZLQ . You are receiving this because you were mentioned.Message ID: @.***>

The landfrac field does have this information included in it for any resolution.

Ah, thanks for clarifying this. I had missed the fact that we already have this field on the pft raw dataset. Given this, I can move forward with a change to use landfrac rather than landmask in the relevant places. Currently, this is going to lead to a small inconsistency due to the issue I noted here https://github.com/ESCOMP/CTSM/issues/1556#issuecomment-1304320939 which I hope can be addressed for ctsm5.2.

@slevisconsulting @ekluzek @lawrencepj1 - I'm working on a change to mksurfdata_esmf so that we use LANDFRAC on the raw PFT files rather than the current use of the mask on those files. One question I have is: When we're making a landuse timeseries file in addition to the surface dataset, is it worthwhile to confirm that LANDFRAC is the same on all of the input files, or is this an unnecessary, overly paranoid check? I ask because I see this commented-out block of code:

https://github.com/billsacks/ctsm/blob/dc8cb25b930577d97f6559cbca590067fb63cd53/tools/mksurfdata_esmf/src/mksurfdata.F90#L881-L900

With the new logic I'm thinking of, I could re-enable this consistency check – or I can just skip it and avoid remapping LANDFRAC every year if this feels overly paranoid. Thoughts?

After giving more thought to the formulation for pctwet – i.e., pct_ocean – I think I am going to include pct_crop so that rather than my original:

pctwet(n) = min(100._r8 - pctlnd_pft(n), 100._r8 - pctgla(n) - pctlak(n) - pcturb(n))

we'll have

pctwet(n) = min(100._r8 - pctlnd_pft(n), 100._r8 - pctgla(n) - pctlak(n) - pcturb(n) - pctcft(n)%get_pct_l2g())

My rationale is given in this comment that I am adding to the code:

         ! Note that we include pct_crop in the following, but NOT pct_natveg. The
         ! assumption behind that is that pct_crop is more reliable and/or more important,
         ! and should not be overwritten, whereas pct_natveg can be overwritten with a
         ! special landunit. For example, consider a case where the PFT dataset specifies
         ! 40% land, with 20% crop and 10% natveg (so, implicitly, 10% special landunits).
         ! If the only special landunit is glacier and it has 15% cover, then we want to
         ! end up with the glacier overwriting natural veg, so we end up with 20% crop, 5%
         ! natveg, 15% glacier and 60% ocean. However, if glacier has 30% cover, then we
         ! will assume that some of that glacier extends over the ocean rather than
         ! overwriting crop, so we end up with 20% crop, 30% glacier, 50% ocean and 0%
         ! natveg.

It seems like arguments can be made either way, but I feel like this will be a reasonable compromise.

[...] One question I have is: When we're making a landuse timeseries file in addition to the surface dataset, is it worthwhile to confirm that LANDFRAC is the same on all of the input files, or is this an unnecessary, overly paranoid check?

It would NOT seem overly paranoid to me if a user pointed to raw data from any number of different sources, e.g. 850-1849 from one source, 1850-2014 from another, 2015-2100 from another, and 2100-2300 from another. If we think this could happen, then we should probably reenable the check. Does that sound reasonable?

Sounds good. I went ahead and re-enabled the check yesterday, so I'll keep it in place.

I had a useful meeting with @ekluzek and @slevisconsulting this morning. I presented a revised plan for how to handle this, which I'll try to summarize here.

My previous plan (laid out above) involved specifying landunit areas on the surface dataset as percent of the gridcell, and then doing a correction of landunit areas at runtime based on the actual runtime landmask (derived from the ocean grid). However, as I thought more about how this would be implemented, it started to feel overly complex and challenging to get right in a transient run (with changes in areas of different landunits). And, with our plan of converting wetland areas to bare ground moving forward, it felt like the added complexity wasn't going to buy us much (and might actually make things worse).

So my new plan is: We'll stick with having raw datasets specify their areas as percent of the gridcell (because this helps fix some issues), but we'll think of the final landunit areas on the surface dataset as being percent of the land area, rather than percent of the total gridcell area. I believe that is consistent with the current (implicit) formulation of the surface dataset. The main thing we lose by doing it this way is the ability to have a gridcell that is represented at runtime as being partially covered by wetland and partially covered by land, in situations where the runtime landmask disagrees with the raw data landmask. With my new plan (similar to the current state of affairs, though not exactly the same) we'll only have wetland area in the final surface dataset if the raw datasets consider a grid cell to be 0% land.

To achieve this plan, we'll still calculate a pct_ocean as laid out above. However, we'll just use that pct_ocean to determine how much of the grid cell should be filled with natveg, then rescaling landunit areas to sum to 100% of the land-covered area after that point. If we think of the current surface dataset as specifying landunit areas as percent of the land-covered area, then the current operation of mksurfdata essentially assumes that, anywhere where pctlnd_pft > 0, the entire grid cell is land: any part of the gridcell not claimed by special landunits in this situation is filled with natveg. In contrast, with the new plan, we'll only fill the grid cell with natveg up to the land portion of the grid cell, and then rescale all landunits to sum to 100% after that. This will avoid the issues we currently have where, for example, if the raw datasets specify 1% natveg, 1% glacier, and 2% pctlnd_pft, we'll end up with a surface dataset containing 99% natveg and 1% glacier. With the new plan, the surface dataset will specify 50% natveg and 50% glacier in this scenario.

Update (2022-11-22): I realized that this supposed issue actually wasn't as much of an issue as I thought it was: Since we had originally been using FRACAREA normalization for these pct landunit fields, I think we would have ended up with 50% natveg and 50% glacier in the above scenario, assuming the raw datasets agreed on the land mask.

For the calculation of pct_ocean, arguments can be made either way as to whether we should include pct_crop and/or pct_natveg in the min expression for pct_ocean. I am going to stick with including pct_crop but not including pct_natveg. In essence, this is saying that we'll let special landunit area grow into the natveg area before growing into ocean, but we'll have special landunit area grow into ocean before growing into crop or any other special landunit area. But the real reason I'm making this choice is a more pragmatic one: When constructing the transient datasets, there is a potential for pct_ocean to change from year to year due to its dependence on the areas of various landunits. But a change in pct_ocean would force a rescaling of all landunits, which could lead to apparent area changes in landunits whose areas should really stay fixed in time. I want to minimize how often this occurs. In an ideal situation where all raw datasets agree with each other, the sum of all landunit areas should equal pctlnd_pft. This means that, in a more realistic scenario where datasets disagree, it is probably about as likely that we take pct_ocean from pctlnd_pft vs. from the sum of the landunit areas. Taking pct_ocean from the sum of the landunit areas can give rise to a changing pct_ocean in time, so we'd like to minimize how often that happens. If we exclude pct_natveg (which will typically be the largest landunit) from this sum, we should be taking pct_ocean from pctlnd_pft much more often than from the sum of the landunits, and so we should reduce the frequency of a changing pct_ocean. The downside is that we'll more often end up decreasing pct_natveg when maybe we should really be growing a special landunit into the ocean, but that feels like a reasonable price to pay for having a more stable pct_ocean over time.

@ekluzek and @slevisconsulting - we had discussed the possibility of writing out some new fields giving the two notions of pctlnd (or pct_ocean). However, my current thinking is this... let me know if you disagree: It seems like the main reason why it's helpful to know this is to know how often our notion of pctlnd changes on a transient dataset. But I realized that we can already get at this question more directly via looking at the existing LANDFRAC_PFT variable on a transient dataset. So I propose to NOT write out any new fields, but instead to make sure that the LANDFRAC_PFT variable gives the actual notion of landfrac in a given year. We can then look at whether this variable changes in time in a given grid cell (e.g., looking at the difference between the last and first years on a landuse timeseries file).

So I propose to NOT write out any new fields, but instead to make sure that the LANDFRAC_PFT variable gives the actual notion of landfrac in a given year. We can then look at whether this variable changes in time in a given grid cell (e.g., looking at the difference between the last and first years on a landuse timeseries file).

To make sure I understand, do you mean checking those things in the code rather than adding new fields to the fsurdat file? I'm fine with that, and I was thinking that we could direct reporting on this topic to the fsurdat.log file.

I meant we could manually check one or more landuse timeseries files after they're created (e.g., by doing some quick analysis in python).

Actually, upon further reflection, I think I'll write two things:

I'll keep the current LANDFRAC_PFT derived directly from LANDFRAC on the vegetation raw data file. As is currently done, I'll write this to both the surface dataset and the landuse timeseries file, but the latter not time-varying (since this field shouldn't change in time).
I'll add a new field giving the actual landfrac used in the mksurfdata mapping. I'll write this both to the surface dataset and the landuse timeseries file, the latter as a time-varying field. I'm thinking of calling this LANDFRAC_MKSURFDATA, but let me know if you have thoughts on a better name.

Thanks for this discussion and consideration. Will there be changes to how users upscale states and fluxes in analyzing results? Do we want to maintain backwards compatibility, or change things for accuracy with new surface datasets?

Will there be changes to how users upscale states and fluxes in analyzing results?

Good question. No. In fact, your comment encouraged me to step back and take a big picture view of where things stand now with the latest plan, and I came to the somewhat depressing conclusion that the new plan involves only some relatively minor tweaks rather than any major changes: Now that I have ditched the idea of having fractional wetland area, I realized that I'm kind of doing two things that cancel out: (1) changing FRACAREA normalization to DSTAREA normalization for the mapping of landunit areas; and (2) calculating a pct_land that I use to normalize all landunit areas. Before these changes, the pct landunit areas were essentially coming out of their individual mappings as percent of the land area, according to that raw data's land mask. After these changes, the pct landunit areas come out as percent of the grid cell, and then we normalize these to percent of the land area after reconciling all of the landunit areas. Although I haven't thought through this carefully, I believe that, if all of the datasets were consistent in their notion of landmask, the two approaches would give similar (or maybe even identical) results. So in retrospect, given where we're ending up, this probably wasn't worth the effort (sigh). In particular, I realized that my problem (A) from the original comment was wrong given that we were doing FRACAREA mapping, and in fact that was already working essentially orrectly. However, given that I am almost done with these changes, I feel like I might as well wrap them up.

I do feel like there are at least some minor improvements from the new approach:

The new approach seems less sensitive to discrepancies in the different raw datasets' notions of land mask.
The new approach treats wetland areas more consistently, even though not in the way I initially envisioned: now wetlands will only be put in grid cells that are not claimed by any landunit. (With the new runtime setting I implemented recently, these can still be converted to bare ground at runtime.)
I think that the new approach provides better pct landunit fields for the sake of weighting other fields (e.g., when using pct_natveg as a multiplication / normalization factor when mapping pct_nat_pft... though I haven't totally gotten my head back into that)
I feel like the new approach makes things a little more explicit and easier to reason about, which will hopefully lead this code and its behavior to be easier and less error-prone to change in the future.

I will take this opportunity to point out, though, that things will neither be better nor worse as far as #1297 is concerned.

Update (2022-11-27): I think my changes actually have more positive benefit than I thought above: see https://github.com/ESCOMP/CTSM/issues/1716#issuecomment-1328376515

Actually, upon further reflection and looking more carefully at some before-after diffs, I realized that – I think – the full set of changes does lead to more changes than I was thinking in https://github.com/ESCOMP/CTSM/issues/1716#issuecomment-1323814217 . In particular, I think that my thinking in that comment assumed that raw datasets were specified as percent of the land area rather than percent of the grid cell. As described in the original issue comment, there are problems with that specification (especially problem (B)), so there are good reasons to switch to a specification of raw datasets as percent of the grid cell (and some of the raw datasets were already being specified that way). And, given that the raw datasets are specified as percent of the grid cell, then there truly were problems with the original approach: it was NOT true that the percent covers coming out of the mapping were percent of the land area; in fact they were some weird hybrid of percent of the land area and percent of the grid cell. So I believe that the net set of changes I'm making truly leads to more consistency and correctness.

ESCOMP / CTSM