Why is `imputeBadAgeModel` fitted using bad age data subset?

CeresBarros commented 4 years ago

In LandR::makeAndCleanInitialCohortData, used in Biomass_borealDataPrep, why is the model to impute bad ages being fit with the data subset that has the bad ages, instead of the data subset that has good ages?

outAge <- Cache(statsModel, modelFn = imputeBadAgeModel,
                    uniqueEcoregionGroups = .sortDotsUnderscoreFirst(as.character(unique(cohortDataMissingAgeUnique$initialEcoregionCode))),
                    .specialData = cohortDataMissingAgeUnique,
                    omitArgs = ".specialData")

CeresBarros commented 4 years ago

In addition, the metadata for P(sim)$imputeBadAgeModel states: "Model and formula used for imputing ages that are either missing or do not match well with Biomass or Cover. Specifically, if Biomass or Cover is 0, but age is not, then age will be imputed. Similarly, if Age is 0 and either Biomass or Cover is not, then age will be imputed."

However, the subsetting of "bad age" data in LandR::makeAndCleanInitialCohortData uses:

cohortDataMissingAge <- cohortData[, hasBadAge :=
                                       #(age == 0 & cover > 0)#| # ok because cover can be >0 with biomass = 0
                                       (age > 0 & cover == 0) |
                                       is.na(age) #|
                                       #(B > 0 & age == 0) |
                                       #(B == 0 & age > 0)
  ][hasBadAge == TRUE]#, by = "pixelIndex"]

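So it seems to me that "Similarly, if Age is 0 and either Biomass or Cover is not, then age will be imputed" is not accurate: the (age == 0 & cover > 0) and (B > 0 & age == 0) conditions are commented out, so only (age > 0 & cover == 0) and is.na(age) flag an age as bad.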
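To make the mismatch concrete, here is a minimal sketch that applies only the active (uncommented) part of the condition to a toy table. The column values and the `case` labels are made up for illustration; only the filter expression comes from the snippet above.

```r
library(data.table)

# Toy data (hypothetical values), one row per case discussed in the metadata
cohortData <- data.table(
  case  = c("age > 0, cover == 0", "age is NA",
            "age == 0, cover > 0", "age > 0, cover > 0"),
  age   = c(40L, NA, 0L, 60L),
  cover = c(0, 30, 25, 50)
)

# Only the terms that are not commented out in makeAndCleanInitialCohortData
cohortData[, hasBadAge := (age > 0 & cover == 0) | is.na(age)]

# Cases that would be sent to imputation
flagged <- cohortData[hasBadAge == TRUE, case]
print(flagged)
```

With this toy table, the "age == 0, cover > 0" row is not flagged, which is the case the metadata sentence claims would be imputed.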

achubaty commented 2 years ago

@CeresBarros has this been resolved with all the various changes over the last few months?

CeresBarros commented 2 years ago

No :/

CeresBarros commented 1 year ago

P(sim)$imputeBadAgeModel now agrees with the code in LandR::makeAndCleanInitialCohortData: Model and formula used for imputing ages that are either missing or do not match well with biomass or cover. Specifically, if biomass or cover is 0, but age is not, or if age is missing (NA), then age will be imputed. Note that age is zeroed where total biomass is 0 in LandR:::.createCohortData, which is run before makeAndCleanInitialCohortData.

However, I'm still puzzled by the age data that is used to fit the model.

CeresBarros commented 1 year ago

Digging deeper: at some point before fitting the model, the cohortDataMissingAgeUnique object is stripped of all data except the unique combos of "initialEcoregionCode" and "speciesCode":

cohortDataMissingAgeUnique <- unique(
  cohortDataMissingAge,
  by = c("initialEcoregionCode", "speciesCode")
)[, .(initialEcoregionCode, speciesCode)]

After this, the data is added back to these combos, from the original cohortData:

cohortDataMissingAgeUnique <- cohortDataMissingAgeUnique[
  cohortData,
  on = c("initialEcoregionCode", "speciesCode"), nomatch = 0
]
cohortDataMissingAgeUnique <- cohortDataMissingAgeUnique[!is.na(cohortDataMissingAgeUnique$age)]

However, since "bad" age rows were not removed from cohortData, they are added back by this join (except for NA ages, which are excluded in the last line above). So it seems to me that bad ages of (age > 0 & cover == 0) are being used to fit the model that will later impute/overwrite these same ages. @eliotmcintire since you wrote this, I guess you're the best person to ask: is there a reason why this is being done like this? Were there maybe not enough data points per "initialEcoregionCode", "speciesCode" combo if the bad ages were excluded for fitting?
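A small reproduction of the steps described above, as a sketch only: the toy values and species codes are hypothetical, and `combos`, `fitData` and `fitDataGoodOnly` are illustrative names that do not exist in LandR. It relies on the fact that `:=` adds `hasBadAge` to cohortData by reference, so the flag is still available after the join.

```r
library(data.table)

# Toy cohort data (hypothetical values and species codes)
cohortData <- data.table(
  pixelIndex = 1:5,
  initialEcoregionCode = c("e1", "e1", "e1", "e2", "e2"),
  speciesCode = c("Pice_mar", "Pice_mar", "Pice_mar", "Popu_tre", "Popu_tre"),
  age   = c(40L, 80L, NA, 55L, 70L),
  cover = c(0, 60, 30, 45, 50)
)

# Flag bad ages with the active condition; `:=` also modifies cohortData
cohortDataMissingAge <- cohortData[
  , hasBadAge := (age > 0 & cover == 0) | is.na(age)
][hasBadAge == TRUE]

# Keep only the unique ecoregion x species combos that have bad ages
combos <- unique(
  cohortDataMissingAge,
  by = c("initialEcoregionCode", "speciesCode")
)[, .(initialEcoregionCode, speciesCode)]

# Join the full cohortData back onto those combos, then drop NA ages
fitData <- combos[
  cohortData,
  on = c("initialEcoregionCode", "speciesCode"), nomatch = 0
]
fitData <- fitData[!is.na(age)]

# Rows with (age > 0 & cover == 0) are still in the fitting data
print(fitData[, .(pixelIndex, age, cover, hasBadAge)])

# One possible alternative (an assumption, not the package's behaviour):
# fit only on rows that were not flagged, and check what is left per combo
fitDataGoodOnly <- fitData[hasBadAge == FALSE]
print(fitDataGoodOnly[, .N, by = .(initialEcoregionCode, speciesCode)])
```

In this toy case, pixel 1 (age 40, cover 0) ends up in `fitData` even though it is one of the rows to be imputed. Dropping flagged rows leaves a single observation for that combo, which is the kind of data shortage asked about above.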

eliotmcintire commented 1 year ago

I don't recall, I am sorry. I should have written more comments. I am better now...


CeresBarros commented 1 year ago

No worries. We'll have to revisit it soon then and make a decision (with comments ;) ).

PredictiveEcology / Biomass_borealDataPrep

Why is `imputeBadAgeModel` fitted using bad age data subset? #48