iesm data processing match() error

sbassett commented 2 years ago

```
start raster processing year 2010 ; l 6 Forest ; v 2 Soil Wed Jul 28 13:48:39 2021 
Error in scalar_out[v, dl, totyind, lc_inds] <- lc_vals : 
  NAs are not allowed in subscripted assignments
```

NAs occur only in lc_inds:
```
> sum(is.na(lc_vals))
[1] 0
> sum(is.na(v))
[1] 0
> sum(is.na(dl))
[1] 0
> sum(is.na(totyind))
[1] 0
> sum(is.na(lc_inds))
[1] 47
```

NAs originate with lc_inds = match(lc_ids, caland_df$Land_Cat_ID)
```
> caland_df$Land_Cat_ID
[1] 800101002
> lc_ids
800103040
```
Exported both to .csv to compare the two sets of Land_Cat_IDs manually in Excel.
- It looks like there are matches for all values.
After confirming that the values exist in both tables, reran match() with explicit values for nomatch and incomparables:
```
> lc_inds = match(lc_ids, caland_df$Land_Cat_ID, nomatch = 9999, incomparables = NULL)
> lc_inds
[1] 9999
```
It's odd that values known to have matches are returning the nomatch value.
Is it possible that the two vectors need to have the same data type?
- Tested unname() on lc_ids, but it didn't make the matches work:
```
> lc_inds = match(unname(lc_ids), caland_df$Land_Cat_ID, nomatch = 9999, incomparables = NULL)
> lc_inds
[1] 9999
```
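
Instead of comparing the two exports by eye, the unmatched IDs can be listed directly in R. This is a minimal sketch: `ref_ids` and `query_ids` are made-up stand-ins for caland_df$Land_Cat_ID and lc_ids, not the project's data. It also shows that names on a vector do not affect match().

```r
# Hypothetical stand-ins for caland_df$Land_Cat_ID and lc_ids
ref_ids   <- c(800101002, 800103040, 800201002)
query_ids <- c("800103040" = 800103040, "800304000" = 800304000)

# match() compares values only; the names on query_ids are ignored
inds <- match(query_ids, ref_ids)

# IDs that have no counterpart in the reference vector
unmatched <- unname(query_ids[is.na(inds)])
print(unmatched)

# The same answer via set difference
stopifnot(identical(unmatched, setdiff(unname(query_ids), ref_ids)))
```

If `unmatched` comes back non-empty, those values really are missing from the reference vector, whatever a manual review suggested.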

sbassett commented 2 years ago

@aj1s @tchapman100 I've hit a wall on this. It's in the nwland_dev_withWLIC branch.

sbassett commented 2 years ago

Here are the CSVs (converted to .xlsx because GitHub won't accept CSV attachments in comments) produced for lc_ids and caland_df$Land_Cat_ID: Test_Land_Cat_ID.xlsx, Test_lc_ids.xlsx

sbassett commented 2 years ago

Encountered this error again today with slightly different code and different inputs, in the WLIC branch: GitHub\NWLAND\preproc\NWLAND_proc_iesm_climate_v2.r

Error thrown at line 747:

```
> scalar_out[v, dl, totyind, lc_inds] = lc_vals
Error in scalar_out[v, dl, totyind, lc_inds] <- lc_vals : 
  NAs are not allowed in subscripted assignments
```
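
A possible guard for this spot is to check the match result before the subscripted assignment and stop with a message that names the offending IDs. This is only a sketch: `match_or_stop` is a hypothetical helper that does not exist in the repo, and the values are made up.

```r
# Hypothetical helper: fail early with a readable message when any ID is unmatched
match_or_stop <- function(ids, table_ids) {
  inds <- match(ids, table_ids)
  if (anyNA(inds)) {
    # Print the IDs as plain digits rather than scientific notation
    missing_ids <- format(unname(ids[is.na(inds)]), scientific = FALSE, trim = TRUE)
    stop(sum(is.na(inds)), " ID(s) not found in lookup table, e.g. ",
         paste(head(missing_ids, 5), collapse = ", "))
  }
  inds
}

# Usage with made-up values: the first call succeeds, the second reports the missing ID
ok  <- match_or_stop(c(800103040), c(800101002, 800103040))
err <- tryCatch(match_or_stop(c(800304000), c(800101002, 800103040)),
                error = function(e) conditionMessage(e))
print(err)
```

This does not fix a mismatch, but it replaces the generic "NAs are not allowed" error with one that shows which IDs are missing.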

sbassett commented 2 years ago

Both lc_ids and caland_df$Land_Cat_ID are doubles. lc_ids is a "named number"; caland_df$Land_Cat_ID is just a plain numeric vector.

> str(caland_df$Land_Cat_ID)
 num [1:6293] 8.08e+08 8.08e+08 8.08e+08 8.08e+08 8.08e+08 ...
> str(lc_ids)
 Named num [1:59] 8.00e+08 8.01e+08 8.01e+08 8.02e+08 8.02e+08 ...
 - attr(*, "names")= chr [1:59] "800304000" "800704000" "801304000" "801504000" ...

maybe try this: https://stackoverflow.com/questions/15736719/how-do-i-extract-just-the-number-from-a-named-number-without-the-name

Even after unname(), matching still doesn't work:

> namless_lc_ids <- unname(lc_ids)
> str(namless_lc_ids)
 num [1:59] 8.00e+08 8.01e+08 8.01e+08 8.02e+08 8.02e+08 ...
> namless_lc_inds = match(namless_lc_ids, caland_df$Land_Cat_ID)
> str(namless_lc_inds)
 int [1:59] NA NA NA NA NA NA NA NA NA NA ...

sbassett commented 2 years ago

New hypothesis from @aj1s: the scientific-notation representation of the nums (e.g. 3.03e+08) is throwing match() off. Could try converting the nums to ints using as.integer(X).

This would work, except for R's ridiculous 32-bit integers, which top out at 2,147,483,647 (around 2 billion); some of these IDs are larger.

> int_lc_ids <- as.integer(lc_ids)
Warning message:
NAs introduced by coercion to integer range
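
For reference, scientific notation is only how R prints a double, not how the value is stored, so it should not affect match() on its own. A small sketch using one of the ID values that appears in this thread (the two-element lookup vector is made up):

```r
x <- 3503904000              # one of the ID values shown in this thread

# Printing in scientific notation does not change the stored value
stopifnot(x == 3.503904e9)

# Doubles represent whole numbers exactly up to 2^53, far beyond these IDs
stopifnot(x < 2^53, x == floor(x))

# R's integer type is 32-bit signed, so larger values are coerced to NA
int_max <- .Machine$integer.max          # 2147483647
too_big <- suppressWarnings(as.integer(x))
stopifnot(is.na(too_big), x > int_max)

# match() on doubles works directly, with no conversion needed
pos <- match(x, c(800101002, 3503904000))
print(pos)
```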

Will try to match on text strings.

> tail(char_lc_ids)
[1] "3503904000" "3504504064" "3504704000" "3505304064" "3505504000" "3506104064"
> char_matchtest_single <- match("800304000", char_thing)
> (char_matchtest_single)
[1] NA
> char_matchtest_single <- match("3503904000", char_thing)
> (char_matchtest_single)
[1] NA
> which(char_lc_ids %in% char_thing)
integer(0)
> char_lc_ids %in% char_thing
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[27] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[53] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

No dice!
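
One caveat when matching on text: as.character() on a double may produce scientific-notation strings for some values (for example "1e+05"), so both sides need the same plain-digit formatting. A minimal sketch, assuming sprintf("%.0f") is used on both vectors; the lookup values in `char_ref` are made up for illustration:

```r
ids <- c(800304000, 3503904000)       # values taken from the output above

# sprintf("%.0f") always writes plain digits, never scientific notation
char_ids <- sprintf("%.0f", ids)
print(char_ids)

# Hypothetical lookup vector, converted with the same formatter
char_ref <- sprintf("%.0f", c(800101002, 3503904000))

# Overlap between the two sets; character(0) would mean no common IDs
common <- intersect(char_ids, char_ref)
print(common)
```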

sbassett commented 2 years ago

Attempting to match a character representation of the thing to itself to see if it produces valid output.

> char_matchtest_self <- match(char_thing, char_thing)
> char_matchtest_self
   [1]    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   3

Self match works on both files. Now considering the simplest explanation (that there aren't actually any matches).

sbassett commented 2 years ago

There are indeed no matches.

@aj1s you were right to be skeptical; I shouldn't have trusted my prior review of a different dataset.

At least in char_lc_ids and char_thing, there are no matches, as tested by exporting CSVs and using Excel's Find to search char_thing for values from char_lc_ids.

It appears that land ownership codes are getting messed up in the lc_ids vector. The values in that vector end in either '000' or '064'. Neither code is valid (see https://github.com/TNC-NMFO/NWLAND/issues/83#issuecomment-930469450).
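
The suffix pattern can be checked in R by tabulating the last three digits of each ID. A quick sketch using a few of the values printed earlier in this thread (not the full lc_ids vector):

```r
# A few ID values shown earlier in this thread
ids <- c(3503904000, 3504504064, 3504704000, 3505304064)

# Last three digits of each ID, zero-padded to width 3
suffix <- sprintf("%03.0f", ids %% 1000)

# Count how often each suffix occurs
counts <- table(suffix)
print(counts)
```

Running the same tabulation on caland_df$Land_Cat_ID would show which suffixes the lookup table actually contains.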

sbassett commented 2 years ago

I'm curious whether it's a result of the 32-bit integer problem. I'll assign a two-digit code to each county/region (since there are only 98 of them) and reproduce the landcat grid.

sbassett commented 2 years ago

merge() error resolved with two-digit county/region codes. New error received, will open new issue.

```
end raster processing year 2010 ; l 9 Forest ; v 2 Soil Tue Oct 05 00:14:30 2021 
Error in `$<-.data.frame`(`*tmp*`, "Component", value = "Soil") : 
  replacement has 1 row, data has 0
```

TNC-NMFO / NWLAND

iesm data processing match() error #80