BiologicalRecordsCentre / BRCindicators

An R package for creating indicators from trends data
4 stars 11 forks source link

Sorting problem in msi_tool when years span orders of magnitudes #23

Closed MartinStjernman closed 5 years ago

MartinStjernman commented 5 years ago

Dear developers,

I've detected a problem in the function msi_tool() when supplying index data where year is given as e.g. 1:10 instead of say 2001:2010. The problem appears when INP is merged with INP1 (lines 51:52 in the code as far as I can see). While I don't fully understand the rationale, this step is apparently done to sort data (INP) in species order. The step merges INP (with species in original order) and INP1 (with species in alphabetical order) by species and year and with statement sort = TRUE. The problem is that merge() sort on the variables in the by-statement in LEXICOGRAPHICAL order and if one has years f ex 1:10 the order becomes c(1, 10, 2, 3, 4, 5, 6, 7, 8, 9). This ordering is mainained throughout the script resulting in an output$results sorted in the same way. One fairly simple solution to allow years to be given as f.ex 1:10 (and still maintain this sorting step of data in species alphabetical order) would be to change the merging of INP and INP1 according to the following:

INP2 <- merge(INP1, INP, by = c("species", "year"), sort = FALSE, all = TRUE)

In that way INP2 would be sorted according to INP1 which I guess is what was intended. Of course, there are other ways to sort data in species order avoiding merge() and that would be less error prone (using order() for instance).

Admittedly, I don't know how common it is to supply years in this format (e.g. 1:10). However, I could not find any information in the documentation that this is not allowed and with noisy data (i.e. where index move up and down over years) the error would be difficult to see. I was fortunate to have data (simulated) where I knew index declined regularly each year and hence could detect the problem.

I hope this was of some help.

Sincerely, Martin

AugustT commented 5 years ago

@MartinStjernman This is very helpful indeed. The years 1:10 seems a reasonable input so should be supported. Would you like to put in a pull request to gain credit for this? Else I will implement in the coming weeks.

Out of interest, how did you find out about this package?

MartinStjernman commented 5 years ago

@AugustT I'm afraid I have no experience of pull requests and am therefore quite satisfied with you making the implementation. As I wrote earlier, I'm not sure my solution is the optimal one and am not even sure why the sorting (in species alphabetical order) is made in the first place since, after a quick read-through of the code, I cannot find any reason it would be necessary to change species order from what is supplied in the input. In fact, it seems that the alphabetical order is used in the meanCV-plot produced but species names are not added to the xlab of that plot so for people unaware of the species sorting (and who supply data in non-alphabetical order) it would be very difficult to associate meanCV values to particular species in the plot.

As for how I found the package: My collegues (who work on bird monitoring) told me they used a function called msi (as it turned out they used the msi_tool-script from Statistics Netherlands, CBS) and I just googled msi and finally found your GitHub repository and the package. After checking I realized that it was much simpler to use msi than the original CBS msi_tool-script.

AugustT commented 5 years ago

Tom's notes:

A little context - My work in BRCincators was to create a simple wrapper around some existing code created by Statistics Netherlands.

So this issue is present in the original code https://www.cbs.nl/en-gb/society/nature-and-environment/indices-and-trends--trim--/msi-tool - which means the issue is larger than the implementation here in BRCindicators. I will inform the original authors of this issue in case they want to amend their script. I have always tried to keep my implementation here as close to the original authors code as possible as I want the results to be the same, however I think I need to address this issue.

Looking at the original documentation there is also no mention of the need for years to be of the format YYYY, though the examples do use that format.

I agree it is VERY odd that we have INP INP1 and INP2. It seems the sole role of INP and INP1 is to create INP2 which would be much more easily and efficiently created with order.

I have tried to replicate your issue but it does not seem to occur for me

library(BRCindicators)

set.seed(123)

# Create dummy data 
species = rep(letters, each = 50)
year = rev(rep(1:50, length(letters)))
index = runif(n = 50 * length(letters), min = 1, max = 10)
se = runif(n = 50 * length(letters), min = 0.01, max = .8)

data <- data.frame(species, year, index, se)

# Run the MSI function
msi_out <- msi(data, plot = FALSE)

head(msi_out$results)
  year    MSI sd_MSI lower_CL_MSI upper_CL_MSI  Trend
1    1 114.74   0.00       114.74       114.74 100.00
2    2  87.73   3.90        80.41        95.71  99.10
3    3  91.40   3.64        84.54        98.83  98.28
4    4 109.38   3.80       102.18       117.08  97.54
5    5  89.49   3.72        82.51        97.08  96.87
6    6  92.16   3.48        85.59        99.23  96.27
  lower_CL_trend upper_CL_trend trend_class
1          96.30         103.28      stable
2          96.30         102.26      stable
3          95.34         101.24      stable
4          95.34         100.23      stable
5          94.39          99.23      stable
6          94.39          98.25      stable

note that there is not an issue with the sorting of years

msi_out$results$year
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50

@MartinStjernman Can you run this and tell me if you see something different? Since year is forced to be a integer or numeric this can't be explained by you supplying year as a character.

While I cannot replicate your example using the above I can replicate it like this

INP <- data.frame(species, year, index, se)
INP1 <- data.frame(species, year)
INP2 <- merge(INP, INP1, by=c("species","year"), sort=TRUE, all=TRUE)
head(INP2)
  species year    index        se
1       a    1 5.556199 0.5069077
2       a   10 3.803790 0.4756562
3       a   11 5.591480 0.7029742
4       a   12 8.703510 0.0623581
5       a   13 5.192648 0.4879276
6       a   14 9.656931 0.6621375

But I cannot tell if this is an issue for the code later on. The order of years is defined elsewhere uyear <- sort(unique(year)). suggesting the order of years in INP2 should not be important.

Either way we can implement a fix

# Since INP and INP1 are only used here we can get rid of them
INP <- data.frame(species, year, index, se)
INP1 <- data.frame(species, year)
INP2 <- merge(INP, INP1, by=c("species","year"), sort=TRUE, all=TRUE)

# Suggested change
INP2 <- data.frame(species, year, index, se)[order(species, year), ]

Let me know your thoughts, in my tests this seems to work as expected, but as I mention above, I was not able to replicate your original issue, so I would like to be sure I understand your issue before I implement a fix.

MartinStjernman commented 5 years ago

@AugustT If I'm not totally mistaken, from your example it is difficult to tell whether the sorting in msi_tool has messed up your results. Yes, from msi_out$results$year you can see that years are sorted. But you need to check that each year has been associated with the right multispecies index in the output. I would guess it has not. Since I had (in my simulated data) indices that constantly decreased over time for all species I could clearly see that year 2 in my sequence of 10 years (1:10) was associated with the multispecies index of year 10 in the output. You can't easily see that in your output.

This "faulty" association is, as far as I can follow the code, created in the step when INP is merged with INP1 to create INP2. INP2 (with its faulty association between year and index) is then merged further down the code with INP3 to produce INP4 which in turn is subsetted to INP5. Note, however, that when INP2 is merged with INP3 the sort-statement is set to FALSE which means that the order of INP2 is maintained. This order is then kept in INP5 and INP5$index (which is hence sorted 1, 10, 2, 3, 4, 5 etc, in my example) is used to calculate the multispecies index and it is as far as I can see kept in that order throughout. But then these MSI's are combined with year in the order 1, 2, 3, 4, 5 etc.!

As I've already mentioned, there seems to be no reason to do this sorting in the first place and in fact it messes up the order of species plotted in the meanCV plot (since there is apparently no mentioning of species coming in alphabetical order and not in the order supplied by the user). But if the sorting is needed for some reason it can be done in a "safer" way using f ex order(). merge() is a tricky thing that needs to be kept on a short leash ;-)

I fully understand that you want to keep the code in the BRC-version of msi_tool as true as possible to the original. But in my mind, the code could do with some thorough washing as there are a number of (scary) steps in there that could have been done more efficiently (or at least more foolproof). Unfortunately, I do not have time at the moment to help out with this and ideally should be done by people that have a deeper understanding of the calculations than I have.

Thanks for your efforts in making this package. It is very much appreciated.

MartinStjernman commented 5 years ago

And, yes, as far as I can see, your fix should work. Alternatively, one could get INP2 from rdata directly (i.e. take advantage of the data.frame format where observations are kept "intact") by using f ex.

INP2 <- rdata[order(rdata$species, rdata$year), ]

I don't really follow why variables are extracted from data.frames (e.g. rdata) in lines like species <- rdata[, 'species'] year <- rdata[, 'year'] This is (imho) to ask for trouble as variables for an observation is then not kept together. But I haven't scrutinized the code enough to see whether these steps are necessary.

AugustT commented 5 years ago

These rather odd assignments:

species <- rdata[, 'species']

While odd, are there to make the code as identical as possible to the original code.

MartinStjernman commented 5 years ago

Yes, I understand and fully appreciate your desire to keep the code intact. Those remarks was more directed towards the original code.

AugustT commented 5 years ago

Thanks for your feedback. I'm going to create a repex of your issue so I can be sure this solution works.

AugustT commented 5 years ago

Okay, so here is a reproducible example of the problem, see that the sorting has moved the year 10 to be after year 1 and before year 2.

library(BRCindicators)

set.seed(123)

# Create dummy data 
nyr = 10
species = rep(letters, each = nyr)
year = rev(rep(1:nyr, length(letters)))

# Create an index value that increases with time
index = rep(seq(10, 5, length.out = nyr), length(letters))
index = index2 * runif(n = length(index2), 0.8, 1.1)

se = runif(n = nyr * length(letters), min = 0.01, max = .8)

data <- data.frame(species, year, index, se)

# see that it increases linearly
plot(data$year, data$index)

image

# Run the MSI function
msi_out <- msi(data, plot = FALSE)

plot(msi_out)

image

# The output gives no indication of the error
head(msi_out$results)
  year    MSI sd_MSI lower_CL_MSI upper_CL_MSI  Trend
1    1  83.61   0.00        83.61        83.61 100.00
2    2 178.80   4.90       169.46       188.67 115.68
3    3  99.93   3.07        94.09       106.13 119.08
4    4 104.77   3.20        98.67       111.23 100.70
5    5 115.95   3.65       109.02       123.31 115.99
6    6 127.10   4.00       119.49       135.19 124.55
  lower_CL_trend upper_CL_trend       trend_class
1          98.59         101.59 moderate_increase
2         112.28         119.22 moderate_increase
3         114.55         124.09 moderate_increase
4          96.64         104.69 moderate_increase
5         111.16         120.42 moderate_increase
6         119.22         130.45 moderate_increase
AugustT commented 5 years ago

I will now work on implementing the fix and will point the original authors to this thread so that they can address the issue in their original code and follow up on any analyses that might have been effected by this bug in the past.

AugustT commented 5 years ago

I fixed this by changing the line in msi_tool to

  INP2 <- data.frame(species = rdata[,"species"],
                     year = rdata[,"year"],
                     index = rdata[,"index"],
                     se = rdata[,"se"])[order(rdata[,"species"],
                                         rdata[,"year"]), ]

I tested this under simulated increases, declines, random, and with years 1:10 and 1991:2000, all the results looked correct.

image