DistanceDevelopment / Distance

Simple distance sampling analysis
GNU General Public License v3.0
9 stars 8 forks source link

Bootstrapping a stratified design fails under usual conditions #157

Open erex opened 1 year ago

erex commented 1 year ago

There appears to be a tacit assumption by bootdht regarding naming of Sample.Label. Based upon question from the list, bootdht will return a data frame with 0 columns and 0 rows when Sample.Label names are not unique across strata.

Usual numbering for transects in two strata might be 1, ..., 20 in Stratum 1 and 1, ..., 20 in Stratum 2. If this is the nature of the data provided to bootdht it will fail with the resul of 0 columns and 0 rows in the returned data frame.

Workaround solution is to append the Region.Label onto the Sample.Label, as in

easy$Sample.Label <- paste0(easy$Sample.Label, "-", substr(easy$Region.Label, 5, 8))

Nevertheless, this behaviour either needs to be changed or documented. I encountered this problem elsewhere and mentioned it in this issue

LHMarshall commented 9 months ago

@erex do you have code to reproduce this example? I tried to reproduce it based on your description but was unsuccessful. I couldn't see code to reproduce it in the other referenced issues either but maybe I missed something. I have also tested when both Region.Label and Sample.Label are either both numeric or both character and it runs fine under both scenarios. Thanks.

# generate some data to test on
set.seed(753)
dat <- data.frame(object = 1:60, Sample.Label = rep(1:10,6),
                  Area = 100, Effort = 1000)
dat$Region.Label <- c(rep("StrataA", 30), rep("StrataB", 30))
dat$distance <- abs(rnorm(nrow(dat), 0, 25))
dat$size <- rpois(nrow(dat), 20)

# Check we have repeated Sample.Label's over strata as described above
unique(dat[,c("Region.Label", "Sample.Label")])

conversion.factor <- convert_units("meter", "meter", "square kilometer")

fit.ds <- ds(data=dat,
             truncation=50,
             key="hn",
             adjustment=NULL,
             convert_units=conversion.factor,
             formula=~size)

set.seed(225)
  easybootn <- bootdht(model=fit.ds,
                                        flatfile=dat,
                                        summary_fun=bootdht_Nhat_summarize,
                                        convert_units=conversion.factor,
                                        sample_fraction=1,
                                        nboot=3, cores=1,
                                        progress_bar = "none")

  # Make check table
  check.tab <- easybootn %>% dplyr::group_by(Label) %>%
    dplyr::summarize(LCI=quantile(Nhat,probs=0.025,na.rm=TRUE),
                     UCI=quantile(Nhat,probs=0.975,na.rm=TRUE))

check.tab
# A tibble: 3 × 3
  Label       LCI     UCI
  <chr>     <dbl>   <dbl>
1 StrataA  64763. 106952.
2 StrataB  60080. 100407.
3 Total   124844. 207359.
LHMarshall commented 9 months ago

This situation does cause issues but not the issue described above. When the sampler ID's are not distinct across strata then object ID's are all associated with each occurrence of the sampler ID irrelevant of which strata the object was associated with even if it differs from the sampler ID.

LHMarshall commented 9 months ago

Add some documentation to bootdht to say that transect ID's need to be unique across all strata (not just within strata)

LHMarshall commented 9 months ago

When this is fixed remember to remove the additional text from the manual.

LHMarshall commented 9 months ago

Demonstrating the code failing

# These are the resampled transects
Browse[2]> samples
$StrataA
 [1]  3  3 10  2  6  5  4  6  9 10

$StrataB
 [1]  5  3  9  9  9  3  8 10  7 10

Browse[2]> n
debug at C:/Users/lhm/Documents/Github/Distance/R/bootdht_resample_data.R#44: obs_per_sample <- obs_per_sample[unlist(samples)]
Browse[2]> n
debug at C:/Users/lhm/Documents/Github/Distance/R/bootdht_resample_data.R#47: obs <- obs_per_sample

#These are the object ID's extracted for each of the resampled transects. 

Browse[2]> obs_per_sample
$`3`
[1]  3 13 23 33 43 53

$`3`
[1]  3 13 23 33 43 53

$`10`
[1] 10 20 30 40 50 60

$`2`
[1]  2 12 22 32 42 52

$`6`
[1]  6 16 26 36 46 56

$`5`
[1]  5 15 25 35 45 55

$`4`
[1]  4 14 24 34 44 54

$`6`
[1]  6 16 26 36 46 56

$`9`
[1]  9 19 29 39 49 59

$`10`
[1] 10 20 30 40 50 60

$`5`
[1]  5 15 25 35 45 55

$`3`
[1]  3 13 23 33 43 53

$`9`
[1]  9 19 29 39 49 59

$`9`
[1]  9 19 29 39 49 59

$`9`
[1]  9 19 29 39 49 59

$`3`
[1]  3 13 23 33 43 53

$`8`
[1]  8 18 28 38 48 58

$`10`
[1] 10 20 30 40 50 60

$`7`
[1]  7 17 27 37 47 57

$`10`
[1] 10 20 30 40 50 60

# We can see that objects associated with the transect ID in both strata are being selected

Browse[2]> bootdat[bootdat$object %in% obs_per_sample$'3',]
   object Sample.Label Area Effort Region.Label  distance size ref.object
3       3            3  100   1000      StrataA 52.208953   25          3
13     13            3  100   1000      StrataA  5.462644   26         13
23     23            3  100   1000      StrataA 31.673722   17         23
33     33            3  100   1000      StrataB 19.394133   16         33
43     43            3  100   1000      StrataB  6.637318   18         43
53     53            3  100   1000      StrataB 32.102037   13         53