BiologicalRecordsCentre / sparta

Species Presence/Absence R Trends Analyses
http://biologicalrecordscentre.github.io/sparta/index.html
MIT License
21 stars 24 forks source link

sparta error when running occDetFunc #171

Closed djophillips closed 5 years ago

djophillips commented 5 years ago

I have been trying to run occDetFunc on R markdown in order to produce an occupancy model, however, I keep getting the error message:

Error in sparta::occDetFunc(taxa_name="Aplocera plagiata", n_iterations = 20000, : some visits have been lost

does anyone know what causes this error and how I may go about fixing it?

thanks

drnickisaac commented 5 years ago

Hi there, To understand the error, check out line 212 of occDetFunc, which merges the occDetdata and spp_vis objects that you would have passed in: https://github.com/BiologicalRecordsCentre/sparta/blob/087b5f5e8726e2fb98661eed564fe2d4b7754c88/R/occDetFunc.r#L211

occDetdata and spp_vis should have the same number of rows, each with a visit column that is common to both. So the merged object should have the same number of rows as the original. The error is triggered when the number of rows has changes, which would indicate something has gone wrong in the formatting.

In my experience the error has only been triggered when the occDetdata and/or spp_vis objects have been edited by the user, and never when using the output directly from formatOccData.

What did you do to produce the occDetdata and spp_vis objects going into occDetFunc?

djophillips commented 5 years ago

Hi,

I have been looking over my code and re-run it with certain variables removed to see if that helps, which it has not. To produce the occDet data and spp_vis I formatted my data sets to have the same column numbers and removed any duplicates within the data using sparta.

After this, I formatted the list lengths of the data so that they would more accuratley match as one of my data sets is vastly larger than the other, after this I just remerged the two data sets to make the occDet and spp_vis. Both of these objects are showing on R to have 1013318 obs. so I am unsure how to fix the issue considering they both have the same row numbers, surely this means that no visits have been lost?

I am currently trying to Knit my R markdown as a HTML which I will then post here to see if anyone can see an issue in my coding.

thanks

AugustT commented 5 years ago

@djophillips can you check that the visit and species columns also match? All the visits and species should be present in both tables, spellings should be that same, and class (ie both strings not one as a factor). This could cause problems when you try to merge them.

djophillips commented 5 years ago

Is there a quick way I can check that the columns match? The data set I am running has over 2.3 million obs. so manually looking through would take quite a while. thanks

AugustT commented 5 years ago
a <- sample(letters, 15)
b <- sample(letters, 15)

a
b

setdiff(a, b)

setdiff can be used to find elements that do not match across 2 vectors. If you replace a and b with you species column from each dataset, and then your visit column from each dataset, that should help.

Can you also paste the output of str(occDetdata) and str(spp_vis) (the data arguments you are putting into the function)?

@drnickisaac could we change the merge to an all = FALSE? We could have a warning to tell people sites/visits had been dropped because they don't occur in both data sets.

djophillips commented 5 years ago

when running the set diff for the visit columns I get the response "character (0)", but my spp_vis data set only has the vist column and the 46 species column whilst the occDetdata has a visit, site, L and TP column- should there also be a species column? If so I will paste the code I used to format my data and combine data sets.

The output for str(occDetdata) is:

'data.frame': 1013318 obs. of 4 variables: $ visit: chr "HP61HP6015 1974-07-011" "HP61HP6112 1914-06-011" "HU31HU3716 1959-08-011" "HU31HU3915 1960-08-141" ... $ site : Factor w/ 2675 levels "HP61","HU31",..: 1 1 2 2 2 3 3 3 4 5 ... $ L : int 1 1 1 1 2 1 1 1 1 2 ... $ TP : int 75 15 60 61 63 63 63 63 65 65 ...

The output for the str(spp_vis) is:

'data.frame': 1013318 obs. of 46 variables: $ visit : chr "HP61HP6015 1974-07-011" "HP61HP6112 1914-06-011" "HU31HU3716 1959-08-011" "HU31HU3915 1960-08-141" ... $ Aplocera plagiata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Camptogramma bilineata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Chloroclysta siterata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Cidaria fulvata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Colostygia multistrigaria: logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Colostygia pectinataria : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Cosmorhoe ocellata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Dysstroma citrata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Dysstroma truncata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Electrophaes corylata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Entephria caesiata : logi FALSE FALSE FALSE FALSE TRUE FALSE ... $ Epirrhoe alternata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Epirrita dilutata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Euchoeca nebulata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Eulithis populata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Eulithis testata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Eupithecia abbreviata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Eupithecia absinthiata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Eupithecia centaureata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Eupithecia exiguata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Eupithecia pusillata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Eupithecia satyrata : logi FALSE TRUE FALSE FALSE FALSE FALSE ... $ Eupithecia subfuscata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Eupithecia tantillaria : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Eupithecia vulgata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Gymnoscelis rufifasciata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Hydriomena furcata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Hydriomena impluviata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Lampropteryx suffumata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Lobophora halterata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Odezia atrata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Operophtera brumata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Operophtera fagata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Pasiphila rectangulata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Perizoma albulata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Scotopteryx chenopodiata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Scotopteryx luridata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Scotopteryx mucronata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Thera obeliscata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Trichopteryx carpinata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Xanthorhoe decoloraria : logi TRUE FALSE TRUE TRUE TRUE TRUE ... $ Xanthorhoe designata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Xanthorhoe fluctuata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Xanthorhoe montanata : logi FALSE FALSE FALSE FALSE FALSE FALSE ... $ Xanthorhoe spadicearia : logi FALSE FALSE FALSE FALSE FALSE FALSE ...

AugustT commented 5 years ago

What you have looks sensible. Here is a reprex to help debugging. Try flipping the order of your arguments to setdiff as the order is important. I'm still thinking that there is something different in your visits column between these two datasets. You are right that there should not be a species column in spp_vis.

Try going through this with you own data and see if it helps reveal the problem.

set.seed(123)

library(sparta)

# Create data
n <- 15000 #size of dataset
nyr <- 20 # number of years in data
nSamples <- 100 # set number of dates
nSites <- 50 # set number of sites

# Create somes dates
first <- as.Date(strptime("2010/01/01", "%Y/%m/%d"))
last <- as.Date(strptime(paste(2010+(nyr-1),"/12/31", sep=''), "%Y/%m/%d"))
dt <- last-first
rDates <- first + (runif(nSamples)*dt)

# taxa are set as random letters
taxa <- sample(letters, size = n, TRUE)

# sites are visited randomly
site <- sample(paste('A', 1:nSites, sep=''), size = n, TRUE)

# the date of visit is selected at random from those created earlier
survey <- sample(rDates, size = n, TRUE)

# Format the data
visitData <- formatOccData(taxa = taxa, site = site, survey = survey)

str(visitData$occDetdata)
str(visitData$spp_vis)

# Create an extra visit in occDetData, not in sp_vis
new_visit <- c('A102010-01-061999', 'A1', 2, 2010)
visitData$occDetdata <- rbind(visitData$occDetdata, new_visit)

# run the model with these data for one species (very small number of iterations)
results <- occDetFunc(taxa_name = taxa[1],
                      n_iterations = 50,
                      burnin = 15,
                      occDetdata = visitData$occDetdata,
                      spp_vis = visitData$spp_vis,
                      write_results = FALSE)

# Find the missmatch
setdiff(visitData$occDetdata$visit, visitData$spp_vis$visit)
setdiff(visitData$spp_vis$visit, visitData$occDetdata$visit)

# here is the lines that are failing in the code
nrow1 <- nrow(visitData$occDetdata)
taxa_name <- 'a'
occDetdata <- merge(visitData$occDetdata, visitData$spp_vis[,c("visit", taxa_name)])
if(nrow1 != nrow(occDetdata)) stop('some visits have been lost')

# the merged dataset has less rows because the visit I added is only in one of them
visitData$occDetdata$visit[!visitData$occDetdata$visit %in% occDetdata$visit]
djophillips commented 5 years ago

Before running the code you sent me, I have tried re-running my script on a much faster computer (which I have borrowed) to see if that would help. Now it seems when I get to running:

formattedOccData_BC <- sparta::formatOccData(taxa = as.factor(bc2format$species), site = as.factor(bc2format$km10grid), survey= as.character(paste(bc2format$km1grid,bc2format$dateFake)), closure_period = as.factor(bc2format$year))

R just buffers the code and doesn't show any results even after leaving it for an hour or so but will run the code when using my other data set. To put it into perspective, my computer takes up to an hour to manage the large data drops whereas this one takes around 5 or less minutes. When I stop running the code for the above I get the message "no loop for break/next, jumping to top level"- is there anyway for me to prevent this error/get the script to run. Sorry for the deviation in query, it's just that runnining it on the better computer means I can check through my code in 20 minutes as aposed to a couple of hours.

AugustT commented 5 years ago

What do you mean by 'R just buffers the code'?

Are you able to share a reproducible example, maybe with just a small amount of your data that creates the same error? It is very difficult to help without an example to work with.

You say 'but will run the code when using my other data set', what are the two data sets and how are they different?

I thought you were not using formatOccData in the original script, did I miss understand?

djophillips commented 5 years ago

I mean that the script doesn't finish running or produce an error message after a significant amount of time. I have included a screenshot of what the problem is. The two data sets are the occurence data from two different sources, one has around 19,000 obs whilst the other has about 22 million- they both show the species name, location, km10 gridsquares and km1 gridsquares. I am using formatOccData in order to format the two data sets so that they will be able to be used when making my occupancy models- I will add a script- however I only have an export as an html which is not suppoted (when I try to attatch). rstudiopaste

AugustT commented 5 years ago

Okay, well 22 million is a VERY big dataset. I have not tried running this function with a dataset so large before. I suspect it will use a lot of memory, and if your memory runs out your computer will start to write to disc which will be slow. With datasets larger than about 2 million I use the NERC High Performance Computing System to do this step because my PC cannot handle the memory requirements.

I'm trying to think of ways around this...

I'm pretty sure you could split up your dataset into smaller chunks, say 1million records in each, as long as you split by your closure period. For example if your closure period is a year do 1970-75, then 76-80, 81-85, etc etc. Once you have done this you could rbind the results together.

I dont see a reason why this wouldn't work, since formatOcData doesn't do anything that needs all the data at once (@drnickisaac might want to check my thinking). This would need you to hold all of the outputs in memory to bind them together at the end, whether that is a problem depends on the size of your PC and the size of the dataset.

Looking further down the line, an occupancy model with this much data is going to take a long time to fit. Do you have access to High Performance Computing?

djophillips commented 5 years ago

Apologies, I was meant to say 2.2 million, I can ask my supervisors if I can get access to high performance computing. I could try splitting my data set in half and using rbind, as you suggested, and see if that works. thanks

djophillips commented 5 years ago

Hi, just to update. Below is what I believe to be the issue with running my occupancy model, when sparta merges the vis and det data sets the det gains 30 extra visits:

nrow2 <- nrow(gloccDet) occDetdatag<- merge(gloccDet, glsppvis[,c("visit", "Aplocera.plagiata")])

print(nrow1) #1013318 nrow(occDetdata) #1013348

Is there a way to prevent these extra 30 occuring? To note,I have run it on another computer and there has been no issue but am unable to run each of my models on that computer as it does not belong to me, the only difference is they use an older version of sparta but I don't see why this would be an issue.

Thanks

djophillips commented 5 years ago

Hi,

After running setdiff which pointed out that there were duplicates when combining the data through sparta, I ran the code:

gloccDet$visit[duplicated(gloccDet$visit)]

which has pointed out the duplicate issues within my data which I have removed, I am now able to run the occupancy model.

thank you both for the help and advice you have given me throughout this thread- I greatly appreciate it (:

AugustT commented 5 years ago

Fantastic, I am really glad you were able to solve this.

I will create a new issue suggesting that we test for this duplication in users data so that if someone else has this problem they will be able to find a solution more quickly.

Tom