CornellLabofOrnithology / ebird-best-practices

Best Practices for Using eBird Data
https://CornellLabOfOrnithology.github.io/ebird-best-practices/

lc_extract_pred stagnates when running #6

Closed lime-n closed 4 years ago

lime-n commented 4 years ago

In 03 covariates, I find that when I run this code:

lc_extract_pred <- landcover[[paste0("y", max_lc_year)]] %>% 
  exact_extract(r_cells, progress = FALSE) %>% 
  map(~ count(., landcover = value)) %>% 
  tibble(id = r_cells$id, data = .) %>% 
  unnest(data)

It stagnates once progress reaches 100%. I have also tried it with progress = TRUE and left it running for several hours.

I have read suggestions to use data.table instead of the data frame format returned by exact_extract; however, I am just learning, so it is taking time to figure out. Could it be that it is actually still running and that the data are simply very large? My r_cells is ~3.4 GB and landcover is ~40 MB, and I have enough memory to handle a file of ~7 GB. This is a guess, but exact_extract and unnest may be the bottlenecks that prevent a successful run.

mstrimas commented 4 years ago

I suspect you have a much larger region than we're working with in the book and therefore exact_extract() has to process a much larger amount of data. unnest() may also be a problem. You could split apart the two pieces to see which is the bottleneck, e.g.

lc_extract_ext <- landcover[[paste0("y", max_lc_year)]] %>% 
  exact_extract(r_cells, progress = FALSE)
lc_extract_cnt <- map(lc_extract_ext, ~ count(., landcover = value)) %>% 
  tibble(id = r_cells$id, data = .)
lc_extract_pred <- unnest(lc_extract_cnt, data)

However, I unfortunately don't have the time to figure out the best way to optimize this for large regions. exact_extract() is the fastest option within R, so there's not much you can do there apart from using a smaller region. For unnest() there may be faster options, e.g. with data.table, but you'll have to investigate that on your own.
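As a starting point for that investigation, here is a minimal sketch of how the counting and unnesting steps might be done with data.table. It is not taken from the book: the toy list stands in for the output of exact_extract(), cell_ids stands in for r_cells$id, and the names lc_dt and cell_ids are hypothetical. Whether it is actually faster on a large region would need to be measured.

```r
library(data.table)

# Toy stand-in for the list returned by exact_extract(): one data frame per
# cell, with the land cover class in `value` (hypothetical data)
lc_extract_ext <- list(
  data.frame(value = c(1, 1, 2), coverage_fraction = c(1, 0.5, 1)),
  data.frame(value = c(2, 2, 2, 3), coverage_fraction = c(1, 1, 0.25, 1))
)
cell_ids <- c(101L, 102L)  # assumption: plays the role of r_cells$id

# Stack all per-cell data frames into one table; for an unnamed list,
# idcol records the position of each element in the list
lc_dt <- rbindlist(lc_extract_ext, idcol = "id")

# Replace the list position with the actual cell id
lc_dt[, id := cell_ids[id]]

# Count pixels per cell and land cover class in a single grouped operation,
# replacing the map(~ count(...)) and unnest() steps
lc_extract_pred <- lc_dt[, .(n = .N), by = .(id, landcover = value)]
```

The result has one row per cell and land cover class with columns id, landcover, and n, which is the same shape the tibble-based pipeline produces after unnest().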

lime-n commented 4 years ago

The dataset covers Australia; I am currently trying to interpret and understand species distribution around Australia.

The first line of code works, so it is a relief to know that exact_extract is not the issue. However, the stagnation seems to occur when using map; I suspect it is ~ count(., landcover = value) that slows down the process.
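One way to check a suspicion like this without waiting hours might be to time the counting step on a small subset of cells and extrapolate. The sketch below uses randomly generated toy data in place of the real exact_extract() output; n_sub, t_sub, and est_total are hypothetical names, and the linear extrapolation is only a rough guide.

```r
library(dplyr)
library(purrr)

# Toy stand-in for the exact_extract() output: 200 cells with 50 pixels each
# (hypothetical data, for illustration only)
set.seed(1)
lc_extract_ext <- replicate(
  200,
  data.frame(value = sample(1:5, 50, replace = TRUE)),
  simplify = FALSE
)

# Time the counting step on the first n_sub cells only
n_sub <- 50
t_sub <- system.time(
  cnt_sub <- map(lc_extract_ext[seq_len(n_sub)], ~ count(., landcover = value))
)

# Rough linear extrapolation of elapsed seconds to the full list of cells
est_total <- t_sub[["elapsed"]] * length(lc_extract_ext) / n_sub
```

If est_total comes out at many hours for the real data, the step is probably still running rather than stuck.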

EDIT: It eventually worked; I just had to wait longer than intended. However, I find that breaking up the code is far better, and I have learnt more about data.table and how it can be used to reduce wait time.

Thank you!