Closed TimBMK closed 1 year ago
Observations
usethis::use_package("future")
plan for the users. We can somehow import future::plan() and mention in the doc that the users could choose their plan.
Good points. I think setting or not setting the plan() for users is a matter of user friendliness vs customization. For the majority of users it may be decisively more convenient to set the workers within the functions or have them set automatically. There may be a small number of users running bind_tweets() on clusters, but that might be hypothetical. Due to the nature of the function, I also do not believe it will be meaningfully embedded in other functions. If anyone wishes to bind the tweets, they will/should use the dataframe it produces, rather than re-running bind_tweets() in each of their function calls.
Changing the script so that the plan is not set within is not a problem, and I definitely missed undoing the plan() session. But I think some users may shy away from using the parallelization if it requires them to set up the plan() themselves. It would at least need a small tutorial in the help file, I think. What do you think?
Should we add usethis::use_package("future") somewhere in the R files? If so, where?
Finally, a small note for anyone trying out the current build: due to how furrr operates, you need to install() rather than load_all() the build. Otherwise, the workers will be unable to find certain helper functions like .flat()
Edit: Would it be a solution to give users the choice whether or not they want to manually set the plan(), with auto setting it as the default?
@TimBMK No, you run this in your R console: usethis::use_package("future") so that the dependency on future is added to DESCRIPTION; commit that, and then the checks would probably pass.
It took me a while to get around to implementing these things. The dependency should be added now. Regarding the planning of the session, I've added an option to automatically set the plan (and reset it after the deed is done for a clean exit). This is the default option for convenience, but the option to set up a session with plan() yourself is available and documented.
Thanks a lot for all your work on this, @TimBMK. I've reviewed this morning. We're getting errors when building because:
bind_tweets(system.file("extdata", "tweetdata", package = "academictwitteR"), output_format = "tidy")
is giving an error:
Error in .flat(data_path, output_format = output_format, parallel_workers = parallel_workers) : object 'auto_set_plan' not found
Could you quickly explain what's going on here?
Oops, my bad. I copy-pasted the code from another branch (#299) and forgot to include the new variable at the bind_tweets level. It should work now.
The piece of code you reference does two things: a) it checks whether the plan is set automatically or through a user-defined plan() session (auto_set_plan = F), and then whether more than one core is involved; otherwise it is not necessary to set a plan, as furrr works without a multisession setup (essentially becoming purrr). b) If the plan is set automatically, it undoes the plan on exit of the function, to avoid interfering with other furrr functions / plan sessions and to allow R to reallocate the resources. This is in accordance with future's best practice guide.
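The set-and-restore pattern described here can be sketched roughly as follows. This is not the package's actual code: with_auto_plan() is a hypothetical helper, and only the argument names parallel_workers and auto_set_plan are taken from the discussion. It relies on future::plan() returning the previous plan so that on.exit() can put it back.

```r
library(future)

# Hypothetical helper (not part of academictwitteR): runs `fun` under a
# temporary multisession plan and restores the caller's plan afterwards.
with_auto_plan <- function(fun, parallel_workers = 2, auto_set_plan = TRUE) {
  # Only touch the plan when asked to, and only if more than one worker is
  # requested; with a single worker the sequential default is sufficient.
  if (auto_set_plan && parallel_workers > 1) {
    old_plan <- plan(multisession, workers = parallel_workers)
    # Restore whatever plan was active before, even if `fun` errors.
    on.exit(plan(old_plan), add = TRUE)
  }
  fun()
}

# Report how many workers are visible while the temporary plan is active.
workers_inside <- with_auto_plan(function() nbrOfWorkers())
```

With auto_set_plan = FALSE the helper leaves the plan untouched, so whatever the user configured via plan() beforehand stays in effect.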
That makes sense. We're still getting errors on build. I can't immediately see why but the same auto_set_plan error is rearing up.
Second fix, now I should've gotten them all. The build works for me now.
A note on additional testing: due to how future/furrr works, it is necessary to devtools::install() the build. With load_all(), furrr will be unable to find the associated functions
It seems we're failing R CMD CHECK Actions on here because internally this is calling devtools::load_all() and so the furrr support is not loading. We are also failing locally with:
Error in library(httpest) : there is no package called ‘httpest’
I am not sure about httpest, as I have not used this package in my builds. I've had problems with these checks on all my commits, and I am unsure how to fix them. @chainsawriot suggested it might be an issue with secret objects/tokens in your environment that I cannot reproduce? Any suggestions to make them pass are welcome
Edit: the missing furrr dependency that caused some of the fails should be fixed now, but I am not sure about the others.
I once again beg you to make this an option, not default; it can't pass the manual CRAN check with a default plan like this.
If you want to make the plan a default for you, make auto_set_plan an environment variable/option (I don't know, ACADEMICTWITTER_CPUS, like TESTTHAT_CPUS), and don't use parallel::detectCores().
Do you suggest setting auto_set_plan = FALSE as the default or removing the option altogether, @chainsawriot?
There is nothing in the codebase that points to httpest. The correct name is httptest.
I am still not sure where this error is coming from, as this build does not utilize httptest, and I am not sure where the library() calls are being made. One way or the other, it seems like a typo?
parallel_workers = Sys.getenv("ACADEMICTWITTER_CPUS"), auto_set_plan = Sys.getenv("ACADEMICTWITTER_CPUS") != ""
So, if ACADEMICTWITTER_CPUS is "" (not set), no plan as default (CRAN and "ordinary users").
If ACADEMICTWITTER_CPUS is a number (str->number) and auto_set_plan is also the default, the default is creating a plan. (Your case)
If ACADEMICTWITTER_CPUS is a number (str->number), but auto_set_plan is FALSE, no plan.
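The three cases can be written out as a small base-R sketch. resolve_parallel_defaults() is a hypothetical helper, not something in the codebase, and falling back to a single worker when the variable is unset is an assumption made here for illustration.

```r
# Hypothetical helper illustrating the proposed defaults; it is not part of
# the package. It reads ACADEMICTWITTER_CPUS and derives both settings.
resolve_parallel_defaults <- function(auto_set_plan = NULL) {
  raw <- Sys.getenv("ACADEMICTWITTER_CPUS")  # "" when the variable is unset
  is_set <- nzchar(raw)
  # Assumption: fall back to one worker when nothing is configured.
  workers <- if (is_set) suppressWarnings(as.integer(raw)) else 1L
  if (is.na(workers) || workers < 1L) {
    stop("ACADEMICTWITTER_CPUS must be a positive integer, got: ", raw)
  }
  # Default for auto_set_plan: TRUE only when the variable is set.
  if (is.null(auto_set_plan)) auto_set_plan <- is_set
  list(parallel_workers = workers, auto_set_plan = auto_set_plan)
}

Sys.unsetenv("ACADEMICTWITTER_CPUS")
case_unset <- resolve_parallel_defaults()                        # no plan

Sys.setenv(ACADEMICTWITTER_CPUS = "4")
case_set <- resolve_parallel_defaults()                          # plan created
case_optout <- resolve_parallel_defaults(auto_set_plan = FALSE)  # no plan

Sys.unsetenv("ACADEMICTWITTER_CPUS")
```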
I do not see the added benefit of this approach over falling back to the suggested option to have users set plan() themselves, apart from the very specific case where you want to have different plan() settings for academictwitteR than for other applications. The idea was to speed up the function with a user/beginner friendly default setting that does not require additional commands. If you think this will not pass CRAN checks even when following future's best practices (clean exit and an option for users/package developers to circumvent the planning), I would rather drop the auto_set_plan option altogether and force users to plan() themselves.
@TimBMK Let @cjbarrie decide. I am not the maintainer.
I would say it is not too much to ask users to set the plan themselves. I would prefer this option to facing the hair-pulling of circumventing CRAN checks. As long as we include clear vignette/manual documentation for how to specify this option, I'm comfortable people will be able to use it.
I recognize it's not the optimal solution--and I would prefer to have this obvious improvement as default, but it seems we'll have to wait for things on the CRAN side to change before we can smooth this out.
Alright, I'll whip something up that only utilizes manual plan() settings and documents the code accordingly
I've removed auto-setting the workers, users now need to manually specify a multisession via plan(). This should be adequately reflected in the documentation, but please take another look here. I've also run some benchmarks:
Unit: seconds
             expr        min         lq       mean     median         uq        max neval
    sequential_hk   2.408233   2.420916   2.469018   2.445454   2.531645   2.538841     5
   sequential_hk2   6.139798   6.157054   6.216220   6.158616   6.270495   6.355139     5
 sequential_large  19.432038  19.457115  19.634191  19.551847  19.844659  19.885297     5
Unit: seconds
             expr        min         lq       mean     median         uq        max neval
         mutli_hk  0.8616603  0.8765583  0.8928872  0.8846187  0.9070763  0.9345226     5
        multi_hk2  1.9653210  2.0739980  2.1111773  2.1244037  2.1764602  2.2157038     5
      multi_large  4.3413250  4.4966270  4.5954254  4.6239296  4.7334721  4.7817733     5
I've run these on a Windows machine with 6 threads. The "large" dataset is a sample of 2.413 tweets. As you can see, performance gains are significant for all datasets (roughly 2.8x and 2.9x faster by mean runtime for the two smaller ones), and become more noticeable for larger data (roughly 4.3x). I've used the microbenchmark package for the benchmarking.
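For reference, the speedup factors can be recomputed from the mean column of the two tables above. This is only arithmetic on the reported numbers; the data frame and its column names are made up for the illustration.

```r
# Mean runtimes in seconds, copied from the microbenchmark output above.
bench <- data.frame(
  dataset         = c("hk", "hk2", "large"),
  sequential_mean = c(2.469018, 6.216220, 19.634191),
  multi_mean      = c(0.8928872, 2.1111773, 4.5954254)
)

# Speedup = sequential runtime divided by parallel runtime.
bench$speedup <- round(bench$sequential_mean / bench$multi_mean, 2)
print(bench)
```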
Do you think it is necessary to write additional tests for the parallel sessions? In theory, the parallel mapping behaves exactly like the sequential map_dfr() function and all necessary tests should have been conducted by furrr
That's super. Thanks, @TimBMK. This now looks good to me. I've added a few lines to the vignette documentation too. There is still this test-coverage fail that we need to work out and I will look into next.
Okay so it's nothing to do with the secret env variables. It seems there are just some errors being thrown from our #331 test-coverage Action.
@chainsawriot can you see any obvious reason this is happening? The error seems to be related to the ubuntu tests:
The repository 'https://ppa.launchpadcontent.net/cran/travis/ubuntu jammy Release' does not have a Release file.
These should now pass, @TimBMK, if you incorporate #366, which was just merged into master.
I've implemented the upstream changes from Master, but test-coverage is still failing. Any ideas?
Error: Error: Failure in /home/runner/work/_temp/package/academictwitteR/academictwitteR-tests/testthat.Rout.fail
There are 3 failed test cases, but I can confirm that the current master also can't pass those tests. Probably the problem is #355.
I don't have time to plug #355, but I can disable the tests that check for silence.
setwd("tests/testthat"); testthat::test_file("test-hydrate.R", package = "academictwitteR", load_package = "source"); setwd("../../")
I have now merged the disabling of tests waiting for silence #369. Thank you, @chainsawriot
The latest changes (silencing) were merged into this branch, but the test-coverage fail seems to be the same as before.
Implementation of furrr parallel processing for bind_tweets. Effectively, this allows for parallel processing when reading in data via bind_tweets, speeding things up especially when it comes to large data.
Changes are minimal: they effectively replace purrr::map_dfr() with furrr::future_map_dfr() in .flat() and set up the required multisession if more than one thread is used. As future_map_dfr() is a drop-in replacement for map_dfr(), this should behave exactly the same and be 100% compatible when used on a single thread.
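As a rough illustration of the drop-in claim, the sketch below runs the same row-binding map once with purrr and once with furrr under a two-worker multisession plan. parse_one() is a hypothetical stand-in for the per-file parsing, not the package's .flat(), and the example assumes purrr, furrr, future and dplyr (needed by the _dfr variants) are installed.

```r
library(purrr)
library(furrr)

# Hypothetical stand-in for the per-file work: returns one row per input.
parse_one <- function(i) {
  data.frame(id = i, square = i^2)
}

# Sequential version using purrr.
seq_result <- map_dfr(1:5, parse_one)

# Parallel version: same call shape, but run under a multisession plan.
old_plan <- future::plan(future::multisession, workers = 2)
par_result <- future_map_dfr(1:5, parse_one)
future::plan(old_plan)  # restore the previous plan afterwards

# Both approaches should yield the same data frame.
same_output <- isTRUE(all.equal(seq_result, par_result))
```

Without a multisession plan, future_map_dfr() simply runs sequentially, which is why the single-threaded case behaves like the purrr version.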
Dependencies parallel (for detectCores()), future (for plan()) and furrr (for future_map_dfr()) are introduced.