cjbarrie / academictwitteR

Repo for academictwitteR package to query the Twitter Academic Research Product Track v2 API endpoint.
Other
272 stars 59 forks source link

Furrr support #354

Closed TimBMK closed 1 year ago

TimBMK commented 2 years ago

Implementation of furrr parallel processing for bind_tweets. Effectively, this allows for parallel processing when reading in data via bind_tweets, speeding things up especially when it comes to large data.

Changes are minimal, they effectively replace purrr::map_dfr() with furrr::future_map_dfr() in .flat() and set up the required multisession if more than one thread is used. As future_map_dfr() is a drop-in replacement of map_dfr, this should behave exactly the same and be 100% compatible when used on a single thread.

Dependencies parallel (for detectCores()), future (for plan()) and furrr (for future_map_dfr()) are introduced.

chainsawriot commented 2 years ago

Observations

TimBMK commented 2 years ago

Good points. I think setting or not setting the plan() for users is a matter of user friendliness vs customization. For the majority of users it may be decisively more convenient to set the workers within the functions or have them set automatically. There may be a small number of users running bind_tweets() on clusters, but that might be hypothetical. Due to the nature of the function, I also do not believe there will be any meaningful implementations into other functions. If anyone wishes to bind the tweets, they will/should use the dataframe it produces, rather than re-running bind_tweets() in each of their function calls.

Changing the script so that the plan is not set within is not a problem, and I definetly missed undoing the plan() session. But I think some users may shy away from using the parallelization if it requires them to set up the plan() themselves. It would at least need a small tutorial in the help file, I think. What do you think?

Should we add usethis::use_package("future") somewhere in the R files? If so, where?

Finally, a small note for anyone trying out the current build: due to how furrr operates, you need to install() rather than load_all() the build. Otherwise, the workers will be unable to find certain helper functions like .flat()

Edit: Would it be a solution to give users the choice whether or not they want to manually set the plan(), with auto setting it as the default?

chainsawriot commented 2 years ago

@TimBMK No, you run this in your R console: usethis::use_package("future")

So that the dependency of future is added to DESCRIPTION; commit that, and then the checks would probably pass.

TimBMK commented 1 year ago

It took me a while to get around and implement these things. The dependancy should be added now. Regarding the planning of the session, I've added an option to automatically set the plan (and re-set it after the deed is done for a clean exit). This is the default option for convenience, but the option to set up a session with plan() yourself is available and documented.

cjbarrie commented 1 year ago

Thanks a lot for all your work on this, @TimBMK. I've reviewed this morning. We're getting errors when building because:

bind_tweets(system.file("extdata", "tweetdata", package = "academictwitteR"), output_format = "tidy")

is giving an error:

Error in .flat(data_path, output_format = output_format, parallel_workers = parallel_workers) : object 'auto_set_plan' not found

Could you quickly explain what's going on here?

TimBMK commented 1 year ago

Oops, my bad. I copy pasted the code from another branch (#299) and forgot to include the new variable at the bind_tweets level. It should work now

TimBMK commented 1 year ago

The piece of code you reference does two things: a) it checks if the session is set automatically or through a user-defined plan-session (auto_set_plan = F), then it checks if there is more than one core involved - otherwise it is not necessary to plan a session, as furrr flawlessly works without a multisession-setup (essentially becoming purrr). b) it makes sure to undo the plan session on exit of the function if it is set automatically, to avoid messing with other furrr functions / plan sessions, and allow R to reallocate the ressources. This is in accordance with future's best practice guide

cjbarrie commented 1 year ago

That makes sense. We're still getting errors on build. I can't immediately see why but the same auto_set_plan error is rearing up

TimBMK commented 1 year ago

Second fix, now I should've gotten them all. The build works for me now.

A note on additional testing: due to how future/furrr works, it is necessary to devtools::install() the build. With load_all(), furrr will be unable to find the associated functions

cjbarrie commented 1 year ago

It seems we're failing R CMD CHECK Actions on here because internally this is calling devtools::load_all() and so the furrr support is not loading. We are also failing locally with:

Error in library(httpest) : there is no package called ‘httpest’

TimBMK commented 1 year ago

I am not sure about httpest, as I have not used this package in my builds. I've had problems with these checks on all my commits, and I am unsure how to fix them. @chainsawriot suggested it might be an issue with secret objects/tokens in your environment that I cannot reproduce? Any suggestions to make them pass are welcome

Edit: the missing furrr dependancy that caused some of the fails should be fixed now, but I am not sure about the others

chainsawriot commented 1 year ago

I once again beg you to make this an option, not default; it can't pass the manual CRAN check with a default plan like this.

If you want to make the plan a default for you, make auto_set_plan an environment variable/option (I don't know, ACADEMICTWITTER_CPUS, like TESTTHAT_CPUS), and don't use parallel::detectCores().

TimBMK commented 1 year ago

Do you suggest setting auto_set_plan = FALSE as the default or removing the option alltogether @chainsawriot ?

chainsawriot commented 1 year ago

There is nothing in the codebase that points to httpest. The correct name is httptest.

TimBMK commented 1 year ago

There is nothing in the codebase that points to httpest. The correct name is httptest.

I am still not sure where this error is coming from, as this build does does not utilize httptest, and I am not sure where the library() calls are being done. One way or the other, it seems like a typo?

chainsawriot commented 1 year ago
parallel_workers = Sys.getenv("ACADEMICTWITTER_CPUS"), auto_set_plan = Sys.getenv("ACADEMICTWITTER_CPUS") != ""

So, if ACADEMICTWITTER_CPUS is "" (not set), no plan as default (CRAN and "ordinary users"). If ACADEMICTWITTER_CPUS is a number (str->number) and auto_set_plan is also the default, the default is creating a plan. (Your case) If ACADEMICTWITTER_CPUS is a number (str->number), but auto_set_plan is FALSE, no plan.

TimBMK commented 1 year ago

I do not see the added benefit of this approach over falling back to the suggested option to have users set plan() themselves, apart from the very specific case where you want to have different plan() settings for academictwitteR than for other applications. The idea was to speed up the function with a user/beginner friendly default setting that does not require additional commands. If you think this will not pass CRAN checks even when following future's best practices (clean exit and an option for users/package developers to circumvent the planning), I would rather drop the auto_set_plan option alltogether and force users to plan() themselves

chainsawriot commented 1 year ago

@TimBMK Let @cjbarrie decide. I am not the maintainer.

cjbarrie commented 1 year ago

I would say it is not too much to ask to ask users to set plan themselves. I would prefer this option than have to face the hair-pulling of circumenting CRAN checks. As long as we include clear vignette/manual documentation for how to specify this option, then I'm comfortable people will be able to use.

I recognize it's not the optimal solution--and I would prefer to have this obvious improvement as default, but it seems we'll have to wait for things on the CRAN side to change before we can smooth this out.

TimBMK commented 1 year ago

Alright, I'll whip something up that only utilizes manual plan() settings and documents the code accordingly

TimBMK commented 1 year ago

I've removed auto-setting the workers, users now need to manually specify a multisession via plan(). This should be adequately reflected in the documentation, but please take another look here. I've also run some benchmarks:

Unit: seconds
             expr       min        lq      mean    median        uq       max neval
    sequential_hk  2.408233  2.420916  2.469018  2.445454  2.531645  2.538841     5
   sequential_hk2  6.139798  6.157054  6.216220  6.158616  6.270495  6.355139     5
 sequential_large 19.432038 19.457115 19.634191 19.551847 19.844659 19.885297     5

 Unit: seconds
        expr       min        lq      mean    median        uq       max neval
    mutli_hk 0.8616603 0.8765583 0.8928872 0.8846187 0.9070763 0.9345226     5
   multi_hk2 1.9653210 2.0739980 2.1111773 2.1244037 2.1764602 2.2157038     5
 multi_large 4.3413250 4.4966270 4.5954254 4.6239296 4.7334721 4.7817733     5

I've run these on a windows machine with 6 threads. The "large" dataset is a sample of 2.413 tweets. As you can see, performance gains are significant for all datasets, but become more noticeable for larger data. I've used the microbenchmark package for the benchmarking.

Do you think it is necessary to write additional tests for the parallel sessions? In theory, the parallel mapping behaves exactly like the sequential map_dfr() function and all necessary tests should have been conducted by furrr

cjbarrie commented 1 year ago

That's super. Thanks, @TimBMK. This now looks good to me. I've added a few lines to the vignette documentation too. There is still this text-coverage fail that we need to work out and I will look into next

cjbarrie commented 1 year ago

Okay so it's nothing to do with the secret env variables. It seems there's just some errors being thrown from our #331 test-coverage Action.

@chainsawriot can you see any obvious reason this is happening? The error seems to be related to the ubuntu tests:

The repository 'https://ppa.launchpadcontent.net/cran/travis/ubuntu jammy Release' does not have a Release file.

cjbarrie commented 1 year ago

These should now pass, @TimBMK if you incorporate #366 which just merged into master

TimBMK commented 1 year ago

I've implemented the upstream changes from Master, but test-coverage is still failing. Any ideas?

Error: Error: Failure in /home/runner/work/_temp/package/academictwitteR/academictwitteR-tests/testthat.Rout.fail

chainsawriot commented 1 year ago

There are 3 failed test cases; but I can confirm that the current master also can't pass those tests. Probably the problem is #355

I don't have time to plug #355 ; but I can disable the tests that check for silence.

setwd("tests/testthat"); testthat::test_file("test-hydrate.R", package = "academictwitteR", load_package = "source"); setwd("../../")
cjbarrie commented 1 year ago

I have now merged the disabling of tests waiting for silence #369. Thank you, @chainsawriot

TimBMK commented 1 year ago

Latest changes (silencing) was merged into this branch, but the test-coverage fail seems to be the same as before