MRC-CSO-SPHSU / UC-DiD-analysis

MIT License

BUG: b3d5e58 fails to run using provided data files #2

Closed vkhodygo closed 1 year ago

vkhodygo commented 1 year ago

I tried to run the project on the workstation using the data you sent me. It fails with this message:

Error in `group_by()`:
! Must group by variables found in `.data`.
✖ Column `benefit` is not found.
Backtrace:
    ▆
 1. ├─ggplot2::ggplot(...)
 2. ├─dplyr::arrange(...)
 3. ├─dplyr::filter(...)
 4. ├─dplyr::mutate(...)
 5. ├─dplyr::summarise(...)
 6. ├─dplyr::group_by(...)
 7. └─dplyr:::group_by.data.frame(...)
 8.   └─dplyr::group_by_prepare(.data, ..., .add = .add, caller_env = caller_env())
 9.     └─rlang::abort(bullets, call = error_call)
Execution halted

What do you think went wrong?

andrewbaxter439 commented 1 year ago

Can you check whether the data has loaded correctly? The `benefit` column should be created in the `pivot_longer()` call at L38-43.

vkhodygo commented 1 year ago

@andrewbaxter439

I'm not actually sure, since I:

  1. didn't check the code itself;
  2. ran it directly in the terminal via `Rscript`.

I'll try it again.

vkhodygo commented 1 year ago

A brief update: one must run `Rscript` from the root directory, i.e., `Rscript ./R/predicting_benefit_receipt.R`.

In this case it fails like this:

Universal Credit - Difference in Difference project opened
── Attaching packages ─────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ──────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
── Attaching packages ────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.1     ✔ rsample      1.1.0
✔ dials        1.1.0     ✔ tune         1.0.1
✔ infer        1.0.3     ✔ workflows    1.1.0
✔ modeldata    1.0.1     ✔ workflowsets 1.0.0
✔ parsnip      1.0.2     ✔ yardstick    1.1.0
✔ recipes      1.0.2     
── Conflicts ─────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter()   masks stats::filter()
✖ recipes::fixed()  masks stringr::fixed()
✖ dplyr::lag()      masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step()   masks stats::step()
• Use suppressPackageStartupMessages() to eliminate package startup messages

Attaching package: ‘data.table’

The following objects are masked from ‘package:dplyr’:

    between, first, last

The following object is masked from ‘package:purrr’:

    transpose

Error in `mutate()`:
! Problem while computing `marsta = factor(...)`.
Caused by error in `if_else()`:
! `false` must be a double vector, not an integer vector.
Backtrace:
     ▆
  1. ├─dplyr::select(...)
  2. ├─dplyr::mutate(...)
  3. ├─dplyr:::mutate.data.frame(...)
  4. │ └─dplyr:::mutate_cols(.data, dplyr_quosures(...), caller_env = caller_env())
  5. │   ├─base::withCallingHandlers(...)
  6. │   └─mask$eval_all_mutate(quo)
  7. ├─base::factor(...)
  8. └─dplyr::if_else(dms == 0, 1, dms)
  9.   └─dplyr:::replace_with(out, !condition, false, "`false`", "length of `condition`")
 10.     └─dplyr:::check_type(val, x, name, error_call = error_call)
 11.       └─rlang::abort(msg, call = error_call)
Execution halted

andrewbaxter439 commented 1 year ago

I found that same problem. I think the updated dplyr is stricter about types: the literal `1` is interpreted as a double, so it no longer accepts an integer column in the other branch of `if_else()`. I'll rewrite the code and push again tomorrow.
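For illustration, a minimal sketch of the type mismatch and two possible fixes (assuming `dms` is an integer column, as the backtrace suggests):

```r
library(dplyr)

df <- tibble(dms = c(0L, 2L, 3L))

# Fails in recent dplyr: `1` is a double, but `dms` is an integer
# mutate(df, dms = if_else(dms == 0, 1, dms))

# Either make the literal an integer...
mutate(df, dms = if_else(dms == 0, 1L, dms))

# ...or coerce the column to double first
mutate(df, dms = if_else(dms == 0, 1, as.numeric(dms)))
```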

andrewbaxter439 commented 1 year ago

I've rewritten the code in the script `R/optimise_predictions.R`, which should now run two grid-search models and output results to the `output` folder when run from a command prompt. I had been running these in RStudio (from the .Rproj file) to set the working directory for finding data references etc., but if it works from the command line then great :+1:

andrewbaxter439 commented 1 year ago

I should say, this is now most up to date in the andrewbaxter439:dev branch.

andrewbaxter439 commented 1 year ago

Right, another fault found and fixed. I'd experimented and it certainly seemed to be running serially rather than in parallel. Waste of time. I've added the below to optimise_predictions.R. L186 will run 6 R processes in tandem; change this to use more cores. The maximum number of folds/iterations to run would be 5 * 4^3 = 320.

Can you try out and see if it's running now?

https://github.com/andrewbaxter439/UC-DiD-analysis/blob/e46ecf5182f0861e5b3c65c30a59ee15019558b0/R/optimise_predictions.R#L182-L188
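For reference, one common way to set up that kind of parallel backend is via doParallel (a sketch only; the actual code at L182-188 of the linked script may differ):

```r
library(doParallel)

# Start 6 R worker processes, matching the count mentioned above
cl <- makePSOCKcluster(6)
registerDoParallel(cl)

# tune::tune_grid() picks up the registered backend automatically, e.g.:
# res <- tune_grid(workflow, resamples = folds, grid = grid)

stopCluster(cl)
```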

vkhodygo commented 1 year ago

A few comments regarding the dev branch:

vkhodygo commented 1 year ago

It seems to be working just fine, it took about 37 minutes to run. Is it good enough for you?

andrewbaxter439 commented 1 year ago

That's pretty awesome! I left my computer doing it serially overnight the other day and by mid-morning it had reached model 10 of 16, with the more complex models still to come! So 37 minutes is a much more manageable runtime. Can you send the output files over (in the `output` folder; they should be tiny)?

andrewbaxter439 commented 1 year ago
  • I get this warning:
Warning message:
Problem while computing `caring = fct_collapse(factor(lcr01), Yes = c("1", "2", "3"), No = 0)`.
ℹ Unknown levels in `f`: 2, 3 

That's a bit of lazy/nuisance coding at the moment: I'm trying to match up variables between data sources, and the set being tidied here is missing levels 2 and 3 from that variable. Will tidy the code to suppress the warning.
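One way to avoid the warning without suppressing it wholesale would be to collapse only the levels actually present (a sketch, assuming the column looks like the warning message suggests):

```r
library(forcats)

lcr01 <- c(0, 1, 1, 0)  # hypothetical tidied data missing levels 2 and 3
f <- factor(lcr01)

# Intersect the intended levels with those that actually exist,
# so fct_collapse() never sees unknown levels
yes_levels <- intersect(c("1", "2", "3"), levels(f))
caring <- fct_collapse(f, Yes = yes_levels, No = "0")
```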

  • It works with 120 though, meaning this is not a thread-vs.-core problem. Still, setting this number manually is a little inconvenient, especially when you switch between machines:

Will rewrite the script to detect the number of cores and then scale back a little.
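A possible pattern for that (note that `parallel::detectCores()` counts logical cores by default, so hyper-threaded machines report double the physical count):

```r
library(parallel)

# Use most of the machine but leave some headroom for the OS;
# detectCores(logical = FALSE) would count physical cores only
n_workers <- max(1, detectCores() - 2)
```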

  • I also noticed the whole thing is quite RAM greedy: about 600-700MB per process; nothing to write home about, but good to know;

I think/presume this is because it's currently running as separate parallel R sessions, each copying the code and data it needs. Might try different parallelisation methods to see if there's a simpler solution (parallelising computations but not data?).

  • some other sections of the code use 64 cores instead (doesn't detect hyper-threading?), or even ~20 (?), degrading the performance;

I'm aiming to parallelise over the tune_grid parts, but I realise data.table::fread may also be detecting and using parallel cores. If the two heavy-lifting tuning parts used all cores, that would be the key part to get working.

  • some magical numbers, use properly named variables for that;

You might have to point that out for me. Do you mean the factor labels that use numbers?

  • I should've asked you for time estimates before running it, to be fair. Can you add a progress bar or a few, so I can see how much time is left till the end?

Will look into this. Progress bars are added automatically in serial runs, but I'm not sure there's a way to do this with parallel runs (reporting progress across R sessions).
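If the backend were switched to the future framework, the progressr package can relay progress from parallel workers back to the main session. A sketch under those assumptions (`fit_model` is a hypothetical placeholder for the real tuning call):

```r
library(future)
library(future.apply)
library(progressr)

plan(multisession, workers = 6)

with_progress({
  p <- progressor(steps = 16)
  results <- future_lapply(seq_len(16), function(i) {
    p(sprintf("model %d", i))  # signals progress across R sessions
    # fit_model(i)             # placeholder for the real tuning call
    i
  })
})
```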

vkhodygo commented 1 year ago

> You might have to point that out for me

Well, I meant the number of cores used in particular. However, it's better to avoid anything that might cause misunderstanding. For example, I see `x = x * 0.3` as a pretty clear statement, but `x = x * 0x3fd3333333333333` doesn't seem to be. I'd add some context or documentation explaining this.

Other than that I have no questions at the moment. Are you happy with the wall time I provided?

andrewbaxter439 commented 1 year ago

Yeah, that time is excellent! I'm tidying up the code a little and working out what the next stages to iterate over would be. Will follow up. Cheers!

vkhodygo commented 1 year ago

Let me know when you're ready to launch your simulations full-scale.