JGCRI / gcamdata

The GCAM data system
https://jgcri.github.io/gcamdata/
Other
44 stars 26 forks source link

Speed up filters #1211

Closed russellhz closed 2 years ago

russellhz commented 2 years ago

Originally based on #1068

However, for clarity we will generally not be replacing filter(x, z %in% y$z) with semi_join(x, y, by = "z").

This PR includes:

  1. Replacing filter(x, !(z %in% y$z)) with anti_join(x, y, by = "z"),
  2. Significant speed improvement to zchunk_L252.MACC.R
  3. Removing unnecessary filtering in zchunk_L224.heat.R
  4. General clean up, such as removing consecutive filter() calls.
codecov[bot] commented 2 years ago

Codecov Report

Merging #1211 (d0aa3c6) into develop (624061d) will decrease coverage by 0.02%. The diff coverage is 80.00%.

@@             Coverage Diff             @@
##           develop    #1211      +/-   ##
===========================================
- Coverage    97.08%   97.05%   -0.03%     
===========================================
  Files          429      429              
  Lines        69405    69396       -9     
===========================================
- Hits         67380    67352      -28     
- Misses        2025     2044      +19     
Impacted Files Coverage Δ
R/utils.R 85.15% <0.00%> (-6.06%) :arrow_down:
R/zchunk_L102.nonco2_ceds_R_S_Y.R 30.32% <0.00%> (ø)
R/zchunk_LA100.IEA_downscale_ctry.R 17.61% <0.00%> (ø)
R/zchunk_batch_elec_segments_water_USA_xml.R 100.00% <ø> (ø)
R/zchunk_L224.heat.R 89.32% <76.47%> (-0.12%) :arrow_down:
R/driver.R 72.00% <100.00%> (ø)
R/zchunk_L210.resources.R 99.84% <100.00%> (ø)
R/zchunk_L223.electricity.R 94.89% <100.00%> (ø)
R/zchunk_L2233.electricity_water.R 99.88% <100.00%> (ø)
R/zchunk_L2321.cement_USA.R 99.71% <100.00%> (ø)
... and 17 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

russellhz commented 2 years ago

@pralitp I didn't want to forbid it because there are some circumstances where it's not faster, or would be a pain to avoid. For example, this line: filter(iso %in% FAO_PRODSTAT_DOWNSCALED$iso & !iso %in% LDS_ctry_SHARES$iso).

Or this line filter(!(region %in% A_regions$region[A_regions$tradbio_region == 0] & resource == "traditional biomass"))

I have written regex pattern that catches these instances, but it's hard to avoid catching instances like this as well: mutate(Non.CO2 = if_else(Non.CO2 == "SO2" & !region %in% states_subregions$state, SO2_name, Non.CO2)) %>%.

We can check for the word filter in the line, but that would skip any instances where it would be on the second line. Which might be ok because usually in that case, there be multiple conditions to the filter() command, and therefore it's probably faster to not use anti_join()

All of this to say, that I wrote a new screen_undesired() function that we can use to check this without including it in the tests.

Also, generally, the most important slow-downs I've found is unnecessary grouped operations, which I don't think we'll be able to screen for.