etsap-TIMES / xl2times

Open source tool to convert TIMES models specified in Excel
https://xl2times.readthedocs.io/
MIT License

faster apply transform tables #2 #215

Closed SamRWest closed 6 months ago

SamRWest commented 6 months ago

Second attempt at speeding up apply_transform_tables. This is just the minimum from #213 needed to apply the late explode method.

Overall Ireland benchmark speedup (local machine):

- Explode=False: Ran Ireland in 10.94s. 90.0% (36629 correct, 3415 additional).
- Explode=True: Ran Ireland in 14.51s. 90.0% (36614 correct, 3415 additional).
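The "late explode" idea can be sketched in pandas (hypothetical data and column names, not the actual xl2times implementation): instead of expanding comma-separated cells into one row per value up front and then filtering, filter the compact frame first and only explode the rows that survive.

```python
import pandas as pd

# Toy frame: comma-separated process lists, as a TFM table might hold them.
df = pd.DataFrame({
    "process": ["P1,P2,P3", "P4", "P5,P6"],
    "region": ["IE", "IE", "UK"],
})

# Eager: explode first, then filter -- the filter scans every exploded row.
eager = (
    df.assign(process=df["process"].str.split(","))
      .explode("process")
)
eager = eager[eager["region"] == "IE"]

# Late: filter the compact frame first, then explode only the survivors.
late = df[df["region"] == "IE"]
late = (
    late.assign(process=late["process"].str.split(","))
        .explode("process")
)

# Both orderings produce the same rows; the late version does less work.
assert eager.reset_index(drop=True).equals(late.reset_index(drop=True))
```

The win grows with the share of rows the filter discards, since those rows are never exploded at all.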

SamRWest commented 6 months ago

Heaps faster, but still getting regressions on Ireland :(

Any tips on how to debug regressions like this @olejandro ?

I tried the main-vs-branch diff approach in the readme, but gave up when the log file ended up approaching 1Gb, which didn't seem right..?

olejandro commented 6 months ago

Cool! :-) Let me check...

@siddharth-krishna is the one behind the main-vs-branch diff approach in the readme, so he'd be the right person to comment on it.

I normally just dump the dataframe with the affected data from the transform and try to find out what's wrong.
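A minimal sketch of that dump-and-compare workflow (the helper name and output directory are illustrative, not part of xl2times): write each intermediate table to a deterministically sorted CSV so that runs on two branches can be compared with an ordinary `diff`.

```python
import os

import pandas as pd


def dump_table(df: pd.DataFrame, name: str, out_dir: str = "debug_dumps") -> None:
    """Write an intermediate table to CSV so two runs can be diffed.

    Hypothetical helper; sorts rows and columns so the output is
    deterministic and diff-friendly.
    """
    os.makedirs(out_dir, exist_ok=True)
    df = df.sort_index(axis=1)  # stable column order
    df = df.sort_values(by=list(df.columns)).reset_index(drop=True)  # stable row order
    df.to_csv(os.path.join(out_dir, f"{name}.csv"), index=False)
```

Call it just before and after the suspect transform on each branch, then `diff -r main_dumps/ branch_dumps/` to locate the first table that diverges.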

siddharth-krishna commented 6 months ago

> I tried the main-vs-branch diff approach in the readme, but gave up when the log file ended up approaching 1Gb, which didn't seem right..?

Yeah, this approach doesn't scale well because it dumps every table to the output after every transform, which results in a gigantic log file for a real model like Ireland.

I had used a slightly different approach to debug a nondeterminism bug we had on Ireland a while ago. The idea was to save all the intermediate tables to a pickle file (one file after each transform) in the first run, and in the second run (if the pickle files already existed) to check if the intermediate tables from the run were the same as the ones in the file. If there was any difference, it drops the user into an ipdb console where you can debug and figure out exactly what was different. You can probably adapt this approach to debug regressions: create the pickle files in the first run on the main branch, and do the second run that does the comparisons on your PR branch. Here's the code I used (might need a bit of merging to apply onto the latest main):

https://github.com/etsap-TIMES/xl2times/pull/67/commits/a265bde2d0d770c949c17d2076b0a838cbe06495#diff-e894e8354d6ecd5ffe34b10e5e1df17cfeef656e707476eb460d76931c1907e7R173-R180

If this is helpful we can try to clean up the code and add it to the main branch for later use as well. Good luck!
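The pickle-and-compare scheme described above can be re-sketched as follows (illustrative names only, not the code in the linked commit): the first run caches each post-transform table set; the second run loads the cache and breaks into a debugger at the first divergence.

```python
import os
import pickle

import pandas as pd


def check_tables(tables: dict[str, pd.DataFrame], step: int,
                 cache_dir: str = "pickle_cache") -> None:
    """First run: pickle the tables after each transform.

    Second run: compare against the cached pickles and drop into a
    debugger on any difference. Hypothetical sketch of the approach.
    """
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"step_{step:03d}.pkl")
    if not os.path.exists(path):
        # First run (e.g. on main): record the expected state.
        with open(path, "wb") as f:
            pickle.dump(tables, f)
        return
    # Second run (e.g. on the PR branch): compare against the recording.
    with open(path, "rb") as f:
        expected = pickle.load(f)
    for name, df in tables.items():
        if name not in expected or not df.equals(expected[name]):
            print(f"Table {name!r} differs at step {step}")
            breakpoint()  # or ipdb.set_trace() for a richer console
```

Calling this after every transform pinpoints the first transform whose output changed, rather than diffing the final output.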

SamRWest commented 6 months ago

> > I tried the main-vs-branch diff approach in the readme, but gave up when the log file ended up approaching 1Gb, which didn't seem right..?
>
> Yeah, this approach doesn't scale well because it dumps every table to the output after every transform, which results in a gigantic log file for a real model like Ireland.
>
> I had used a slightly different approach to debug a nondeterminism bug we had on Ireland a while ago. The idea was to save all the intermediate tables to a pickle file (one file after each transform) in the first run, and in the second run (if the pickle files already existed) to check if the intermediate tables from the run were the same as the ones in the file. If there was any difference, it drops the user into an ipdb console where you can debug and figure out exactly what was different. You can probably adapt this approach to debug regressions: create the pickle files in the first run on the main branch, and do the second run that does the comparisons on your PR branch. Here's the code I used (might need a bit of merging to apply onto the latest main):
>
> a265bde#diff-e894e8354d6ecd5ffe34b10e5e1df17cfeef656e707476eb460d76931c1907e7R173-R180
>
> If this is helpful we can try to clean up the code and add it to the main branch for later use as well. Good luck!

Thanks, I ended up going down a similar route - I've added some utils to save the state (model + tables) and then diff the main vs branch state. Pretty easy, and it spotted the symptom pretty quickly. Though @olejandro has already spotted the cause for me :) Utils are here if you're interested.
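A state-diff utility along those lines might look like this (a sketch of the idea, not the actual utils linked above): load two pickled `{name: DataFrame}` states, one from main and one from the branch, and report which tables differ and by how many rows, using an outer merge with an indicator column.

```python
import pickle

import pandas as pd


def diff_states(path_a: str, path_b: str) -> None:
    """Compare two pickled {name: DataFrame} states (e.g. main vs branch).

    Hypothetical sketch; assumes differing tables share the same columns.
    """
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = pickle.load(fa), pickle.load(fb)
    for name in sorted(set(a) | set(b)):
        if name not in a or name not in b:
            print(f"{name}: only in {'B' if name not in a else 'A'}")
            continue
        da, db = a[name], b[name]
        if da.equals(db):
            continue
        # Outer merge on all shared columns; the indicator column marks
        # rows present on only one side.
        merged = da.merge(db, how="outer", indicator=True)
        only_a = (merged["_merge"] == "left_only").sum()
        only_b = (merged["_merge"] == "right_only").sum()
        print(f"{name}: {only_a} rows only in A, {only_b} rows only in B")
```

Pointing this at the saved main and branch states narrows a regression to specific tables and row counts before you open a debugger.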