NREL / Wattile

Deep Learning-based Forecasting of Building Energy Consumption
BSD 3-Clause "New" or "Revised" License

Janghyun's trained models fail to run with prediction on SkySpark: torch size mismatch #292

Closed stephen-frank closed 10 months ago

stephen-frank commented 11 months ago

With the latest build of Wattile (from main, built 11/27/2023), in SkySpark, I get the following error when attempting to run prediction from the models that Janghyun trained back in May:

axon::EvalErr: Func failed: pyEval(PySession py,Str stmt); args: (PyMgrSession,Str)
  sys::IOErr: Python failed: Error(s) in loading state_dict for LSTM_Model:
    size mismatch for lstm.weight_ih_l0: copying a param with shape torch.Size([100, 237]) from checkpoint, the shape in current model is torch.Size([100, 198]).
Traceback (most recent call last):
  File "/usr/src/app/hxpy/hxpy.py", line 67, in run
    self._exec(instr, local_vars)
  File "/usr/src/app/hxpy/hxpy.py", line 95, in _exec
    return exec(code, local_vars, local_vars)
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.9/site-packages/wattile/models/AlgoMainRNNBase.py", line 191, in predict
    return self.run_prediction(val_loader, val_df)
  File "/usr/local/lib/python3.9/site-packages/wattile/models/alfa_model.py", line 801, in run_prediction
    model, _, _ = load_model(self.configs)
  File "/usr/local/lib/python3.9/site-packages/wattile/models/utils.py", line 68, in load_model
    model.load_state_dict(checkpoint["model_state_dict"])
  File "/usr/local/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1671, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LSTM_Model:
    size mismatch for lstm.weight_ih_l0: copying a param with shape torch.Size([100, 237]) from checkpoint, the shape in current model is torch.Size([100, 198]).
 [nrelWattile::wattilePythonModelPredict:146]
=== Axon Trace ===
  wattilePythonModelPredict (nrelWattile::wattilePythonModelPredict:146)
  wattilePythonTask (nrelWattile::wattilePythonTask:29)

I was testing with a freshly loaded FTLB_FTLBCHWMeterCHWEnergyRate_r1 model, obtained from the v10_small directory shared by JangHyun via OneDrive. I did not have this issue with previous builds of Wattile.

Is this reproducible outside of SkySpark? I can share both the Docker image I am using and the model if needed.
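
For reference, this class of error is straightforward to reproduce outside Wattile entirely. A minimal sketch with a bare nn.LSTM (not Wattile's LSTM_Model), using the 237/198 widths from the traceback and a hidden size of 25 so the shapes match the message above:

```python
import torch.nn as nn

# Checkpoint saved from an LSTM built for 237 input features...
trained = nn.LSTM(input_size=237, hidden_size=25, batch_first=True)
# ...loaded into an LSTM built for only 198 input features.
current = nn.LSTM(input_size=198, hidden_size=25, batch_first=True)

try:
    current.load_state_dict(trained.state_dict())
except RuntimeError as err:
    # weight_ih_l0 has shape (4 * hidden_size, input_size), so this prints a
    # size mismatch of torch.Size([100, 237]) vs torch.Size([100, 198]).
    print(err)
```

In other words, the checkpoint on disk was trained against 237 input features while the model object constructed at prediction time expects only 198; the rest of this thread is about where that 198 comes from.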

stephen-frank commented 11 months ago

Test command used on SkySpark development server (3lv14skyspark01.nrel.gov):

read(wattileTask)
  .taskSend({
    action: "predict",
    model: read(wattileModel and dis=="FTLB_FTLBCHWMeterCHWEnergyRate_r1"),
    span: 2023-01-01
  })
  .futureGet
stephen-frank commented 11 months ago

Correction, this was from a Wattile image built using the nrelWattileExt Dockerfile from Wattile main about 7 weeks ago. (I copied it to SkySpark 11/27, but the actual build was from 7 weeks ago. I don't think anything has significantly changed since then.) I can provide the image if needed for troubleshooting.

haneslinger commented 11 months ago

I noticed the configs don't specify predictors, which means it'll use all the predictors. Looking at predictors_target_config.json, it looks like 13 predictors were used; was that how many were passed?

stephen-frank commented 11 months ago

Yeah, SkySpark is set up to use the set of predictors in predictors_target_config.json. (Or, more specifically, when it first imports the model it maps those predictors into a SkySpark record and stores them for later use.) I confirmed that SkySpark is set up to pass 13 predictor columns for this model. Which makes sense, as I haven't changed the predictors since I started testing these models. What has changed is the Wattile docker image.

haneslinger commented 11 months ago

Hm, after some testing on my and @smithcommajoseph's part, we were only able to reproduce this error via column mismatches... are you sure the columns are all named the same? @JanghyunJK, any ideas?

stephen-frank commented 11 months ago

Interesting. Let me double-check the column names again in SkySpark. They should not have changed but it is possible I missed something.

JanghyunJK commented 11 months ago

Definitely the first time seeing that type of error. To me, checking (1) what those numbers 237/198 represent and (2) which one is right based on configs.json would be the first things I'd try.

JanghyunJK commented 11 months ago

some quick math I tried. maybe this is showing what 237 and 198 are:

| data type | count of features (13 predictors) | count of features (0 predictors) |
| --- | --- | --- |
| number of predictors (raw) | 13 | 0 |
| number of predictors + time-based features | 79 | 66 |
| number of predictors + time-based features + stat-based features | 237 | 198 |
| number of predictors + time-based features + stat-based features + timelags | 5925 | 4950 |
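
The arithmetic behind those counts can be sanity-checked in a few lines; the multipliers below (66 time-based columns regardless of predictors, 3 rolling statistics per column, 25 timelags) are inferred from the table rather than taken from the Wattile source:

```python
# Multipliers inferred from the table above -- assumptions, not Wattile internals.
TIME_FEATURES = 66       # time-based columns added regardless of predictor count
STATS_PER_COLUMN = 3     # stat-based features appear to triple the column count
LAGS = 25                # number of timelags

for n_predictors in (13, 0):
    with_time = n_predictors + TIME_FEATURES     # 79 or 66
    with_stats = with_time * STATS_PER_COLUMN    # 237 or 198
    with_lags = with_stats * LAGS                # 5925 or 4950
    print(n_predictors, with_time, with_stats, with_lags)
```
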
JanghyunJK commented 11 months ago

assuming "current model" means trained model spec, maybe during training

stephen-frank commented 11 months ago

Hmm. I would think copying a param with shape torch.Size([100, 237]) from checkpoint means what it is loading from the disk, meaning the trained model. I am going to try to export from Python the exact data frame that SkySpark is passing, which should be helpful for troubleshooting. But at the moment I am again running into a permissions error, when I shouldn't be. >.>

stephen-frank commented 11 months ago

I got the permissions working again. I had SkySpark export the exact data frame that it is passing into the model. I sent this to you three via email.

JanghyunJK commented 11 months ago

@haneslinger @smithcommajoseph would it be possible for either of you to pick this up and test Steve's data on the Wattile workflow? I'm really struggling to find time for this.

smithcommajoseph commented 11 months ago

I have time to poke about and explore this issue and will reach out to folks with questions/data as I encounter them.

haneslinger commented 11 months ago

Okay, here's the skinny:

if you give prep_for_rnn a dataframe with 0 predictors, it returns a dataframe with 198 columns. It's just the time columns... So I think you aren't passing in any predictors.

questions and lessons:

  • 198 feels like SO many time based columns. @JanghyunJK @smithcommajoseph, thoughts?

JanghyunJK commented 11 months ago

  198 feels like SO many time based columns. @JanghyunJK @smithcommajoseph, thoughts?

that confirms my math above, thanks! Steve sent out an email including sample data, so it'd be nice if we can reproduce the 198 from Wattile.

stephen-frank commented 11 months ago

if you give prep_for_rnn a dataframe with 0 predictors, it returns a dataframe with 198 columns. It's just the time columns... So I think you aren't passing in any predictors.

This is helpful. Except that SkySpark is passing in 13 predictor columns. So the current puzzle is why Wattile does not like the predictor columns I am passing in from SkySpark. Some kind of name mismatch? I hope the sample data can illuminate that?

This does also get back at something I think we've talked about but maybe never created an issue for: when a model is first initialized prior to training, it is valid to have no predictors in configs.json; the model will then train on all available predictors. However, any time after that first initialization, the model should always use the same predictor columns it was first initialized with. That is, once predictors_target_config.json gets created, it should be used thereafter to lock in the model's exact set of predictors and target by column name.
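
A minimal sketch of that locking behavior, assuming a hypothetical helper name and a predictors_target_config.json layout with a top-level "predictors" list of column names (the real schema may differ):

```python
import json
import pathlib

def enforce_locked_predictors(exp_dir, predictor_df):
    """Hypothetical helper: restrict incoming data to the predictor columns
    recorded when the model was first initialized. The JSON layout assumed
    here (a top-level 'predictors' list) is for illustration only."""
    config_path = pathlib.Path(exp_dir) / "predictors_target_config.json"
    locked = json.loads(config_path.read_text())
    expected = list(locked["predictors"])

    missing = [col for col in expected if col not in predictor_df.columns]
    if missing:
        raise ValueError(f"Input data is missing locked predictor columns: {missing}")

    # Drop extras and fix ordering so the model always sees the exact predictor
    # set (and order) it was trained with.
    return predictor_df[expected]
```
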

haneslinger commented 11 months ago

I tested it with the pickle you emailed and it worked.

https://github.com/NREL/Wattile/compare/jos-ftlb-tests

stephen-frank commented 11 months ago

I tested it with the pickle you emailed and it worked.

Hmm! That is very strange.

smithcommajoseph commented 11 months ago

Similarly, I was able to test locally against the emailed CSV and was able to successfully obtain predictions from the model.

@stephen-frank Re: 'That is, once predictors_target_config.json gets created, it should be used thereafter to lock in the model's exact set of predictors and target by column name.' @haneslinger and I discussed this as well. Sounds like we're all in agreement around this, and unless there was a technical challenge that I missed, this looked pretty easy to implement. Happy to make this happen.

@haneslinger happy to work on (and/or pair through) the code-related tasks identified in this thread. I'll reach out separately to discuss.

stephen-frank commented 11 months ago

Ok, I'm rather at a loss.

So not only is the val_df that got created inside SkySpark's Docker image very different from your test version, but also it has dimensions that don't match either 198 (199?) or 237 (238?).

Here's my updated test code:

// TROUBLESHOOTING: Export predictor_data_frame
session
  .pyExec("predictor_data_frame.to_csv('/io/wattile/test_predictor_data.csv')")
  .pyExec("predictor_data_frame.to_pickle('/io/wattile/test_predictor_data.pickle')")

  // Prep data and run prediction
  session
    .pyExec("_, val_df = prep_for_rnn(configs, predictor_data_frame)")
    //.pyExec("results = model.predict(val_df)") // TROUBLESHOOTING: Disable

// TROUBLESHOOTING: Export val_df and return its size
return session
  .pyExec("val_df.to_csv('/io/wattile/test_val_df.csv')")
  .pyExec("val_df.to_pickle('/io/wattile/test_val_df.pickle')")
  .pyEval("val_df.shape")

test_val_df.csv

In addition, in SkySpark if I run pyEval("configs['input_dim']") I get 198, so that doesn't match either. 🤔
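
With exports now available on both sides, one way to narrow down the discrepancy is to diff the two val_df column sets offline. A sketch assuming the pickle produced by the snippet above plus a notebook-side export saved under a hypothetical file name:

```python
import pandas as pd

# Export produced inside the SkySpark container by the troubleshooting code above
skyspark_val_df = pd.read_pickle("test_val_df.pickle")
# Export from running prep_for_rnn on the same data in predict.ipynb
# (file name is hypothetical)
notebook_val_df = pd.read_pickle("notebook_val_df.pickle")

print("SkySpark:", skyspark_val_df.shape, "notebook:", notebook_val_df.shape)

only_skyspark = sorted(set(skyspark_val_df.columns) - set(notebook_val_df.columns))
only_notebook = sorted(set(notebook_val_df.columns) - set(skyspark_val_df.columns))
print("first columns only in the SkySpark val_df:", only_skyspark[:10])
print("first columns only in the notebook val_df:", only_notebook[:10])
```
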

stephen-frank commented 11 months ago

SkySpark does update configs.json to prepare the model for prediction, but otherwise doesn't mess with it. I also have it modify predictors_target_config.json to change the IDs of the SkySpark points, because for testing I need them to be different from the original points. (The column names, on the other hand, are not modified.)

Here are the modified versions of these files that SkySpark is using, in case it makes any difference.

I checked configs.json against the original and confirmed only the "use_case" and "exp_dir" keys differ... both of which I believe I have to modify to get the model to work inside SkySpark Dockerland.

EDIT: Of note, the "input_dim" in configs.json is still 237... so where and how does it change to 198 when I'm trying to call predict?

haneslinger commented 11 months ago

When I export val_df from within SkySpark, though, I get dimensions of 95 x 4951, which is not the same at all.

prep_for_rnn spits out a dataframe shaped (samples, lag*input_dim). input_dim is the number of columns post time processing and rolling, but pre lags. prep_for_rnn also sets input_dim in the config.

4951 minus the target column is 4950.

4950 divided by 198 "input_dim" is 25, the number of lags.
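
As a quick check of that arithmetic against the shape reported from SkySpark (assuming one target column and the (samples, lag*input_dim) layout described above):

```python
samples, total_cols = 95, 4951    # val_df shape reported from inside SkySpark

input_dim = 198                   # value prep_for_rnn wrote back into configs
feature_cols = total_cols - 1     # drop the single target column -> 4950
lags = feature_cols // input_dim  # 4950 / 198 = 25 lags
assert lags * input_dim == feature_cols
print(samples, feature_cols, lags)
```
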

haneslinger commented 11 months ago

When you ran prep_for_rnn with the data I provided, you seem to get a val_df with 238 columns, of which the first few columns are the features generated from the 13 predictor columns. (I noticed this is one greater than the needed 237; maybe it is the ts column that is the "extra"?)

this is the correct number of columns. The one extra column is the dummy target column.

stephen-frank commented 11 months ago

4951 minus the target column is 4950.

4950 divided by 198 "input_dim" is 25, the number of lags.

Ok, I'm following, but your example val_df (that worked with predict!) did not have 4951 columns. As far as I can tell all the steps in predict.ipynb are exactly the same as what SkySpark is doing, but the results are not the same.

stephen-frank commented 11 months ago

I have also determined that when I run prep_for_rnn, configs "input_dim" is updated from 237 to 198! This also differs from your test. Test code:

// TROUBLESHOOTING: What is input_dim here?
// With FTLB_FTLBCHWMeterCHWEnergyRate_r1: 237
//return session.pyEval("configs['input_dim']")

  // Prep data and run prediction
  session
    .pyExec("_, val_df = prep_for_rnn(configs, predictor_data_frame)")
    //.pyExec("results = model.predict(val_df)") // TROUBLESHOOTING: Disable

// TROUBLESHOOTING: What is input_dim here?
// With FTLB_FTLBCHWMeterCHWEnergyRate_r1: 198 (!)
// Conclusion: prep_for_rnn modifies input_dim?
return session.pyEval("configs['input_dim']")
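
One way to confirm exactly which keys prep_for_rnn rewrites is to snapshot configs before the call and diff it afterwards. A sketch meant to run in the same Python session where configs, predictor_data_frame, and prep_for_rnn are already loaded:

```python
import copy

before = copy.deepcopy(configs)
_, val_df = prep_for_rnn(configs, predictor_data_frame)

changed = {key: (before.get(key), configs[key])
           for key in configs
           if before.get(key) != configs[key]}
# With this model the diff should include at least {'input_dim': (237, 198)}.
print(changed)
```
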
haneslinger commented 11 months ago

I have also determined that when I run prep_for_rnn, configs "input_dim" is updated from 237 to 198! This also differs from your test. Test code:

What do you mean it differs from my test? I feel like this is what I said was happening:

- prep_for_rnn spits out a dataframe shaped (samples, lag*input_dim). input_dim is the number of columns post time processing and rolling, but pre lags. prep_for_rnn also sets input_dim in the config.
- if you give prep_for_rnn a dataframe with 13 predictors, it returns a dataframe with 237 columns.
- if you give prep_for_rnn a dataframe with 0 predictors, it returns a dataframe with 198 columns. It's just the time columns.
stephen-frank commented 11 months ago

What I mean is I'm running the same command on the same data with the same configs.json, and getting different results:

  1. In predict.ipynb, the val_df returned for the predictor data frame has 238 columns. In SkySpark, the val_df data frame returned has 4951 columns.
  2. In predict.ipynb, after running prep_for_rnn the reported "input_dim" is still 237, as shown below. In SkySpark, after running prep_for_rnn the reported "input_dim" is 198.

[screenshot: predict.ipynb output showing input_dim still reported as 237 after prep_for_rnn]

This is why I'm at a loss. Your math certainly adds up, but I'm not giving it a data frame with 0 predictors. I'm giving it the same data as used in your test and getting a different result.

I may just need to do a screen share of this next Monday, or sooner if you and Jos want to jump on a call this week.

haneslinger commented 11 months ago

Ok, I'm following, but your example val_df (that worked with predict!) did not have 4951 columns. As far as I can tell all the steps in predict.ipynb are exactly the same as what SkySpark is doing, but the results are not the same.

That's a good point. Revisiting that notebook, the results are all nan, so maybe it didn't work? Let me investigate

stephen-frank commented 11 months ago

That's a good point. Revisiting that notebook, the results are all nan, so maybe it didn't work? Let me investigate

You may need to adjust configs["data_input"]["start_time"] and configs["data_input"]["end_time"].
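
If it helps, a sketch of that adjustment, assuming the exported data frame has a datetime index and that these keys accept ISO-8601 strings (both are assumptions; the real expected format may differ):

```python
import pandas as pd

# configs here is the dict already loaded from the model's configs.json
predictor_data_frame = pd.read_pickle("test_predictor_data.pickle")

# Widen the prediction window to cover the exported data.
configs["data_input"]["start_time"] = predictor_data_frame.index.min().isoformat()
configs["data_input"]["end_time"] = predictor_data_frame.index.max().isoformat()
```
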

haneslinger commented 11 months ago

do you know what to set them to?

haneslinger commented 11 months ago

This is why I'm at a loss. Your math certainly adds up, but I'm not giving it a data frame with 0 predictors. I'm giving it the same data as used in your test and getting a different result.

I believe you, I'm just running out of ideas.

I may just need to do a screen share of this next Monday, or sooner if you and Jos want to jump on a call this week.

I think that would be smart

stephen-frank commented 11 months ago

I think it'll have to be Friday, if you two are available. Otherwise we'll just look at it on Monday.

haneslinger commented 11 months ago

Friday works

haneslinger commented 11 months ago

test_val_df.csv does not contain the non-time-based predictors, in contrast to the val_df built from running prep_for_rnn on test_predictor_data.pickle.

stephen-frank commented 10 months ago

Update from today's troubleshooting session: Turns out I was running an old Wattile version due to a bug with the docker build. This is addressed with https://github.com/NREL/nrelWattileExt/pull/46. I will test whether the problem goes away when I have the right Wattile version, which I expect.

stephen-frank commented 10 months ago

Success! Now that I'm on the correct version of Wattile, this problem is magically gone, and probably reflected some bug we fixed a year ago. Closing this issue.


JanghyunJK commented 10 months ago

[like] Kim, Janghyun reacted to your message.