Test command used on SkySpark development server (3lv14skyspark01.nrel.gov):
read(wattileTask)
.taskSend({
action: "predict",
model: read(wattileModel and dis=="FTLB_FTLBCHWMeterCHWEnergyRate_r1"),
span: 2023-01-01
})
.futureGet
Correction: this was from a Wattile image built using the nrelWattileExt Dockerfile, built from Wattile main about 7 weeks ago. (I copied it to SkySpark 11/27, but the actual build was from 7 weeks ago. I don't think anything has significantly changed since then.) I can provide the image if needed for troubleshooting.
I noticed the configs don't specify predictors, which means it'll use all the predictors. Looking at predictors_target_config.json, it looks like 13 predictors were used; was that how many were passed?
Yeah, SkySpark is set up to use the set of predictors in predictors_target_config.json. (Or, more specifically, when it first imports the model it maps those predictors into a SkySpark record and stores them for later use.) I confirmed that SkySpark is set up to pass 13 predictor columns for this model. Which makes sense, as I haven't changed the predictors since I started testing these models. What has changed is the Wattile docker image.
hm, after some testing on my and @smithcommajoseph's part, we were able to reproduce this error only via column mismatches... are you sure they are all named the same? @JanghyunJK, any ideas?
Interesting. Let me double-check the column names again in SkySpark. They should not have changed but it is possible I missed something.
definitely first time seeing that type of error. to me, (1) figuring out what those numbers 237/198 represent and (2) checking which one is right based on configs.json would be something I'd try.
some quick math I tried. maybe this is showing what 237 and 198 are:
data type | count of features (13 predictors) | count of features (0 predictors) |
---|---|---|
number of predictors raw | 13 | 0 |
number of predictors + time-based features | 79 | 66 |
number of predictors + time-based features + stat-based features | 237 | 198 |
number of predictors + time-based features + stat-based features + timelags | 5925 | 4950 |
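For reference, a small Python sketch that reproduces the counts in the table above. The 66 time-based features, the 3x expansion from stat-based features, and the 25 lags are values inferred from the math in this thread, not read from the Wattile source:

# reproduce the feature counts in the table above (values inferred from this thread)
time_features = 66      # time-based features added regardless of predictors
stat_multiplier = 3     # each feature appears to expand 3x via stat-based features
lags = 25               # number of timelags

for predictors in (13, 0):
    base = predictors + time_features       # 79 / 66
    with_stats = base * stat_multiplier     # 237 / 198
    with_lags = with_stats * lags           # 5925 / 4950
    print(predictors, base, with_stats, with_lags)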
assuming "current model" means trained model spec, maybe during training
Hmm. I would think "copying a param with shape torch.Size([100, 237]) from checkpoint" means what it is loading from the disk, meaning the trained model. I am going to try to export from Python the exact data frame that SkySpark is passing, which should be helpful for troubleshooting. But at the moment I am again running into a permissions error, when I shouldn't be. >.>
I got the permissions working again. I had SkySpark export the exact data frame that it is passing into the model. I sent this to you three via email.
@haneslinger @smithcommajoseph would it be possible for either of you to pick this up and test Steve's data on the Wattile workflow? I'm really struggling to find time for this.
I have time to poke about and explore this issue and will reach out to folks with questions/data as I encounter them.
Okay, here's the skinny:
- prep_for_rnn spits out a dataframe shaped (samples, lag*input_dim). input_dim is the number of columns post time processing and rolling, but pre lags. prep_for_rnn also sets input_dim in the config.
- if you give prep_for_rnn a dataframe with 13 predictors, it returns a dataframe with 237 columns.
- if you give prep_for_rnn a dataframe with 0 predictors, it returns a dataframe with 198 columns. It's just the time columns.
- input_dim is later used when initing the model. First input_dim is used to build the model structure, then it's populated by the weights of the trained model. If the model structure and the weights are different shapes, you get the error you're getting.

So I think you aren't passing in any predictors.
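For illustration, a minimal standalone sketch (not Wattile's actual model code) of how that shape disagreement surfaces as the torch error in the issue title: weights saved with input_dim = 237 cannot be loaded into a model structure built with input_dim = 198.

import torch
import torch.nn as nn

# minimal sketch, not Wattile's model: save weights built for 237 inputs,
# then try to load them into a layer built for 198 inputs
trained = nn.Linear(237, 100)                  # weight shape: [100, 237]
torch.save(trained.state_dict(), "checkpoint.pt")

current = nn.Linear(198, 100)                  # weight shape: [100, 198]
current.load_state_dict(torch.load("checkpoint.pt"))
# RuntimeError: size mismatch for weight: copying a param with shape
# torch.Size([100, 237]) from checkpoint, the shape in current model is
# torch.Size([100, 198]).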
questions and lessons:
- maybe we should validate the incoming data against predictors_target_config.json to help catch some of those errors.
- prep_for_rnn should not be modifying the config at all. This has been on my fix list for a while and this is why.
- 198 feels like SO many time-based columns. @JanghyunJK @smithcommajoseph, thoughts?
that confirms my math above, thanks! Steve sent out an email including sample data, so it'd be nice if we can reproduce the 198 from Wattile.
if you give prep_for_rnn a dataframe with 0 predictors, it returns a dataframe with 198 columns. It's just the time columns... So I think you aren't passing in any predictors.
This is helpful. Except that SkySpark is passing in 13 predictor columns. So the current puzzle is why Wattile does not like the predictor columns I am passing in from SkySpark. Some kind of name mismatch? I hope the sample data can illuminate that?
This does also get back at something I think we've talked about but maybe never created an issue for: when a model is first initialized prior to training, it is valid to have no predictors in configs.json; the model will then train on all available predictors. However, any time after that first initialization, the model should always use the same predictor columns as it was first initialized with. That is, once predictors_target_config.json gets created, it should be used thereafter to lock in the model's exact set of predictors and target by column name.
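A rough sketch of what that lock-in could look like. The "predictors" key and the helper name are hypothetical, since the exact structure of predictors_target_config.json isn't spelled out in this thread:

import json

# hypothetical sketch: once predictors_target_config.json exists, validate
# incoming predictor columns against it before running prediction
def check_predictor_columns(predictor_df, config_path="predictors_target_config.json"):
    with open(config_path) as f:
        saved = json.load(f)
    expected = set(saved["predictors"])        # "predictors" key name is assumed
    actual = set(predictor_df.columns)
    missing, extra = expected - actual, actual - expected
    if missing or extra:
        raise ValueError(
            f"predictor columns do not match the trained model: "
            f"missing={sorted(missing)}, unexpected={sorted(extra)}"
        )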
I tested it with the pickle you emailed and it worked.
I tested it with the pickle you emailed and it worked.
Hmm! That is very strange.
Similarly, I tested locally against the emailed CSV and was able to successfully obtain predictions from the model.
@stephen-frank Re: 'That is, once predictors_target_config.json gets created, it should be used thereafter to lock in the model's exact set of predictors and target by column name.' @haneslinger and I discussed this as well. Sounds like we're all in agreement around this, and unless there was a technical challenge that I missed, this looks pretty easy to implement. Happy to make this happen.
@haneslinger happy to work on (and/or pair through) the code-related tasks identified in this thread. I'll reach out separately to discuss.
Ok, I'm rather at a loss.
When you ran prep_for_rnn with the data I provided, you seem to get a val_df with 238 columns, of which the first few columns are the features generated from the 13 predictor columns. (I noticed this is one greater than the needed 237; maybe it is the ts column that is the "extra"?)
When I export val_df from within SkySpark, though, I get dimensions of 95 x 4951, which is not the same at all. I've attached the val_df in CSV format here. It seems to have all the time lags added, which the test case you have in predict.ipynb doesn't.
So not only is the val_df that got created inside SkySpark's Docker image very different from your test version, but also it has dimensions that don't match either 198 (199?) or 237 (238?).
Here's my updated test code:
// TROUBLESHOOTING: Export predictor_data_frame
session
.pyExec("predictor_data_frame.to_csv('/io/wattile/test_predictor_data.csv')")
.pyExec("predictor_data_frame.to_pickle('/io/wattile/test_predictor_data.pickle')")
// Prep data and run prediction
session
.pyExec("_, val_df = prep_for_rnn(configs, predictor_data_frame)")
//.pyExec("results = model.predict(val_df)") // TROUBLESHOOTING: Disable
// TROUBLESHOOTING: Export val_df and return its size
return session
.pyExec("val_df.to_csv('/io/wattile/test_val_df.csv')")
.pyExec("val_df.to_pickle('/io/wattile/test_val_df.pickle')")
.pyEval("val_df.shape")
In addition, in SkySpark if I run pyEval("configs['input_dim']") I get 198, so that doesn't match either. 🤔
SkySpark does update configs.json to prepare the model for prediction, but otherwise doesn't mess with it. I also have it modify predictors_target_config.json to change the ids of the SkySpark points, because for testing I need them to be different from the original points. (The column names, on the other hand, are not modified.)
Here are the modified versions of these files that SkySpark is using, in case it makes any difference.
I checked configs.json against the original and confirmed only the "use_case" and "exp_dir" keys differ... both of which I believe I have to modify to get the model to work inside SkySpark Dockerland.
EDIT: Of note, the "input_dim" in configs.json is still 237... so where and how does it change to 198 when I'm trying to call predict?
When I export val_df from within SkySpark, though, I get dimensions of 95 x 4951, which is not the same at all.
prep_for_rnn spits out a dataframe shaped (samples, lag*input_dim). input_dim is the number of columns post time processing and rolling, but pre lags. prep_for_rnn also sets input_dim in the config.
4951 minus the target column is 4950.
4950 divided by the 198 "input_dim" is 25, the number of lags.
When you ran prep_for_rnn with the data I provided, you seem to get a val_df with 238 columns, of which the first few columns are the features generated from the 13 predictor columns. (I noticed this is one greater than the needed 237; maybe it is the ts column that is the "extra"?)
this is the correct number of columns. The one extra column is the dummy target column.
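Spelling that lag arithmetic out as a quick check (values taken from this thread):

total_columns = 4951      # val_df exported from SkySpark
target_columns = 1        # the dummy target column
input_dim = 198           # time-based features only, no predictors
lags = (total_columns - target_columns) // input_dim
print(lags)               # 25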
4951 minus the target column is 4950.
4950 divided by the 198 "input_dim" is 25, the number of lags.
Ok, I'm following, but your example val_df (that worked with predict!) did not have 4951 columns. As far as I can tell all the steps in predict.ipynb are exactly the same as what SkySpark is doing, but the results are not the same.
I have also determined that when I run prep_for_rnn, configs "input_dim" is updated from 237 to 198! This also differs from your test. Test code:
// TROUBLESHOOTING: What is input_dim here?
// With FTLB_FTLBCHWMeterCHWEnergyRate_r1: 237
//return session.pyEval("configs['input_dim']")
// Prep data and run prediction
session
.pyExec("_, val_df = prep_for_rnn(configs, predictor_data_frame)")
//.pyExec("results = model.predict(val_df)") // TROUBLESHOOTING: Disable
// TROUBLESHOOTING: What is input_dim here?
// With FTLB_FTLBCHWMeterCHWEnergyRate_r1: 198 (!)
// Conclusion: prep_for_rnn modifies input_dim?
return session.pyEval("configs['input_dim']")
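As a side note, a minimal, hypothetical sketch (not Wattile's actual implementation) of why the in-memory value can disagree with configs.json on disk: if prep_for_rnn mutates the configs dict it was handed, the file still says 237 while pyEval afterwards reports whatever the function set.

import pandas as pd

# hypothetical sketch, not Wattile's code: a function that mutates the
# caller's configs dict, so the in-memory "input_dim" no longer matches
# the value stored in configs.json on disk
def prep_sketch(configs, feature_df):
    configs["input_dim"] = feature_df.shape[1]   # overwrites the loaded value
    return feature_df

configs = {"input_dim": 237}                     # as loaded from configs.json
features = pd.DataFrame({"hour_sin": [0.0], "hour_cos": [1.0]})
prep_sketch(configs, features)
print(configs["input_dim"])                      # 2 — reflects the dataframe, not the file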
I have also determined that when I run prep_for_rnn, configs "input_dim" is updated from 237 to 198! This also differs from your test. Test code:
What do you mean it differs from my test? I feel like this is what I said was happening:
- prep_for_rnn spits out a dataframe shaped (samples, lag*input_dim). input_dim is the number of columns post time processing and rolling, but pre lags. prep_for_rnn also sets input_dim in the config.
- if you give prep_for_rnn a dataframe with 13 predictors, it returns a dataframe with 237 columns.
- if you give prep_for_rnn a dataframe with 0 predictors, it returns a dataframe with 198 columns. It's just the time columns.
What I mean is I'm running the same command on the same data with the same configs.json, and getting different results:
- In your test, the val_df returned for the predictor data frame has 238 columns. In SkySpark, the val_df data frame returned has 4951 columns.
- In your test, after running prep_for_rnn the reported "input_dim" is still 237, as shown below. In SkySpark, after running prep_for_rnn the reported "input_dim" is 198.

This is why I'm at a loss. Your math certainly adds up, but I'm not giving it a data frame with 0 predictors. I'm giving it the same data as used in your test and getting a different result.
I may just need to do a screen share of this next Monday, or sooner if you and Jos want to jump on a call this week.
Ok, I'm following, but your example val_df (that worked with predict!) did not have 4951 columns. As far as I can tell all the steps in predict.ipynb are exactly the same as what SkySpark is doing, but the results are not the same.
That's a good point. Revisiting that notebook, the results are all nan, so maybe it didn't work? Let me investigate
That's a good point. Revisiting that notebook, the results are all nan, so maybe it didn't work? Let me investigate
You may need to adjust configs["data_input"]["start_time"] and configs["data_input"]["end_time"].
do you know what to set them to?
This is why I'm at a loss. Your math certainly adds up, but I'm not giving it a data frame with 0 predictors. I'm giving it the same data as used in your test and getting a different result.
I believe you, I'm just running out of ideas.
I may just need to do a screen share of this next Monday, or sooner if you and Jos want to jump on a call this week.
I think that would be smart
I think it'll have to be Friday, if you two are available. Otherwise we'll just look at it on Monday.
Friday works
test_val_df.csv does not contain non-time-based predictors, in contrast to the val_df built from running prep_for_rnn on test_predictor_data.pickle.
Update from today's troubleshooting session: Turns out I was running an old Wattile version due to a bug with the docker build. This is addressed with https://github.com/NREL/nrelWattileExt/pull/46. I will test whether the problem goes away when I have the right Wattile version, which I expect.
Success! Now that I'm on the correct version of Wattile, this problem is magically gone, and probably reflected some bug we fixed a year ago. Closing this issue.
[like] Kim, Janghyun reacted to your message.
With the latest build of Wattile (from main, built 11/27/2023), in SkySpark, I get the following error when attempting to run prediction from the models that Janghyun trained back in May:

I was testing with a freshly loaded FTLB_FTLBCHWMeterCHWEnergyRate_r1 model, obtained from the v10_small directory shared from JangHyun via OneDrive. I did not have this issue with previous builds of Wattile.

Is this reproducible outside of SkySpark? I can share both the Docker image I am using and the model if needed.