Open zschira opened 1 year ago
Git LFS has a 5GB limit which would work for this file, but might cause issues down the road if we ever have larger assets to store.
This definitely sounds like the kind of thing mlflow is designed for. How would it be deployed? How long would it take to set up? If it's relatively simple, I think it's worth setting up so we're using the standard tool for larger models and updates.
This seems like an entirely new variety of dependency that we haven't had to work with before -- where the current ETL run depends on the outputs of previous ETL runs.
That feels similar to the regression testing / output differ that @rousik has been working on. How does Dagster think about storing ML models as assets for later re-use (see also here, and their MLflow integration)? Is it just pickling dataframes or sklearn classes? Would it make sense to save models alongside our other durable outputs that get written into cloud storage, so they'd be available in object storage too?
I think we should ensure that the ETL falls back gracefully to figuring things out on its own if there's no cached version of these outputs available, which would mean we need to test the case where they aren't available. Otherwise we may go back to update the model once a quarter or year or whatever and discover that... it doesn't work.
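As a sketch of what that graceful fallback could look like at the asset level (all names here are hypothetical, and this isn't wired into the actual ETL):

```python
from pathlib import Path

import joblib  # what sklearn's docs recommend for persisting fitted estimators
from dagster import asset

# Hypothetical cache location; in practice this might live in cloud storage
# alongside our other durable outputs rather than on local disk.
CACHED_MODEL_PATH = Path("ferc_plant_matching_model.joblib")


@asset
def plant_matching_model(plant_training_features):
    """Return the pre-fitted matching model, retraining if no cache exists.

    The point is the fallback behavior: the ETL should never hard-fail just
    because the cached weights are missing, stale, or unreadable.
    """
    if CACHED_MODEL_PATH.exists():
        try:
            return joblib.load(CACHED_MODEL_PATH)
        except Exception:
            pass  # a corrupt or incompatible cache shouldn't break the run
    model = fit_plant_matching_model(plant_training_features)  # hypothetical trainer
    joblib.dump(model, CACHED_MODEL_PATH)
    return model
```

Testing the no-cache path would then just be a matter of materializing the asset with the cache file deleted.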
Does this thing really need to be a GB? That's surprising to me, but maybe I don't understand how things work. Why are there so many weights? How much faster is "much faster"? Are we talking about saving minutes or hours?
I don't think using the Datastore without anything backing it on Zenodo sounds like a great option. That's kind of contrary to the design/use case of the Datastore right now.
Given that all of our stuff is open we could look at the 🤗 model hub for storage, at least as an example of how folks store models for later use. It looks like they're using git LFS, and they have integrations with a bunch of different ML libraries, including sklearn.
Having a fairly static model that we feed into the ETL for a speed-up sounds like a reasonable approach to me. This is indeed a new paradigm for PUDL, so we should consider it a bit carefully.
My suggestions would be:
Is there a reason why we couldn't publish the model weights on zenodo similarly to how we publish other datasets?
Does this thing really need to be a GB? That's surprising to me, but maybe I don't understand how things work. Why are there so many weights? How much faster is "much faster"? Are we talking about saving minutes or hours?
It does seem quite large, but the matrix that ultimately gets fed into the PCA after all the embedding steps has something like 50,000 columns. It looks like sklearn is just pickling the pipeline class that we create plus some other glue/metadata files on top, but I'm not 100% sure what the output would look like when using different tools and integrations. As for the time, it's taking over 20 mins to train, so not absurd, but we also probably don't want to be running this every ETL run.
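For concreteness, a minimal sketch of what the sklearn-side caching amounts to (a toy stand-in pipeline, not the real embedding steps, and joblib as the serializer, which is essentially pickling):

```python
import joblib
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the real embedding + PCA pipeline; the actual matrix fed
# into the PCA has ~50,000 columns, which is where the ~1GB of weights comes from.
pipeline = make_pipeline(StandardScaler(), PCA(n_components=50))
# pipeline.fit(feature_matrix)  # fit on the embedded FERC plant records

joblib.dump(pipeline, "plant_matching_pipeline.joblib")  # the artifact we'd cache
restored = joblib.load("plant_matching_pipeline.joblib")
```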
The hugging face model hub looks pretty cool, and it seems like it would be completely free to implement since the storage backend is just git LFS. I also think if we used one repo per model, we probably wouldn't run into size limits for git LFS.
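If we went that route, pushing and pulling the serialized pipeline via the hub is pretty lightweight. A sketch using huggingface_hub, with a made-up repo name:

```python
from huggingface_hub import HfApi, hf_hub_download

api = HfApi()

# Upload the serialized pipeline to a (hypothetical) model repo.
api.upload_file(
    path_or_fileobj="plant_matching_pipeline.joblib",
    path_in_repo="plant_matching_pipeline.joblib",
    repo_id="catalyst-cooperative/ferc-plant-matching",  # made-up repo name
    repo_type="model",
)

# Later, the ETL (or CI) can pull the cached weights back down.
local_path = hf_hub_download(
    repo_id="catalyst-cooperative/ferc-plant-matching",
    filename="plant_matching_pipeline.joblib",
)
```

The upload would need a write token, but downloads from a public model repo don't require auth, which also speaks to the openness question below.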
@bendnorman the deployment of mlflow would require a Cloud Run instance to host the "tracking server" that you interact with directly via a client library, plus the storage of course, and a Cloud SQL instance if we want to use mlflow for experiment tracking as well. It would definitely require a bit of work to set up, and it's more infrastructure to maintain, but I think the advantages would be that, like you said, it's a pretty standard tool, it's very feature-rich, and it would be agnostic to whatever cloud backend we use.
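For a sense of scale, the client side once a tracking server exists is quite small. A sketch with placeholder server URL and model names:

```python
import mlflow
import mlflow.sklearn
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Point the client at the (hypothetical) Cloud Run tracking server.
mlflow.set_tracking_uri("https://mlflow.catalyst.example")

pipeline = make_pipeline(StandardScaler(), PCA(n_components=50))  # toy stand-in

with mlflow.start_run():
    mlflow.log_metric("train_minutes", 20.0)
    mlflow.sklearn.log_model(
        pipeline,
        artifact_path="plant_matching_model",
        registered_model_name="ferc_plant_matching",  # placeholder name
    )

# Elsewhere in the ETL, load a pinned registered version instead of refitting.
model = mlflow.sklearn.load_model("models:/ferc_plant_matching/1")
```

Registering models like this is what pulls in the Cloud SQL requirement, since the model registry needs a database-backed store behind the tracking server.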
One other option we could use is GCP's built-in Vertex AI, which is free to use for just the model registry side (excluding storage/egress costs). The biggest downside I see to this option is that we'd really be tying ourselves to the Google ecosystem.
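If it helps for comparison, registering the same artifact in Vertex AI looks roughly like this (a sketch only; the project, bucket, and container image are assumptions, and the serving container only matters if we ever deploy the model rather than just registering it):

```python
from google.cloud import aiplatform

# Hypothetical project and region.
aiplatform.init(project="catalyst-cooperative-pudl", location="us-east1")

model = aiplatform.Model.upload(
    display_name="ferc-plant-matching",
    # GCS directory containing the serialized sklearn model (e.g. model.joblib).
    artifact_uri="gs://pudl-models/ferc-plant-matching/v1/",
    # Prebuilt sklearn prediction container (assumption; only needed for serving).
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"
    ),
)
print(model.resource_name)  # registry ID we could record in PUDL metadata
```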
A question that comes up from this discussion is how important it is that the trained models be openly accessible to everyone. Most of these options would use Google Cloud Storage, which will limit access for people outside Catalyst. Maybe this isn't that big of a deal since we're primarily distributing data now, and anyone who wants to run the software could still just run the training locally, but I could still see this being at least a slight barrier for external contributors. It might not be too big of a problem now, but it could become larger if we start using more complex models that take longer to train and/or have a non-deterministic training process.
There's no a priori reason we couldn't put the models on Zenodo. It's just a file. But it's not particularly general-purpose data of public interest, and it's very different from the raw data inputs or generally useful long-term public outputs that we've published there historically, and that Zenodo is trying to facilitate the storage and dissemination of.
I'm not sure all this new infrastructure is really warranted to save 20 minutes in an ETL that takes hours every night, especially if it's one CPU for 20 minutes -- there's plenty of CPU slack in the full ETL already that this could soak up (unless the new model training is saturating all CPUs).
Every linkage to an outside system will be subject to failure and will be an additional maintenance burden, as evidenced by our recent series of maintenance blowups (e.g. Zenodo API changes, python dependency inconsistencies). I don't know how much more maintenance we can reasonably take on.
The benefit to having the model be versioned and publicly available seems pretty marginal to me (though, no harm in having it out there!). It's just such a niche thing that the model is doing. Nobody else is ever going to use it, and it can be retrained in a few minutes in the pipeline, which seems like a simpler means of reproduction to me.
@zaneselvans that's definitely a very important point, maintenance has become a pretty huge time sink. I do think we still need to consider the time we lose waiting on tests/local ETL runs to finish, which also eats up a significant and hard to quantify amount of our time. Right now the training time for the fast ETL subset is not huge on the scale of a CI run, but also not completely negligible (~5 mins on my computer), and while there might be CPU to spare, it is quite memory intensive, so there still could be a bottleneck. I can try to do some test runs and see how much the total CI time varies between a run with pre-trained weights and one that does on-the-fly training.
One final factor we might want to consider is how much ML work we foresee doing in the future. If we end up using models that require dedicated GPUs for training, or take hours to train, then we're certainly going to need this type of infrastructure, so we could decide to use the CCAI resources we have right now to start preparing our system to accommodate that, or we could just cross that bridge when we get there.
I can imagine a future in which we're producing pre-trained models and re-using them, but I don't think we're there right now and it's not clear that we're going to get there in the near future. The idea also kind of makes me nervous, since we won't necessarily be retraining the model when the underlying data changes. That change could be due to new data being released, in which case hopefully we would remember to retrain the model, or it could be due to changes in the code, which might be easy to forget if we don't realize we've changed the data that's being fed in and would have been used for training.
I feel like we're in kind of a desperate sprint to get usable CCAI outputs integrated into PUDL at this point, and it doesn't feel to me like we have much in the way of resources (either time or $$$) to spend on this infrastructure setup. I could imagine it falling under the umbrella of the data validation / CI / contributor ergonomics stuff in the NSF grant if we get it, but right now I think the priority is something that is actually integrated, definitely reproducible, and represents an improvement over the past system, even if the performance isn't ideal.
When you say that the model "takes 20 minutes to train" do you mean:
Is there a side by side comparison of the old pre-CCAI clustering/classification and the new version (and maybe the new version with a pre-trained model) somewhere that shows for each of them:
If training the new model takes 20 minutes but it's mixed in with all the other assets and doesn't end up being the thing that controls the length of the run, then it doesn't seem too concerning to me. The current method on dev already takes 20 minutes. Of course it would be really nice to cut it down! And reduce memory usage! And get better results! But it sounds like the new system is not, as hoped, actually intrinsically more resource efficient, so our best hope is that it produces significantly better results without using a bunch more resources. And in that circumstance I think not introducing additional infrastructural complexity and dependencies right now is probably the right thing to do.
I've just been materializing the steam plants asset on its own and the total runtime has been on the order of 20 minutes. I haven't run the full ETL yet, so I'm not 100% sure what the impact will be. I'll work on pulling together a comparison today. The metrics we have so far are encouraging and suggest the new model is performing better than the old one, although it will always be difficult to say without ground truth.
I definitely agree that the top priority needs to be just getting something into production that is an improvement over the old implementation. With pre-trained weights, the model runs in just seconds, so it will be a bit disappointing to leave that optimization on the table right now, but I think you've made a pretty strong case that we should hold off, and we can always come back in the future when we have more time and resources to invest in new infrastructure.
I just think it's too late in the project to open up this new infrastructural question. We should look at it in the context of the NSF CI / contributor ergonomics work if we get the grant. I'm also pretty paranoid about caching at this point. It's so easy to end up with something that's out-of-sync causing problems that are difficult to debug. I look forward to seeing the comparison though!
Background
The updated FERC-FERC inter-year plant matching model in #3007 uses PCA, which is much faster if we pre-fit the model and save the weights somewhere. However, when we cache the model using sklearn's built-in tooling for this, it contains many files and occupies close to 1GB of disk space, so we probably shouldn't be committing this directly to PUDL.
Some possible approaches I see for dealing with this:
Use git lfs
I don't have much experience with git lfs, so I don't have a great sense for the tradeoffs involved, but it seems very possible.
Use GCS and the Datastore
We could upload the weights to a cloud bucket and potentially use the Datastore for access. We would need to be able to upload weights, and we wouldn't be using Zenodo as the backend, so the Datastore might need to be reworked for this purpose. The model doesn't really need to be updated frequently, so maybe we could make the pre-fitting/uploading a manual process, but that doesn't feel ideal. (See the GCS sketch after this list.)
Use GCS with mlflow
mlflow has tooling for storing models and associating them with different performance metrics. It has built-in integration with sklearn and other ML frameworks, and can use GCS as a storage backend. This tooling is nice, but it might be overkill just to store weights for a pretty simple model that doesn't need to change frequently. However, if we plan to tackle more of these record linkage problems, and potentially integrate more complex models into PUDL, then maybe it would be smart to start moving in this direction.
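Whichever of the GCS-backed options we pick, the raw storage side is the same and is pretty simple. A sketch with a made-up bucket and object path (the Datastore, mlflow, etc. would just be layered on top of something like this):

```python
from google.cloud import storage

# Hypothetical bucket and object names.
BUCKET = "pudl-models"
BLOB = "ferc-plant-matching/plant_matching_pipeline.joblib"

client = storage.Client()
bucket = client.bucket(BUCKET)

# Manual "pre-fit and upload" step, run whenever the model is retrained:
bucket.blob(BLOB).upload_from_filename("plant_matching_pipeline.joblib")

# At ETL time, pull the cached weights back down:
bucket.blob(BLOB).download_to_filename("plant_matching_pipeline.joblib")
```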