I'd like us to show off using dask-xgboost to train models on large datasets. We'd probably run this at some longer cadence, and host a FastAPI endpoint serving the trained model (we already have the latter components for a model trained on a smaller dataset).
I don't know exactly how best to go about this, but I suspect it looks something like the following:

1. Find a good problem within our schema to answer. For example, looking at the lineitem table one might ask "What makes an item likely to be returned?" (see the ReturnFlag column).
2. Select the set of tables we need to answer that (maybe lineitem and supplier merged together, if that's not too big?).
3. Do whatever ML stuff one does (cross validation, etc.) — see below for a rough sketch of steps 2 and 3 after this list.
4. Set that up as a flow on a cluster with appropriate hardware.
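To make the shape of this concrete, here's a minimal sketch of steps 2 and 3. It assumes TPC-H-style column names, hypothetical parquet paths, a placeholder scheduler address, and a guessed feature set, and it uses the `xgboost.dask` module (the Dask integration now lives inside xgboost itself; the standalone `dask-xgboost` package was deprecated in its favor). Treat it as a starting point, not a spec:

```python
# Sketch: train an XGBoost model on TPC-H lineitem + supplier with Dask.
# Paths, the scheduler address, and the feature list are all hypothetical.
import dask.dataframe as dd
import xgboost as xgb
from dask.distributed import Client

client = Client("tcp://scheduler-address:8786")  # placeholder address

# Load and join the two tables (parquet paths are illustrative).
lineitem = dd.read_parquet("s3://our-bucket/tpch/lineitem/")
supplier = dd.read_parquet("s3://our-bucket/tpch/supplier/")
df = lineitem.merge(supplier, left_on="l_suppkey", right_on="s_suppkey")

FEATURES = ["l_quantity", "l_extendedprice", "l_discount", "l_tax", "s_acctbal"]

# Label: was the line item returned? (ReturnFlag == "R" in TPC-H)
df["returned"] = (df["l_returnflag"] == "R").astype(int)

# Simple holdout split; a real flow would do proper cross validation.
train, test = df.random_split([0.8, 0.2], random_state=0)

dtrain = xgb.dask.DaskDMatrix(client, train[FEATURES], train["returned"])
dtest = xgb.dask.DaskDMatrix(client, test[FEATURES], test["returned"])

result = xgb.dask.train(
    client,
    {"objective": "binary:logistic", "eval_metric": "auc", "tree_method": "hist"},
    dtrain,
    num_boost_round=100,
    evals=[(dtest, "test")],
)
print(result["history"])  # per-round AUC on the holdout set
result["booster"].save_model("returned_model.json")
```

The nice part here is that the trained booster is a plain single-machine XGBoost model, so the existing serving setup shouldn't need to change just because training happened on a cluster.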
But again, I don't know this space well, so whoever takes this on would have to figure out exactly what makes sense and would be compelling. Mostly I just want people to see that dask-xgboost exists and works decently well.
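For completeness, the serving side might look roughly like the snippet below. We already have this piece for the smaller model, so this is just an illustration; the model filename and feature names are placeholders carried over from the training sketch above:

```python
# Sketch: serve the saved booster behind a FastAPI endpoint.
# Run with: uvicorn serve:app
import numpy as np
import xgboost as xgb
from fastapi import FastAPI
from pydantic import BaseModel

FEATURES = ["l_quantity", "l_extendedprice", "l_discount", "l_tax", "s_acctbal"]

app = FastAPI()
booster = xgb.Booster()
booster.load_model("returned_model.json")  # artifact from the training flow

class LineItemFeatures(BaseModel):
    l_quantity: float
    l_extendedprice: float
    l_discount: float
    l_tax: float
    s_acctbal: float

@app.post("/predict")
def predict(item: LineItemFeatures):
    # Build a single-row DMatrix with the same feature names used in training.
    row = np.array([[getattr(item, f) for f in FEATURES]])
    dmat = xgb.DMatrix(row, feature_names=FEATURES)
    prob = float(booster.predict(dmat)[0])
    return {"return_probability": prob}
```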