Utayaki opened this issue 5 years ago (status: Open)
@Utayaki, thank you very much for your support of HyperparameterHunter and for raising this issue! Unfortunately, for now, HyperparameterHunter doesn’t handle automatically loading big datasets in batches.
I would absolutely love to add support for this, but my todo list is getting quite long, and I’m completely focused on finalizing v3.0.0 at the moment. So I fear it may be a while unless you (or someone following this issue) is willing to work with me on a solution or submit a PR.
If your dataset isn’t publicly available, can you refer me to a large dataset I can use for testing?
Forgive me if this is a silly question, as I’m quite unfamiliar with processing big data, but is it possible for you to use `lambda_callback` to build a custom callback that handles loading and unloading your dataset batches? I really have no idea if this will work, but I do know that `lambda_callback` is deceptively powerful and very useful, since it gets injected directly into the parent classes of `CVExperiment`. If you’re interested in checking out `lambda_callback`, let me know and I’d be happy to go into more detail on the other callback/data features in HyperparameterHunter that might help solve this problem.
Thanks again for raising this issue. Until we find a solution, I'd like to keep it open to see how many others have the same problem and want to help fix it.
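To make the suggestion above a bit more concrete, here is a minimal sketch of a batch-loading helper that a `lambda_callback` could invoke at the start of each repetition. Everything here is an assumption for illustration: `make_batch_loader` is a hypothetical helper, and the commented-out wiring at the bottom guesses at the `lambda_callback` / `experiment_callbacks` API rather than reproducing it.

```python
# Hypothetical sketch of the batch-loading idea discussed above.
# `make_batch_loader` is an invented helper; the wiring comment at the
# bottom assumes, but does not verify, HyperparameterHunter's API.
import glob
import os

def make_batch_loader(batch_dir):
    """Return a zero-argument callable that cycles through saved batch files."""
    paths = sorted(glob.glob(os.path.join(batch_dir, "*.npy")))
    state = {"i": 0}

    def load_next_batch():
        # In a real callback this would np.load the file and swap the arrays
        # into the Experiment's dataset; here we just return the next path.
        path = paths[state["i"] % len(paths)]
        state["i"] += 1
        return path

    return load_next_batch

# Hypothetical wiring (untested; the lambda_callback signature is assumed):
# from hyperparameter_hunter import lambda_callback
# batch_cb = lambda_callback(on_repetition_start=make_batch_loader("batches/"))
# env = Environment(..., experiment_callbacks=[batch_cb])
```

The closure keeps its position in a small `state` dict so successive calls walk through the saved batches in order and wrap around at the end.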
@HunterMcGushion Thank you for answering, and sorry for not writing to you for so long!
I would be glad to work with your library on this issue to make it better; that will be a very interesting experience for me. As for `lambda_callback`, I didn't even know about that feature! If it is really that powerful, I will definitely use it! However, I can't come up with an idea of how to use `lambda_callback` properly. Could you explain this to me?
As for the big data, how would you like me to provide it for testing?
Have a nice day, and again, sorry for keeping you waiting for so long!
No worries! Aside from the documentation, which you've already seen, I'd recommend a few other resources:
- The examples of adding a `lambda_callback` to `Environment`, so it can be used by `CVExperiment` and Optimization Protocols.
- The source of `lambda_callback`. Documentation for these functions is pretty extensive, so if it seems daunting at first, try minimizing the docstrings. Most of the functions are only a couple of lines.
- Everything in the `callbacks` module can be expressed as a `lambda_callback`, but they bypass it by inheriting directly from `BaseCallback`. You can (and probably want to) do the same thing.
- For an example of how to implement your own `BaseCallback` descendant, I built one for testing the `wranglers.predictors` here. Then the two test functions at the end of the same module (first test, and second test) both use the callback class defined above. Note that these two functions don't use `Environment`'s `experiment_callbacks` parameter. However, despite the parameter's documented type of `LambdaCallback`, it's perfectly happy with any descendant of `BaseCallback`.
- Note the `G.priority_callbacks` used by the two test functions to push your callback directly to the front of the Experiment's MRO, but that's not the "officially recommended" way to do it, unless it's necessary.

I think it's worth mentioning that the callbacks defined for the library in the `callbacks` module are actual critical components of `CVExperiment`. Callbacks in HyperparameterHunter aren't just extra things added as an afterthought. They're absolutely central to Experiments, which is why they are dynamically selected and used to define an Experiment by `ExperimentMeta`. LambdaCallbacks are designed and intended to be used in the same way. They're not limited to performing "extra" tasks, although they can certainly do that. They exist to enable serious customization of the entire Experiment. Hopefully, that lets people add the functionality they need that I couldn't anticipate.
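As a toy illustration of the subclassing approach mentioned above (this is not the real `BaseCallback` from the library's `callbacks` module; the class shape and hook name here are simplified assumptions):

```python
# Toy stand-in for the BaseCallback-subclassing pattern described above.
# The real BaseCallback and its hook names live in HyperparameterHunter's
# callbacks module; this simplified shape is an assumption for illustration.
class BaseCallback:
    """Stand-in base class exposing a hook point an Experiment would invoke."""

    def on_fold_start(self):
        pass


class BatchSwapCallback(BaseCallback):
    """Descendant that counts fold starts, standing in for logic that would
    load/unload a dataset batch at that point."""

    def __init__(self):
        self.fold_starts = 0

    def on_fold_start(self):
        self.fold_starts += 1
```

With the real library, a descendant like this could be passed through `Environment`'s `experiment_callbacks` parameter, which (as noted above) accepts any `BaseCallback` descendant despite its documented `LambdaCallback` type.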
Also note that you probably don't need to worry about the `ExperimentMeta` linked above, since metaclasses are sometimes considered Python "black magic". However, it may be worthwhile to see for yourself that `ExperimentMeta` is the bridge between all of the `callbacks` and `CVExperiment`. It literally just picks which callbacks the Experiment needs to inherit and drops them in.
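For intuition, here is a toy metaclass in the spirit of that description. It is not the real `ExperimentMeta`; the names `wanted_callbacks` and `on_start` are invented for this sketch, which only demonstrates the "pick callbacks and drop them into the inheritance chain" idea.

```python
# Toy illustration (NOT the real ExperimentMeta) of dynamically selecting
# callback classes and inserting them as bases of an Experiment class.
class LoggerCallback:
    def on_start(self, log):
        log.append("logger")


class PredictorCallback:
    def on_start(self, log):
        log.append("predictor")


class ExperimentMeta(type):
    def __new__(mcs, name, bases, namespace):
        # Prepend whichever callback classes the new class asked for.
        extra = tuple(namespace.get("wanted_callbacks", ()))
        return super().__new__(mcs, name, extra + bases, namespace)


class CVExperiment(metaclass=ExperimentMeta):
    wanted_callbacks = (LoggerCallback, PredictorCallback)

    def run(self):
        # Walk the MRO and fire every callback hook that was mixed in.
        log = []
        for cls in type(self).__mro__:
            if "on_start" in vars(cls):
                cls.on_start(self, log)
        return log
```

Running `CVExperiment().run()` fires the hooks of both mixed-in callbacks in MRO order, which is the same mechanism (in miniature) that lets the library make callbacks "parent classes" of the Experiment.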
Sorry about that wall of text. If you have any other questions, please don't hesitate to ask here. I'm more than happy to help however I can!
Edit: If you're still using hyperparameter_hunter v2.2.0 (or lower), you should consider switching to the most recent v3.0.0 alpha build. 3.0.0 completely overhauls how datasets are handled via the `callbacks.wranglers` and `data` modules, so any callbacks dealing with data in v2.2.0 would need to be updated anyway.
@HunterMcGushion Thank you for the explanation! However, I have never worked with callback functions, so I don't know how to use them. So please, tell me if I'm wrong at some point. As I understood it:
-I There are keyword arguments of `lambda_callback`, which define when a function should be called.
-II I need to find the one that will load data on every iteration in the Experiment. From that point, `on_repetition_start` is the most suitable one.
-III After that I should write down the function and put `lambda_callback(on_repetition_start=load_data)` in the `experiment_callbacks` line.
That's a decision I came up with when I was reading your examples. It seems to be pretty easy to code.
Am I right in my statements above? Have a nice day!
Yeah, that sounds about right if you're using the `lambda_callback` function, rather than subclassing `BaseCallback`. To clarify on your second point, you can use more than one of the callback functions. However, `on_repetition_start` (or maybe `on_fold_start`) does seem to be the most suitable in this situation. Also, you'll probably want to initialize your `Environment` with the same `train_dataset` batch every time.
Again, I'm not sure how this will work out, but feel free to post code here, or to fork the repo and create a new branch, so we can discuss it more easily.
@Utayaki,
I was wrong. I’m sorry, but after looking into it a bit, I’ve just realized that this will require changes to `models.Model` as well. I have no idea how I missed that. So this can’t be done with just a `lambda_callback`. I’m sorry for sending you in the wrong direction. I am still very much interested in solving this problem, though. I just need to finish the 3.0.0 release first. Sorry, again!
Your library is superb! I want to use it on my main project.
However, I'm working with Big Data, and I saved numpy arrays in batches to my folder, because I can't store the full array in my RAM while training. However, as I see it, the library needs to load the dataframe completely, which I unfortunately can't do.
Is there a solution to that problem?