HunterMcGushion / hyperparameter_hunter

Easy hyperparameter optimization and automatic result saving across machine learning algorithms and libraries
MIT License

Working with big data #136

Open Utayaki opened 5 years ago

Utayaki commented 5 years ago

Your library is superb! I want to use it in my main project.

However, I'm working with big data, and I've saved my numpy arrays to a folder in batches, because I can't hold the full array in RAM while training. As far as I can see, though, the library needs to load the dataframe completely, which I unfortunately can't do.

Is there a solution to that problem?

HunterMcGushion commented 5 years ago

@Utayaki, thank you very much for your support of HyperparameterHunter and for raising this issue! Unfortunately, for now, HyperparameterHunter doesn’t handle automatically loading big datasets in batches.

I would absolutely love to add support for this, but my todo list is getting quite long, and I’m completely focused on finalizing v3.0.0 at the moment. So I fear it may be a while unless you (or someone following this issue) is willing to work with me on a solution or submit a PR.

If your dataset isn’t publicly available, can you refer me to a large dataset I can use for testing?

Forgive me if this is a silly question, as I’m quite unfamiliar with processing big data, but is it possible for you to use lambda_callback to build a custom callback that handles loading and unloading your dataset batches? I really have no idea if this will work, but I do know that lambda_callback is deceptively powerful and very useful, since it gets injected directly into the parent classes of CVExperiment. If you’re interested in checking out lambda_callback, let me know and I’d be happy to go into more detail on the other callback/data features in HyperparameterHunter that might help solve this problem.
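
To make that concrete, here's a minimal, untested sketch of how a `lambda_callback` gets wired into an `Environment`. The `on_fold_start` hook name, the signature-matching behavior, and the `train_df` variable are assumptions on my part, so double-check `lambda_callback`'s docstring for your version:

```python
from hyperparameter_hunter import Environment, lambda_callback

def announce_fold(_rep, _fold):
    # `lambda_callback` is documented to inspect hook signatures and pass in
    # experiment attributes with matching names (here, the repetition and
    # fold counters) -- an assumption worth verifying for your version
    print(f"Starting fold {_fold} of repetition {_rep}")

fold_logger = lambda_callback(on_fold_start=announce_fold)

env = Environment(
    train_dataset=train_df,  # assumed to be a DataFrame defined elsewhere
    results_path="HyperparameterHunterAssets",
    metrics=["roc_auc_score"],
    cv_params=dict(n_splits=5),
    experiment_callbacks=[fold_logger],  # where LambdaCallbacks get injected
)
```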

Thanks again for raising this issue. Until we find a solution, I'd like to keep it open to see how many others have the same problem and want to help fix it.

Utayaki commented 5 years ago

@HunterMcGushion Thank you for answering, and sorry for not writing back for so long!

I'd be glad to work on this issue with you to make the library better; it would be a very interesting experience for me. As for lambda_callback, I hadn't even heard of that feature! If it's really that powerful, I will definitely use it. However, I can't come up with a proper way to use lambda_callback here. Could you explain it to me?

As for the big data, how would you like me to share it with you for testing?

Have a nice day, and again, sorry for keeping you waiting so long!

HunterMcGushion commented 5 years ago

No worries! Aside from the documentation, which you've already seen, I'd recommend a few other resources:

  1. lambda_callback_example.py - Despite being in "advanced_examples", this is actually a very simple logging example, but it illustrates how to provide lambda_callback to Environment, so it can be used by CVExperiment and Optimization Protocols
  2. callbacks/recipes.py - These are actual implementations of lambda_callback. Documentation for these functions is pretty excessive, so if it seems daunting at first, try minimizing the docstrings. Most of the functions are only a couple of lines
  3. The two resources above still only scratch the surface, because really every callback in the callbacks module could be expressed as a lambda_callback; they just bypass it by inheriting directly from BaseCallback. You can (and probably want to) do the same thing

For an example of how to implement your own BaseCallback descendant, I built one for testing the wranglers.predictors here. Then the two test functions at the end of the same module (first test and second test) both use the callback class defined above. Note that these two functions don't use Environment's experiment_callbacks parameter. However, despite the parameter's documented type of LambdaCallback, it's perfectly happy with any descendant of BaseCallback.
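
For reference, here's a bare-bones, untested sketch of what such a `BaseCallback` descendant can look like. The import path and the hook-method name are assumptions based on the library's current layout, so check `callbacks/bases.py` in your installed version:

```python
from hyperparameter_hunter.callbacks.bases import BaseCallback

class FoldAnnouncer(BaseCallback):
    """Hypothetical callback that logs each fold as it begins."""

    def on_fold_start(self):
        # Once ExperimentMeta mixes this class into an Experiment, `self` is
        # the Experiment itself, so attributes like `_fold` should be available
        print(f"Starting fold {self._fold}")
        # Keep the cooperative callback chain going; this matters once the
        # class sits inside the Experiment's MRO
        super().on_fold_start()
```

As noted above, a class like this can be handed to Environment's experiment_callbacks parameter even though the documented type is LambdaCallback.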

I think it's worth mentioning that the callbacks defined in the callbacks module are critical components of CVExperiment. Callbacks in HyperparameterHunter aren't just extras added as an afterthought. They're absolutely central to Experiments, which is why they are dynamically selected and used to define an Experiment by ExperimentMeta. LambdaCallbacks are designed and intended to be used in the same way. They're not limited to performing "extra" tasks, although they can certainly do that. They exist to enable serious customization of the entire Experiment - hopefully so people can add functionality I couldn't anticipate.

Also note that you probably don't need to worry about the ExperimentMeta linked above, since metaclasses are sometimes considered Python "black magic". However, it may be worthwhile to see for yourself that ExperimentMeta is the bridge between all of the callbacks and CVExperiment. It literally just picks which callbacks the Experiment needs to inherit and drops them in.
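
If you do peek at it, this is, very loosely, the trick. This is a generic Python illustration of injecting extra base classes via a metaclass, not HyperparameterHunter's actual code, and the `_callbacks` attribute is hypothetical:

```python
class InjectCallbacksMeta(type):
    """Generic sketch: build a class with callback classes added as bases."""

    def __new__(mcs, name, bases, namespace):
        # Hypothetical `_callbacks` attribute lists the callbacks to mix in
        callbacks = tuple(namespace.pop("_callbacks", ()))
        return super().__new__(mcs, name, callbacks + bases, namespace)

class FoldLogger:
    def on_fold_start(self):
        print("fold starting")

class SketchExperiment(metaclass=InjectCallbacksMeta):
    _callbacks = (FoldLogger,)

SketchExperiment().on_fold_start()  # prints "fold starting"
```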

Sorry about that wall of text. If you have any other questions, please don't hesitate to ask here. I'm more than happy to help however I can!

Edit: If you're still using hyperparameter_hunter v2.2.0 (or lower), you should consider switching to the most recent v3.0.0 alpha build. 3.0.0 completely overhauls how datasets are handled via the callbacks.wranglers and data modules, so any callbacks dealing with data in v2 would need to be updated anyway.

Utayaki commented 5 years ago

@HunterMcGushion Thank you for the explanation! However, I've never worked with callback functions, so I don't know how to use them. Please tell me if I'm wrong on any point. As I understand it:

That's the approach I came up with while reading your examples. It seems pretty easy to code.

Am I right in my statements above? Have a nice day!

HunterMcGushion commented 5 years ago

Yeah, that sounds about right if you're using the lambda_callback function rather than subclassing BaseCallback. To clarify your second point, you can use more than one of the callback functions; however, on_repetition_start (or maybe on_fold_start) does seem to be the most suitable in this situation. Also, you'll probably want to initialize your Environment with the same train_dataset batch every time.

Again, I'm not sure how this will work out, but feel free to post code here, or to fork the repo and create a new branch, so we can discuss it more easily.
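
To get things started, here's a rough, untested sketch of the wiring we've been discussing. The `BATCH_PATHS` layout, the `on_repetition_start` hook name, and the `_rep` counter argument are all assumptions, and it deliberately leaves open how the loaded batch would actually replace the Experiment's train data:

```python
import numpy as np
from hyperparameter_hunter import lambda_callback

# Hypothetical layout: one pre-saved numpy batch per repetition
BATCH_PATHS = ["batches/batch_0.npy", "batches/batch_1.npy", "batches/batch_2.npy"]

def load_repetition_batch(_rep):
    """Load the saved batch for this repetition, keeping only one in RAM."""
    batch = np.load(BATCH_PATHS[_rep % len(BATCH_PATHS)])
    print(f"Repetition {_rep}: loaded batch of shape {batch.shape}")
    # How `batch` would then be substituted for the Experiment's train data
    # is the open question here -- this only shows where the hook would fire

batch_loader = lambda_callback(on_repetition_start=load_repetition_batch)
```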

HunterMcGushion commented 5 years ago

@Utayaki,

I was wrong. I’m sorry, but after looking into it a bit, I’ve just realized that this will require changes to models.Model as well. I have no idea how I missed that. So this can’t be done with just a lambda_callback. I’m sorry for sending you in the wrong direction. I am still very much interested in solving this problem, though. I just need to finish the 3.0.0 release first. Sorry, again!