NicolasHug / Surprise

A Python scikit for building and analyzing recommender systems
http://surpriselib.com
BSD 3-Clause "New" or "Revised" License

Surprise on multiple machines #373

Open steventartakovsky-vungle opened 3 years ago

steventartakovsky-vungle commented 3 years ago

Where is the documentation on the dataset size limitation and how to scale Surprise to multiple machines?

Thanks - Steven

DiegoCorrea commented 3 years ago

Nicolas, thanks for the Surprise lib.

I have a question that is likely related to the one above. I have 21-51 nodes with 24 cores each.

I'm trying to use the Dask library to parallelize the work across multiple nodes. Here is what I want to do:

  1. For example, I want to grid search NMF (biased=True) with cv=3.
  2. If I set 3 options for each of the 8 parameters (e.g. "n_factors": [50, 100, 150]), there are 3^8 = 6,561 candidate combinations; with 3-fold CV that makes 6,561 * 3 = 19,683 fits in total (see the sketch after this list).
  3. I can allocate a minimum of 21 nodes at the same time, with 24 cores each, so 21 * 24 = 504 fits can run concurrently.
  4. So I want to distribute 504 fits at a time across 21 nodes with 24 cores, one fit per core.
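
For reference, a minimal sketch of the sizing arithmetic above (the figures come straight from the list; nothing here is specific to Surprise or Dask):

```python
# Grid-search sizing arithmetic from the list above.
n_options_per_param = 3
n_params = 8
cv_folds = 3

n_combinations = n_options_per_param ** n_params  # 3^8 = 6,561 candidates
n_fits = n_combinations * cv_folds                # 6,561 * 3 = 19,683 fits

nodes, cores_per_node = 21, 24
concurrent_fits = nodes * cores_per_node          # 504 fits at a time

print(n_combinations, n_fits, concurrent_fits)    # 6561 19683 504
```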

Note: I am submitting this as a Slurm job.

```python
import dask.distributed
import joblib
from dask_jobqueue import SLURMCluster
from surprise.model_selection import GridSearchCV

N_CORES = 24


def grid_search_instance(instance, params, dataset, measures, folds, label, n_jobs=N_CORES):
    """
    Grid search cross-validation to find the best params for the recommender algorithm.

    :param label: Recommender string name
    :param instance: Recommender algorithm class
    :param params: Recommender algorithm parameter grid
    :param dataset: A dataset loaded via the Surprise Reader class
    :param measures: A list of accuracy measure names (e.g. ["rmse"])
    :param folds: Number of cross-validation folds
    :param n_jobs: Number of parallel workers to be used
    :return: A fitted GridSearchCV instance
    """
    # Spin up an adaptive Dask cluster on top of Slurm.
    cluster = SLURMCluster(cores=24,
                           processes=2,
                           memory='64GB',
                           queue="nvidia_dev",
                           project="NMF",
                           name=label,
                           log_directory='logs/slurm',
                           walltime='00:15:00')
    # cluster.scale(2)
    cluster.adapt(minimum=1, maximum=360)
    client = dask.distributed.Client(cluster)
    print(client)
    print(cluster.job_script())
    gs = GridSearchCV(instance, params, measures=measures, cv=folds, joblib_verbose=100)
    # Route joblib's parallelism through the Dask cluster.
    with joblib.parallel_backend("dask"):
        print(client)
        gs.fit(dataset)
    return gs
```
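
For context, here is a hypothetical call to the function above, using Surprise's built-in NMF and MovieLens data; the parameter grid shown is illustrative, not the full 8-parameter grid:

```python
from surprise import Dataset, NMF

# Illustrative usage of grid_search_instance (hypothetical grid,
# far smaller than the full 3^8 search described above).
data = Dataset.load_builtin("ml-100k")
param_grid = {
    "n_factors": [50, 100, 150],
    "biased": [True],
}
gs = grid_search_instance(NMF, param_grid, data,
                          measures=["rmse"], folds=3, label="NMF")
print(gs.best_score["rmse"], gs.best_params["rmse"])
```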

And again, thanks for Surprise and for spending your time reading this question.

NicolasHug commented 2 years ago

Hi all, sorry for the late reply. Surprise doesn't support multi-node training, sorry.
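
For reference, what Surprise does support is single-machine parallelism: GridSearchCV takes an n_jobs argument and runs the candidate fits in parallel via joblib on one machine (n_jobs=-1 uses all available cores). A minimal sketch:

```python
from surprise import Dataset, NMF
from surprise.model_selection import GridSearchCV

# Single-machine parallel grid search: all fits run on the local
# machine's cores, no multi-node cluster involved.
data = Dataset.load_builtin("ml-100k")
param_grid = {"n_factors": [50, 100, 150], "biased": [True]}

gs = GridSearchCV(NMF, param_grid, measures=["rmse"], cv=3, n_jobs=-1)
gs.fit(data)
print(gs.best_score["rmse"], gs.best_params["rmse"])
```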