247-ai / FlashML

FlashML from [24]7.ai: A library for automated model training on Apache Spark
Apache License 2.0

Introduce option to override default parallelism #20

Closed samikrc closed 4 years ago

samikrc commented 4 years ago

Currently the default parallelism is hard-coded to 3 and can't be overridden. It is used in a number of places, which slows down training. Introduce a config parameter, experiment.parallelism, where a number can be specified; that value will then be used in all of those places.
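A minimal sketch of how such an override could behave (the config key experiment.parallelism is from this issue; the dict shape, helper name, and validation below are illustrative assumptions, not FlashML's actual implementation):

```python
# Illustrative sketch: resolve an optional parallelism override from an
# experiment config, falling back to the previously hard-coded default of 3.
# The config-dict shape and helper name are assumptions, not FlashML code.

DEFAULT_PARALLELISM = 3

def resolve_parallelism(config: dict) -> int:
    """Return the user-specified experiment.parallelism, or the default."""
    value = config.get("experiment.parallelism", DEFAULT_PARALLELISM)
    if not isinstance(value, int) or value < 1:
        raise ValueError(f"experiment.parallelism must be a positive integer, got {value!r}")
    return value

# An override takes effect everywhere the resolved value is consulted.
print(resolve_parallelism({}))                             # -> 3 (default)
print(resolve_parallelism({"experiment.parallelism": 6}))  # -> 6 (override)
```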

samikrc commented 4 years ago

Merged pull request with this feature. Closing.

samikrc commented 4 years ago

Although the level of parallelism can now be controlled by the user, it looks like the number of threads launched in, say, a CV experiment equals the number of variations of a parameter. It appears that not all of the jobs are getting generated, so the configured parallelism is not fully used.

For example, with a parallelism of 6 and the following config for SVM CV:

      "svm": 
      {
        "plattScalingEnabled": true,
        "regparam": [0, 0.001, 0.005, 0.01],
        "maxiter": [1000],
        "standardization": [true]
      },

I am seeing only 4 threads getting launched:

20/08/17 11:12:57 INFO tuning.CrossValidatorCustom: Starting cross-validation runs.
20/08/17 11:12:58 INFO tuning.CrossValidatorCustom: Training CV set 1 of 5 with parameter map: maxIter=>1000/regParam=>0.01/standardization=>true
20/08/17 11:12:58 INFO tuning.CrossValidatorCustom: Training CV set 1 of 5 with parameter map: maxIter=>1000/regParam=>0.001/standardization=>true
20/08/17 11:12:58 INFO tuning.CrossValidatorCustom: Training CV set 1 of 5 with parameter map: maxIter=>1000/regParam=>0.0/standardization=>true
20/08/17 11:12:58 INFO tuning.CrossValidatorCustom: Training CV set 1 of 5 with parameter map: maxIter=>1000/regParam=>0.005/standardization=>true
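One plausible explanation consistent with the log (all four lines are for "CV set 1 of 5") is that jobs are submitted one fold at a time, so the pool never sees more than one fold's worth of parameter maps at once. A small arithmetic illustration of that hypothesis (not FlashML's actual scheduler):

```python
# Hypothetical illustration: if CV jobs are submitted per fold, the effective
# parallelism is capped by the size of the parameter grid, not the setting.
parallelism = 6

# Grid from the SVM config above: 4 regparam x 1 maxiter x 1 standardization.
grid = {
    "regparam": [0, 0.001, 0.005, 0.01],
    "maxiter": [1000],
    "standardization": [True],
}
num_param_maps = 1
for values in grid.values():
    num_param_maps *= len(values)

# Submitting one fold at a time exposes only num_param_maps concurrent jobs,
# so the two extra worker slots sit idle.
threads_used_per_fold = min(parallelism, num_param_maps)
print(num_param_maps, threads_used_per_fold)  # -> 4 4

# Submitting all folds' jobs together would expose num_folds * num_param_maps
# jobs and let all 6 configured workers run.
num_folds = 5
total_jobs = num_folds * num_param_maps
print(total_jobs, min(parallelism, total_jobs))  # -> 20 6
```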
samikrc commented 4 years ago

Added issue #22 for the above problem. Closing this issue, since it was concerned only with the implementation of user-defined parallelism.