apple / turicreate

Turi Create simplifies the development of custom machine learning models.
BSD 3-Clause "New" or "Revised" License

Evaluating Model Takes a Long Time #1785

Open nseidl opened 5 years ago

nseidl commented 5 years ago

Hi,

Why does it take almost twice as long to evaluate the model on 2k samples as it does to train on 17k samples?

I'm using TuriCreate 5.4 (Python 3.6) on Ubuntu 16.04 with CUDA 8.0 (driver 410.79). I have access to a machine with 4 GPUs and 80 GB of RAM.

Here's my setup:

pip install turicreate
pip uninstall -y mxnet
pip install mxnet-cu80==1.1.0
echo "export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH" >> ~/.bash_profile
source ~/.bash_profile

Here are the config settings I've overridden:

tc.config.set_runtime_config('TURI_FILEIO_MAXIMUM_CACHE_CAPACITY', 70*1024*1024*1024)
tc.config.set_runtime_config('TURI_FILEIO_MAXIMUM_CACHE_CAPACITY_PER_FILE', 70*1024*1024*1024)
tc.config.set_runtime_config('TURI_NUM_GPUS', 4)
tc.config.set_runtime_config('TURI_DEFAULT_NUM_PYLAMBDA_WORKERS', 64)
tc.config.set_num_gpus(4)

I train on ~17K .jpgs (4000x3000, in SFrame), validate on ~2k .jpgs, and evaluate on ~2k .jpgs.

new_model = tc.image_classifier.create(
    train_sframe, # ~44GB
    'label',
    batch_size=128,
    max_iterations=num_iterations,
    model=base,
    validation_set=validation_sframe, # ~4.5GB
    verbose=True
)

I then evaluate the model with the handy .evaluate(...):

evaluation = new_model.evaluate(test_sframe) # ~4.5GB

Here is a log of the progress. As you can see, it takes ~55 min to train the model on 17k samples and ~95 min to evaluate it on 2k samples.

05:57:09 creating and training model
06:53:25 done training model, took 3375.67470098s
06:53:26 saving model
06:53:39 evaluating model
08:28:18 done evaluating model, took 5679.14030695s
08:29:27 saving evaluation

Why does it take almost twice as long to evaluate the model on 2k samples as it does to train on 17k samples?

(I'd like to add that I'm not hitting any resource caps, as far as I know. See the screenshot of CPU and RAM usage below; I don't have GPU usage yet.)

[Screenshot: CPU and RAM usage, 2019-04-25 9:50 AM]
srikris commented 5 years ago

Thanks @nseidl for filing such a detailed and informative bug report.

TobyRoseman commented 5 years ago

I am able to reproduce this issue using the 101-category dataset mentioned in our user guide (~18k images, 101 classes). Using the full dataset, create(...) takes about 9 minutes. Passing the full dataset to evaluate(...) takes about 40 minutes.

Also, once the deep features are extracted in evaluate(...), nothing further is printed, so no further info/progress appears after the first few minutes.

This slowness is caused by the Image Classification Evaluation change (#1335). Using the release before that change (5.2.1), evaluate(...) on the full dataset takes about 6 minutes.

Besides the time to extract the deep features, nearly all the time is spent calculating confusion matrices. Each evaluate call calculates three different confusion matrices, and each confusion matrix does 2 applies over the entire evaluation set for each label in the evaluation set. This means for 101 classes we are doing 606 full passes over the evaluation set (num confusion matrices × 2 applies × num classes = 3 × 2 × 101 = 606).
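For comparison, a confusion matrix can also be built in a single aggregation over cached predictions with the public evaluation toolkit. A rough sketch (using the new_model / test_sframe / 'label' names from the original post; this is not the internal code path evaluate(...) uses):

import turicreate as tc

# One inference pass to get predicted class labels for the evaluation set.
class_preds = new_model.predict(test_sframe, output_type='class', batch_size=128)

# Single aggregation over (target, prediction) pairs; returns an SFrame of counts.
cm = tc.evaluation.confusion_matrix(test_sframe['label'], class_preds)
print(cm)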

nseidl commented 5 years ago

> This slowness is caused by the Image Classification Evaluation change (#1335). Using the release before that change (5.2.1), evaluate(...) on the full dataset takes about 6 minutes.
>
> Besides the time to extract the deep features, nearly all the time is spent calculating confusion matrices. Each evaluate call calculates three different confusion matrices, and each confusion matrix does 2 applies over the entire evaluation set for each label in the evaluation set. This means for 101 classes we are doing 606 full passes over the evaluation set (num confusion matrices × 2 applies × num classes = 3 × 2 × 101 = 606).

So, if I don't need any confusion matrix information and am only interested in the following metrics, what would be the most performant (fastest) way of obtaining them?

desired_metrics = ['accuracy', 'auc', 'precision', 'recall', 'f1_score', 'log_loss']
evaluation_results = {}
for metric in desired_metrics:
    report('calculating metric {}'.format(metric))  # report() is a local logging helper
    evaluation_result = new_model.evaluate(test_sframe, metric=metric, verbose=True, batch_size=128)
    evaluation_results.update(evaluation_result)

I tried this, but received the following error: (let me know if I should file a different GitHub issue for this one)

08:16:10 evaluating model
08:16:11 calculating metric accuracy
Performing feature extraction on resized images...
Completed  128/2143
...
Completed 2143/2143
Traceback (most recent call last):
  File "run.py", line 102, in <module>
    evaluation_result = new_model.evaluate(test_sframe, metric=metric, verbose=True, batch_size=128)
  File "/usr/local/lib/python2.7/dist-packages/turicreate/toolkits/image_classifier/image_classifier.py", line 675, in evaluate
    predictions = metrics["predictions"]["probs"]
KeyError: 'predictions'
abhishekpratapa commented 5 years ago

@nseidl! Thanks again for the detailed report and sorry for the inconvenience.

There's a "quick" fix I put up to get you unblocked for the time being. If you compile from source and want to use the script you've posted above, it should work with this change and should be significantly faster than the current implementation for single metrics.

Another suggestion, as @TobyRoseman posted, is to use TuriCreate 5.2.1, the release before the evaluation change. This may be even faster in terms of evaluation time, though older versions may have issues that have since been resolved in later releases.

We're still investigating the matter to figure out what the right thing to do is, and we'll get back to you promptly!
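Until a proper fix ships, one possible workaround sketch (assuming the public turicreate.evaluation functions and the variable names from the original post; this is not the patch referenced above) is to run predict() once per output type and compute each desired metric from the cached predictions:

import turicreate as tc

targets = test_sframe['label']

# One inference pass for class labels, one for probability vectors
# (still far fewer passes than calling evaluate() once per metric).
class_preds = new_model.predict(test_sframe, output_type='class', batch_size=128)
prob_preds = new_model.predict(test_sframe, output_type='probability_vector', batch_size=128)

evaluation_results = {
    'accuracy': tc.evaluation.accuracy(targets, class_preds),
    'precision': tc.evaluation.precision(targets, class_preds),
    'recall': tc.evaluation.recall(targets, class_preds),
    'f1_score': tc.evaluation.f1_score(targets, class_preds),
    # auc/log_loss take probability vectors; for non-integer labels an
    # index_map from label to vector position may be needed.
    'auc': tc.evaluation.auc(targets, prob_preds),
    'log_loss': tc.evaluation.log_loss(targets, prob_preds),
}
print(evaluation_results)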

TobyRoseman commented 4 years ago

@abhishekpratapa - is this fixed in 6.0?