Unit Test Runtime

chukarsten commented 3 years ago

This spike is intended to track the runtime of the unit tests. Not sure whether it's worth tracking Linux vs. Windows separately here. Unit tests currently take ~20 minutes to complete, both locally and in the CircleCI checks. Even though we're moving to GitHub Actions, that shouldn't make the problem any better or any worse.

The intended outcome of this spike is the following:

A documented methodology for profiling the unit tests.
Unit test profiling results posted (on a Quip?) as a baseline. (I know Steve Link showed a profiling tool that I've used before that is pretty visual and useful.)
Recommendations for the top 5 unit tests whose performance we should address.
A quick discussion concerning the above.
Issues filed for the agreed results of the above.
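
As one possible starting point for the profiling methodology, here is a minimal sketch that uses the standard-library cProfile and pstats modules. It is not the visual tool mentioned above; the helper name profile_callable and the stand-in function slow_test_body are hypothetical and only illustrate wrapping a slow test body to see where the time goes.

```python
import cProfile
import io
import pstats


def profile_callable(func, *args, top=5, **kwargs):
    """Run func under cProfile; return (result, report of the `top` entries by cumulative time)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()


def slow_test_body():
    # Hypothetical stand-in for the body of a slow unit test.
    return sum(i * i for i in range(200_000))


if __name__ == "__main__":
    _, report = profile_callable(slow_test_body)
    print(report)
```

Sorting by cumulative time tends to surface the high-level calls (fit, search, plotting) that dominate a test, which is usually what we'd want to know before deciding what to mock or shrink.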

dsherry commented 3 years ago

@angela97lin I remember you did some unit test runtime profiling. Do you have any code to share for that, and/or any thoughts?

Primary goal: speed up CI runtime. So I'd recommend starting with the longest-running CI job (build_conda_package) and finding ways to speed that up.

angela97lin commented 3 years ago

@dsherry Yup, I had to do some unit test runtime profiling for the WW PRs. I ended up writing a simple script that parses the XML files generated in our artifacts (example here) and compares the one on the branch with the one in main. Then I just printed out the difference in time between the two for each test. Here's the script. It's super messy and specific to what I was doing, but maybe a good start for this:

import xml.etree.ElementTree as ET


def parse_xml(xmlfile):
    """Return {test name: seconds} for every test case that took longer than 1 second."""
    root = ET.parse(xmlfile).getroot()
    results = {}
    # iter() finds <testcase> at any depth, so this works whether the root
    # element is <testsuite> or <testsuites>.
    for test_case in root.iter('testcase'):
        name = test_case.attrib['name']
        time = float(test_case.attrib['time'])
        if time > 1:
            results[name] = time
    return results


def compare_results(main_results, other_results):
    """Return {test name: other time - main time} for tests present in both runs."""
    timing_results_dict = {}
    for test_case_name, main_time in main_results.items():
        # Skip tests missing from the other run (removed, renamed, or under the 1 second cutoff).
        if test_case_name in other_results:
            timing_results_dict[test_case_name] = other_results[test_case_name] - main_time
    return timing_results_dict


def main():
    main_results = parse_xml('main.xml')
    woodwork_results = parse_xml('woodwork_36.xml')
    time_diffs = compare_results(main_results, woodwork_results)
    # Sort ascending by time difference, so the biggest slowdowns print last.
    for diff in sorted(time_diffs.items(), key=lambda x: x[1]):
        print(diff)


if __name__ == "__main__":
    main()

freddyaboulton commented 3 years ago

I created a dummy draft PR to identify the 25 longest-running unit tests on both Windows and Linux. I ran it twice to see if the measurements were stable. Here are the results:

Windows 3.7

FIRST RUN
========================== slowest 25 test durations ==========================
160.21s call     evalml/tests/automl_tests/test_automl_dask.py::TestAutoMLSearchDask::test_automl
150.91s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_graph_partial_dependence_multiclass
125.13s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_partial_dependence_more_categories_than_grid_resolution
86.23s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_partial_dependence_multiclass
84.73s call     evalml/tests/automl_tests/test_automl.py::test_automl_tuner_exception
75.75s call     evalml/tests/automl_tests/test_automl.py::test_automl_best_pipeline
72.69s call     evalml/tests/automl_tests/test_automl_dask.py::TestAutoMLSearchDask::test_automl_max_iterations
65.64s call     evalml/tests/pipeline_tests/test_pipelines.py::test_targets_data_types_classification_pipelines[float64-binary-np]
59.10s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_partial_dependence_datetime[binary]
58.59s call     evalml/tests/component_tests/test_stacked_ensemble_classifier.py::test_stacked_fit_predict_classification[binary]
53.48s call     evalml/tests/automl_tests/test_automl.py::test_max_batches_works[regression-False-20]
52.80s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_partial_dependence_datetime[multiclass]
52.51s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_partial_dependence_datetime[regression]
52.30s call     evalml/tests/model_understanding_tests/test_permutation_importance.py::test_fast_permutation_importance_matches_sklearn_output[DagReuseFeatures-parameters9]
50.31s call     evalml/tests/model_understanding_tests/test_permutation_importance.py::test_fast_permutation_importance_matches_sklearn_output[LinearPipelineTwoEncoders-parameters3]
45.72s call     evalml/tests/component_tests/test_stacked_ensemble_classifier.py::test_stacked_fit_predict_classification[multiclass]
44.53s call     evalml/tests/model_understanding_tests/test_permutation_importance.py::test_fast_permutation_importance_matches_sklearn_output[DagTwoEncoders-parameters8]
44.17s call     evalml/tests/model_understanding_tests/test_permutation_importance.py::test_fast_permutation_importance_matches_sklearn_output[LinearPipelineWithTextFeatures-parameters4]
43.03s call     evalml/tests/automl_tests/test_iterative_algorithm.py::test_iterative_algorithm_results[False]
41.73s call     evalml/tests/model_understanding_tests/test_permutation_importance.py::test_fast_permutation_importance_matches_sklearn_output[LinearPipelineWithImputer-parameters1]
41.24s call     evalml/tests/automl_tests/test_automl.py::test_max_batches_works[binary-False-20]
41.14s call     evalml/tests/component_tests/test_stacked_ensemble_regressor.py::test_stacked_fit_predict_regression
39.99s call     evalml/tests/component_tests/test_utils.py::test_scikit_learn_wrapper
39.98s call     evalml/tests/automl_tests/test_automl_search_classification.py::test_automl_multiclass_nonlinear_pipeline_search_more_iterations
39.95s call     evalml/tests/automl_tests/test_automl.py::test_max_batches_works[regression-True-20]

SECOND RUN

========================== slowest 25 test durations ==========================
282.88s call     evalml/tests/automl_tests/test_automl_dask.py::TestAutoMLSearchDask::test_automl
200.48s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_graph_partial_dependence_multiclass
184.32s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_partial_dependence_more_categories_than_grid_resolution
104.49s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_partial_dependence_multiclass
92.91s call     evalml/tests/automl_tests/test_automl.py::test_automl_best_pipeline
92.24s call     evalml/tests/automl_tests/test_automl.py::test_automl_tuner_exception
87.31s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_partial_dependence_datetime[multiclass]
86.06s call     evalml/tests/automl_tests/test_automl_dask.py::TestAutoMLSearchDask::test_automl_max_iterations
85.57s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_partial_dependence_datetime[binary]
82.75s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_partial_dependence_datetime[regression]
62.02s call     evalml/tests/automl_tests/test_automl.py::test_max_batches_works[regression-False-20]
60.87s call     evalml/tests/model_understanding_tests/test_permutation_importance.py::test_fast_permutation_importance_matches_sklearn_output[LinearPipelineTwoEncoders-parameters3]
60.50s call     evalml/tests/component_tests/test_stacked_ensemble_classifier.py::test_stacked_fit_predict_classification[binary]
59.69s call     evalml/tests/model_understanding_tests/test_permutation_importance.py::test_fast_permutation_importance_matches_sklearn_output[DagReuseFeatures-parameters9]
59.35s call     evalml/tests/pipeline_tests/test_pipelines.py::test_targets_data_types_classification_pipelines[Int64-binary-pd]
55.71s call     evalml/tests/model_understanding_tests/test_graphs.py::test_jupyter_graph_check
55.21s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_graph_partial_dependence
53.26s call     evalml/tests/component_tests/test_stacked_ensemble_regressor.py::test_stacked_fit_predict_regression
51.78s call     evalml/tests/automl_tests/test_iterative_algorithm.py::test_iterative_algorithm_passes_pipeline_params[False]
51.50s call     evalml/tests/component_tests/test_stacked_ensemble_classifier.py::test_stacked_fit_predict_classification[multiclass]
50.91s call     evalml/tests/automl_tests/test_iterative_algorithm.py::test_iterative_algorithm_passes_pipeline_params[True]
50.53s call     evalml/tests/model_understanding_tests/test_permutation_importance.py::test_fast_permutation_importance_matches_sklearn_output[DagTwoEncoders-parameters8]
50.10s call     evalml/tests/model_understanding_tests/test_permutation_importance.py::test_fast_permutation_importance_matches_sklearn_output[LinearPipelineWithImputer-parameters1]
49.42s call     evalml/tests/automl_tests/test_iterative_algorithm.py::test_iterative_algorithm_results[False]
49.42s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_graph_two_way_partial_dependence

Linux Unit tests

FIRST RUN
========================== slowest 25 test durations ===========================
133.58s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_graph_partial_dependence_multiclass
131.61s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_partial_dependence_more_categories_than_grid_resolution
112.35s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_partial_dependence_multiclass
102.16s call     evalml/tests/automl_tests/test_automl.py::test_automl_best_pipeline
80.70s call     evalml/tests/automl_tests/test_automl_dask.py::TestAutoMLSearchDask::test_automl
67.40s call     evalml/tests/automl_tests/test_iterative_algorithm.py::test_iterative_algorithm_passes_pipeline_params[False]
63.33s call     evalml/tests/automl_tests/test_automl.py::test_max_batches_works[regression-False-20]
62.44s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_partial_dependence_datetime[multiclass]
61.97s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_partial_dependence_datetime[regression]
61.42s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_partial_dependence_datetime[binary]
57.55s call     evalml/tests/model_understanding_tests/test_permutation_importance.py::test_fast_permutation_importance_matches_sklearn_output[LinearPipelineTwoEncoders-parameters3]
57.42s call     evalml/tests/model_understanding_tests/test_permutation_importance.py::test_fast_permutation_importance_matches_sklearn_output[DagReuseFeatures-parameters9]
55.39s call     evalml/tests/model_understanding_tests/test_graphs.py::test_jupyter_graph_check
54.22s call     evalml/tests/automl_tests/test_automl.py::test_max_batches_works[binary-False-20]
53.05s call     evalml/tests/automl_tests/test_iterative_algorithm.py::test_iterative_algorithm_passes_pipeline_params[True]
50.63s call     evalml/tests/automl_tests/test_iterative_algorithm.py::test_iterative_algorithm_results[True]
49.44s call     evalml/tests/model_understanding_tests/test_graphs.py::test_cost_benefit_matrix_vs_threshold[np]
49.14s call     evalml/tests/automl_tests/test_iterative_algorithm.py::test_iterative_algorithm_results[False]
49.13s call     evalml/tests/automl_tests/test_automl.py::test_automl_ensembling_false
48.80s call     evalml/tests/model_understanding_tests/test_graphs.py::test_binary_objective_vs_threshold[np]
48.41s call     evalml/tests/model_understanding_tests/test_permutation_importance.py::test_fast_permutation_importance_matches_sklearn_output[DagTwoEncoders-parameters8]
47.64s call     evalml/tests/automl_tests/test_automl_dask.py::TestAutoMLSearchDask::test_automl_max_iterations
47.55s call     evalml/tests/model_understanding_tests/test_permutation_importance.py::test_fast_permutation_importance_matches_sklearn_output[LinearPipelineWithTextFeatures-parameters4]
47.39s call     evalml/tests/model_understanding_tests/test_permutation_importance.py::test_fast_permutation_importance_matches_sklearn_output[LinearPipelineWithImputer-parameters1]
47.04s call     evalml/tests/automl_tests/test_automl.py::test_max_batches_works[regression-True-20]

SECOND RUN

========================== slowest 25 test durations ===========================
117.78s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_partial_dependence_more_categories_than_grid_resolution
114.98s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_graph_partial_dependence_multiclass
101.67s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_partial_dependence_multiclass
87.55s call     evalml/tests/automl_tests/test_automl.py::test_automl_best_pipeline
65.42s call     evalml/tests/automl_tests/test_automl_dask.py::TestAutoMLSearchDask::test_automl
57.63s call     evalml/tests/automl_tests/test_automl.py::test_max_batches_works[regression-False-20]
57.36s call     evalml/tests/automl_tests/test_iterative_algorithm.py::test_iterative_algorithm_passes_pipeline_params[False]
53.32s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_partial_dependence_datetime[binary]
53.08s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_partial_dependence_datetime[multiclass]
52.75s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_partial_dependence_datetime[regression]
50.27s call     evalml/tests/model_understanding_tests/test_permutation_importance.py::test_fast_permutation_importance_matches_sklearn_output[DagReuseFeatures-parameters9]
48.90s call     evalml/tests/model_understanding_tests/test_graphs.py::test_jupyter_graph_check
47.88s call     evalml/tests/model_understanding_tests/test_permutation_importance.py::test_fast_permutation_importance_matches_sklearn_output[LinearPipelineTwoEncoders-parameters3]
47.15s call     evalml/tests/automl_tests/test_automl.py::test_max_batches_works[binary-False-20]
46.17s call     evalml/tests/automl_tests/test_iterative_algorithm.py::test_iterative_algorithm_passes_pipeline_params[True]
44.58s call     evalml/tests/model_understanding_tests/test_graphs.py::test_cost_benefit_matrix_vs_threshold[np]
44.30s call     evalml/tests/automl_tests/test_automl.py::test_automl_ensembling_false
43.54s call     evalml/tests/automl_tests/test_iterative_algorithm.py::test_iterative_algorithm_results[True]
42.98s call     evalml/tests/automl_tests/test_iterative_algorithm.py::test_iterative_algorithm_results[False]
42.38s call     evalml/tests/model_understanding_tests/test_permutation_importance.py::test_fast_permutation_importance_matches_sklearn_output[DagTwoEncoders-parameters8]
42.18s call     evalml/tests/automl_tests/test_automl.py::test_max_batches_works[regression-True-20]
41.51s call     evalml/tests/model_understanding_tests/test_graphs.py::test_binary_objective_vs_threshold[np]
40.86s call     evalml/tests/model_understanding_tests/test_graphs.py::test_cost_benefit_matrix_vs_threshold[pd]
40.20s call     evalml/tests/model_understanding_tests/test_permutation_importance.py::test_fast_permutation_importance_matches_sklearn_output[LinearPipelineWithImputer-parameters1]
39.42s call     evalml/tests/automl_tests/test_automl_dask.py::TestAutoMLSearchDask::test_automl_max_iterations

Although the Windows unit tests are slower in general, the same unit tests take the longest on both Windows and Linux: AutoML with Dask, partial dependence and permutation importance, and some of the iterative algorithm tests.
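
To make that comparison less manual, here is a rough sketch that parses "slowest durations" reports like the ones above and compares two runs. The function names and the regular expression are my own assumptions about the line format; the sample lines are copied from the first Windows and Linux runs.

```python
import re

# Assumed line format: "160.21s call     path/to/test_file.py::test_name"
DURATION_LINE = re.compile(r"^\s*(\d+(?:\.\d+)?)s\s+(call|setup|teardown)\s+(\S+)\s*$")

WINDOWS_SAMPLE = """
160.21s call     evalml/tests/automl_tests/test_automl_dask.py::TestAutoMLSearchDask::test_automl
150.91s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_graph_partial_dependence_multiclass
125.13s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_partial_dependence_more_categories_than_grid_resolution
"""

LINUX_SAMPLE = """
133.58s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_graph_partial_dependence_multiclass
131.61s call     evalml/tests/model_understanding_tests/test_partial_dependence.py::test_partial_dependence_more_categories_than_grid_resolution
80.70s call     evalml/tests/automl_tests/test_automl_dask.py::TestAutoMLSearchDask::test_automl
"""


def parse_durations(report_text):
    """Return {test id: seconds} for the 'call' phase lines of a durations report."""
    durations = {}
    for line in report_text.splitlines():
        match = DURATION_LINE.match(line)
        if match and match.group(2) == "call":
            durations[match.group(3)] = float(match.group(1))
    return durations


def seconds_per_file(durations):
    """Sum durations by test file (the part of the test id before '::')."""
    totals = {}
    for test_id, seconds in durations.items():
        test_file = test_id.split("::", 1)[0]
        totals[test_file] = totals.get(test_file, 0.0) + seconds
    return totals


if __name__ == "__main__":
    windows = parse_durations(WINDOWS_SAMPLE)
    linux = parse_durations(LINUX_SAMPLE)
    # Tests that show up in the slowest list on both platforms.
    for test_id in sorted(set(windows) & set(linux)):
        print(f"{windows[test_id] / linux[test_id]:.2f}x  {test_id}")
    for test_file, seconds in seconds_per_file(windows).items():
        print(f"{seconds:.2f}s  {test_file}")
```

Lines that don't match the pattern are skipped, so the raw CI log can be pasted in as-is. Summing by file is just one way to see which test modules dominate the runtime.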

These are some next steps I think we should do based on this:

[ ] Profile these tests and figure out why they take so long and propose/implement changes for speeding them up
[ ] Investigate why these tests take so much longer on our CI workers than locally, e.g. test_graph_partial_dependence_multiclass takes 40 seconds on my laptop as opposed to ~120 on the workers
[ ] After addressing the points above, is it worth increasing the number of pytest workers? - @gsheni let us know that GitHub workers only have two cores. This might explain why the tests take much longer on GitHub than they do locally. It also rules out the possibility of increasing the number of workers. But maybe if we set -n 2 the tests will run faster?
[ ] Split the tests into separate runs and combine the coverage reports.
[ ] Can we "productionalize" @angela97lin's script to keep track of unit test duration over time? Similar to what codecov does for coverage.
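
For the last item, one possible shape for tracking durations over time is sketched below. It assumes JUnit-style XML like the test artifacts mentioned earlier in the thread; the function names, the CSV layout, the 5 second threshold, and the sample XML are all hypothetical, not something we have in the repo today.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample reports; real ones would come from the CI artifacts.
BASELINE_XML = """<testsuites><testsuite name="pytest">
<testcase classname="tests.test_a" name="test_slow" time="40.0"/>
<testcase classname="tests.test_a" name="test_fast" time="1.0"/>
</testsuite></testsuites>"""

CURRENT_XML = """<testsuites><testsuite name="pytest">
<testcase classname="tests.test_a" name="test_slow" time="52.5"/>
<testcase classname="tests.test_a" name="test_fast" time="1.2"/>
</testsuite></testsuites>"""


def read_durations(junit_xml_text):
    """Return {"classname::name": seconds} for every <testcase> in the report."""
    root = ET.fromstring(junit_xml_text)
    return {
        f"{case.get('classname', '')}::{case.get('name')}": float(case.get("time", 0.0))
        for case in root.iter("testcase")
    }


def append_history(history_file, run_label, durations):
    """Append one CSV row per test: run label, test id, seconds."""
    writer = csv.writer(history_file)
    for test_id, seconds in sorted(durations.items()):
        writer.writerow([run_label, test_id, f"{seconds:.2f}"])


def regressions(baseline, current, threshold_seconds=5.0):
    """Return {test id: slowdown} for tests that got slower by more than the threshold."""
    return {
        test_id: current[test_id] - baseline[test_id]
        for test_id in current
        if test_id in baseline and current[test_id] - baseline[test_id] > threshold_seconds
    }


if __name__ == "__main__":
    baseline = read_durations(BASELINE_XML)
    current = read_durations(CURRENT_XML)
    history = io.StringIO()  # stand-in for a CSV file kept between CI runs
    append_history(history, "main", baseline)
    append_history(history, "branch", current)
    print(history.getvalue())
    print(regressions(baseline, current))
```

A CI job could fail, or just post a comment, whenever regressions() returns anything. Where the history file would live between runs is left open here.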

alteryx / evalml

Unit Test Runtime #1815
