Closed Shine226 closed 3 years ago
@Shine226 I can see in the notebook that you are including file reading and loading as part of the experiment. Are you doing the same in the script?
@cegme Yes. To run the experiment, you need to generate the data files first. The timing test then loads those data files.
Loading and generating data should not be part of the experiment.
@cegme The time is measured on only one line of code, 'labeled_df.get_subgroup_trends_1lev([all_pearson_obj])'. The measurement does not include loading or generating data.
Is timeit.Timer the same as %timeit? To make them more comparable, I think you should move the functions from the notebook into a module (local to this experiment) and import that module into both the notebook and the script file, so that you can be sure you're running exactly the same code in both cases.
On the coding side, they are not the same. In the Python script, timeit.Timer needs the parameters passed in through a test function:
def test(labeled_df, all_pearson_obj):
    labeled_df.get_subgroup_trends_1lev([all_pearson_obj])

def test_scalability(...):
    ...
    t = timeit.Timer(lambda: test(labeled_df, all_pearson_obj))
    repeat, number = 10, 100
    times = t.repeat(repeat, number)
    ...

if __name__ == '__main__':
    ...
    test_scalability(...)
    ...
In the Jupyter notebook, the code looks like this:
def test_scalability(...):
    ...
    time = %timeit -or10 -n100 -q labeled_df.get_subgroup_trends_1lev([all_pearson_obj])
    ...
I wrote the notebook version first and used %timeit in it. Then I moved the functions into the script, and put the remaining code into the script's main block. Since we want to test with timeit.Timer, I changed the code in the script from %timeit to the timeit.Timer style. The last step is to run the Python script with the regular 'python xxxxx.py' command.
For the first point in this issue (the time measured in the script being about 100 times slower than in the notebook): the result from timeit.Timer in the script is the total time for all loops in a round, and it wasn't divided by the number of loops (we set n = 100), whereas %timeit reports the time per loop.
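A minimal sketch of that normalization, in case it helps. The `workload` function here is a hypothetical stand-in for the real `labeled_df.get_subgroup_trends_1lev([all_pearson_obj])` call, which needs the generated data files; only the `timeit` calls are meant to mirror the script.

```python
import timeit


def workload():
    # Hypothetical stand-in for
    # labeled_df.get_subgroup_trends_1lev([all_pearson_obj]).
    return sum(i * i for i in range(1000))


repeat, number = 10, 100
timer = timeit.Timer(workload)

# Timer.repeat returns one TOTAL time per round; each round covers
# `number` calls of the timed function.
totals = timer.repeat(repeat=repeat, number=number)

# Dividing by `number` gives per-call times, which is the unit that
# %timeit reports ("per loop").
per_loop = [total / number for total in totals]

print(f"best total per round: {min(totals):.6f} s")
print(f"best per loop:        {min(per_loop):.8f} s")
```

With n = 100, the un-normalized totals are 100 times the per-loop values, which matches the gap seen between the script and the notebook.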
For the second point, run timeit with r = 1 and n = 100 so that only one timing result is returned.
The time measured in the script is about 100 times slower than the time from running %timeit in the notebook.
The first round is about 10 times slower than the remaining 9 rounds. This contributes to a large standard deviation.
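One way to check how much the first round affects the spread is to compare the standard deviation over all rounds with the standard deviation when the first round is dropped. This is only a sketch: `spread` is a hypothetical helper, and `workload` is a stand-in for the real call, so it may not reproduce the slow first round seen in the experiment.

```python
import statistics
import timeit


def spread(per_loop_times):
    """Return (stdev over all rounds, stdev with the first round dropped)."""
    return (
        statistics.stdev(per_loop_times),
        statistics.stdev(per_loop_times[1:]),
    )


def workload():
    # Hypothetical stand-in for
    # labeled_df.get_subgroup_trends_1lev([all_pearson_obj]).
    return sorted(range(2000), reverse=True)


repeat, number = 10, 100
totals = timeit.Timer(workload).repeat(repeat=repeat, number=number)

# Normalize each round's total to a per-loop time before comparing.
per_loop = [total / number for total in totals]

all_sd, rest_sd = spread(per_loop)
print(f"first round per loop:   {per_loop[0]:.8f} s")
print(f"stdev, all rounds:      {all_sd:.8f} s")
print(f"stdev, rounds 2 to {repeat}: {rest_sd:.8f} s")
```

If the second figure is much smaller than the first, the first round is what drives the large standard deviation.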
The code and results are stored in the scalability_wiggum branch under research_notebooks/scalability_test. Working folder: link. Python script for measuring time: link.