egbertbouman / youtube-comment-downloader

Simple script for downloading Youtube comments without using the Youtube API
MIT License
884 stars 223 forks source link

Improve search dict performance #67

Closed seppeljordan closed 3 years ago

seppeljordan commented 3 years ago

TL;DR

Provide a new implementation for search_dict that is slightly faster without sacrificing readability (hopefully).

Long-winded explaination

This PR is split into two commits. I tried to explain what I did in the commit messages. I will copy them here for reference:

    Implement unit tests and benchmark tests for search_dict

    The unit tests implemented with this commit aim to provide some
    trust in the correctness of search_dict's implementation. The
    benchmark test was done to provide some reference point for the
    performance of other implementations to compare agains.
    Implement search_dict in terms of a stack instead of recursive calls

    This commit aims to improve the performance of search_dict. To measure
    the performance in at least some capacity a previous commit introduced
    a test benchmark.

    The idea behind the new implementation is to use a stack when searching
    dictionaries instead of using recursive calls. The author would expect
    this approach to be slightly faster since slapping something at the end
    of a list is expected to be fast than allocating a new stack frame for
    a function call. The benchmark suggests that the new implementation to
    be 25% faster then the old one.

The aforementioned benchmark can be executed by installing the requirements from requirements-testing.txt and executing pytest.

$ git checkout 16b1f0fec02f30afdfb1486e10a436780b55d4a4
Note: switching to '16b1f0fec02f30afdfb1486e10a436780b55d4a4'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 16b1f0f Implement unit tests and benchmark tests for search_dict

$ pytest
========================================================================== test session starts ==========================================================================
platform linux -- Python 3.8.6, pytest-6.1.2, py-1.9.0, pluggy-0.13.1
benchmark: 3.2.2 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/me/youtube-comment-downloader
plugins: benchmark-3.2.2
collected 6 items                                                                                                                                                       

tests/test_search_dict.py ......                                                                                                                                  [100%]

----------------------------------------------------- benchmark: 1 tests ----------------------------------------------------
Name (time in us)              Min       Max      Mean   StdDev    Median     IQR  Outliers  OPS (Kops/s)  Rounds  Iterations
-----------------------------------------------------------------------------------------------------------------------------
test_benchmark_search     101.0710  227.1309  106.7756  13.0743  102.2930  0.7059  842;1346        9.3654    7207           1
-----------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
=========================================================================== 6 passed in 1.93s ===========================================================================

Now with the new implementation

$ git checkout db649f18b22c2e05402c5ec0f9ddc1ea5b8e2077
Note: switching to 'db649f18b22c2e05402c5ec0f9ddc1ea5b8e2077'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at db649f1 Implement search_dict in terms of a stack instead of recursive calls

$ pytest
========================================================================== test session starts ==========================================================================
platform linux -- Python 3.8.6, pytest-6.1.2, py-1.9.0, pluggy-0.13.1
benchmark: 3.2.2 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/me/youtube-comment-downloader
plugins: benchmark-3.2.2
collected 6 items                                                                                                                                                       

tests/test_search_dict.py ......                                                                                                                                  [100%]

---------------------------------------------------- benchmark: 1 tests ---------------------------------------------------
Name (time in us)             Min       Max     Mean   StdDev   Median     IQR   Outliers  OPS (Kops/s)  Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------
test_benchmark_search     73.4831  247.2791  81.5870  14.9009  75.0150  2.6080  1022;1972       12.2569    8795           1
---------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean
=========================================================================== 6 passed in 1.89s ===========================================================================

I know that the performance of search_dict is not the most important thing, but IMHO being slightly faster makes the program just a little bit better.

egbertbouman commented 3 years ago

I haven't considered the performance of the script at all, to be honest. In fact, I've added calls to sleep to make it go slower so as to avoid Youtube from blocking us. Then again, I agree that the implementation has improved :+1:

seppeljordan commented 3 years ago

Thank you for merging this. I noticed the sleep statements.

I guess reducing the runtime cost reduces at least energy consumption.