GabeMillikan / Wizard101Trivia

Easily answer wizard101 trivia questions
10 stars 1 forks source link

Inefficiencies in the code #1

Closed volovikariel closed 2 years ago

volovikariel commented 2 years ago

Heyo!

Currently quite busy so haven't taken the time to write a PR yet, but still wanted to write down some things I noticed that could maybe be improved.

Minor speedup for storing the fuzzy ratios

You're currently only caring about the top 3 fuzzy matching searches, and yet you're going through the trouble of going through each and every Quiz question, calculating its ratio, then sorting the results, and taking the first three elements. This results in O(n*log(n)) time - where n is the number of quiz question/answer pairs you have.

You could use a max heap of size 3, and this'll make it so that as you go through all of the entries and calculate their fuzzy ratios, it would only keep the top 3 ranking ones. This would result in O(n*log(k)) time, where k is a constant (in this case 3).

You could also just use selection sort 3 times (just scan through the array finding the minimum and put it to the 0th index, then the 1st, etc.). This would be 3*O(n) - it's simpler than using heaps if you'd rather do it this way.

Potential minor speedup for matching the copied string to its value

I'm looking into a way to do some pre-processing on the Quiz questions to be able to tell which one matches the copied text the quickest. It'd be nice to have very quick querying by using the fact that we're always considering substrings.

Doing a fuzzy search will end up having to go through every string in its entirety, checking for substrings will simply take two pointers substring distance apart and scan through the strings (in a clever way) doing very few comparisons (think Z-algorithm or something of the sort) - but I think that given our restrictions, we could go much faster. If I find anything, I'll let you know.

Variable naming

I think that being able to quickly tell what a variable represents is a good thing.

Your variables all make sense with context, e.g: s and q representing sentence (or subsequence/substring) and question. But having it be str or question/qst would have it make sense without any context. w for window is another example. On line 814, the w.after(ms, f) is more cryptic than I would personally like.

File structure

Perhaps putting the QUIZZES dictionary near the bottom of the file would make it easier to quickly load up the file and look at the code? That way, you wouldn't need to scroll through it all. Just a thought.

I don't write Python, so I can't provide any input for anything related to that, I hope nonetheless that I was able to provide some good help!

GabeMillikan commented 2 years ago

Thanks for the feedback! I'll definitely go through and rename the variables to be more useful, along with commenting to explain what's going on. I'll edit the fuzzy search so that it only keeps track of the top three results, but I can't think of a good way to optimize it past that at the moment.

Currently, the giant QUIZZES dict is embedded directly in the file (instead of a separate .json file) because I wanted people to be able to download just the .pyw and run it anywhere, without having to worry about a directory structure. What do you think about a solution that makes an HTTP request and retrieves a json file from GitHub? That would simplify the file significantly, and would allow the program to auto-update the supported quizzes list.

volovikariel commented 2 years ago

What do you think about a solution that makes an HTTP request and retrieves a json file from GitHub?

I think that'd work. Requiring an internet connection won't hinder them from somehow solving it while offline (the user would always need to have an internet connection to attempt the quizzes anyways).

However, it would be nice to minimize the amount of downloads. It's always nice to have a small application that just works! Here's an option I can think of, but there's a lot of other ones which I don't know of - not sure what's best practice here :thinking:

You could start by checking if the JSON file exists locally, if it doesn't, you'd download the data and create the file. If the file did already exist, you'd check its version number and compare it to that of the one hosted elsewhere. If the version numbers are different, then there'd be a prompt (or you could simply download it directly) and it'd overwrite the file found locally.

Sidenote

Why do we do

import tkinter as tk
from tkinter import ttk

Could you put them both on one line? Not sure if that's possible in Python. I find it a tad odd that we alias tkinter to tk and then use tkinter rather than tk.

GabeMillikan commented 2 years ago

As for the from tkinter import ttk: It's not possible in Python to simplify those statements any further. Since ttk is only used twice, I'll remove the import and use tk.ttk instead, since that's probably easier to read for most people.

I like your version checking idea, I think I'll create two new files in this directory quiz_data/version.txt and quiz_data/quizzes.json. The script can download version.txt which will only be a few bytes like 1.0 and it can compare to whatever it has locally. On the first install, this will have to be a blocking operation, but everything after that it can happen completely in the background and can even silently fail while continuing to use an outdated lookup table. I'll implement these changes.

GabeMillikan commented 2 years ago

A change to this repository should be reflected in the clients, so long as the version number is changed here.

I think that everything mentioned in this issue is now in working order so I'll close it, but will re-open if anything is found to be broken. Thanks again for the feedback!

volovikariel commented 2 years ago

Perhaps having the version number be at the top of quizzes.json file (the first line) or have it be present in the JSON file's name would make things easier for you (e.g: quizzes_v_1_0.json).

I've been looking into the quick substring matching and I've been making progress. Will update when I got a concrete algorithm!

Nice work with the updates :+1:

GabeMillikan commented 2 years ago

It would be easier, yes, but then the client would have to download the entire quizzes file every time that the program started up. The reason for having the version in a separate file is that the client can first check the version, then determine if it needs to download the other (larger) quizzes.json file. Granted, the file is less than 100 kb so it shouldn't really matter, but it's the principal.

P.S. I added support for some new trivia and the program successfully updated itself in the background

volovikariel commented 2 years ago

Ahh - I didn't consider the need to actually fetch the entire quizzes.json file before being able to check its version. Maybe you could have a separate end point, but at that point, might as well go with what you have now until a better solution comes to mind.

Awesome stuff with the auto update!

volovikariel commented 2 years ago

I've done some research into the "Potential minor speedup for matching the copied string to its value" of my initial issue posting.

There are ways to achieve massive speedups, but it will most probably slow things down for such a small dictionary/sentence length pairing (because of the overhead introduced by using these data structures).

If you like to write such applications, and plan on making a general trivia one in the future, and want to have a database with millions of entries and very long sentence lengths, here are some keywords you can look into and some video links+timestamps so that you can write (and understand) a program that will do things incredibly quickly for you. (I personally didn't even know that we can get such fast lookup times and such low space usage was possible before looking into all of this!)

Our problem is actually extremely similar to that of finding people who have a DNA subsequence that match one another. It's cool stuff but it's a lot of information to take in!

The general problem description: finding all substrings within a single document (which can be extended to finding all substrings in a range of documents). http://courses.csail.mit.edu/6.851/spring12/lectures/L16.html?notes=6 @55:40

Ladder decomposition/jump pointers: http://courses.csail.mit.edu/6.851/spring12/lectures/L15.html?notes=7 @52:10

Suffix Array goodness: http://courses.csail.mit.edu/6.851/spring12/lectures/L18.html @24:00 - ish

Keywords to google and learn about if interested:

Basics Tries/Compressed Tries, (Generalized) Suffix Trees, Suffix Arrays

More advanced Van Emde Boas trees, Range Minimum Query, ladder decomposition, Cartesian trees, lowest common ancestor(s) problem, level ancestors

Hope you have a great holidays, and I wish you good fortune!