cubehouse / themeparks

Unofficial API for accessing ride wait times and schedules for Disneyland, Disney World, Universal Studios, and many more parks
MIT License
540 stars 126 forks

5.0 cache contains redundant data #128

Open mledford opened 6 years ago

mledford commented 6 years ago

Context: Themeparks 5.0

Describe the bug: 5.0 introduces a new caching mechanism that appears to store data redundantly, bloating the cache store. In particular, this happens when using the Disney parks object, because of the new API it requires. It also seems there was an intention to use the disneyRides table, but it is not in use at the moment. I'm wondering whether only enough couchbase metadata should be stored to know the sync status, with new data populated into other tables as it comes in, as is already done with the calendar and facilities tables. I also wonder whether the themeparks cache table should use a more defined table structure, instead of having to serialize and deserialize JSON from the store.

cubehouse commented 6 years ago

Hi Michael,

Thanks for checking out the 5.0 branch, greatly appreciate that you're giving it a spin as there are an awful lot of changes that need some testing out!

The new live Disney API is pretty huge, as it pulls down metadata for everything you could possibly interact with at Walt Disney World. It's all being stored at the moment because I've tried to write the couchbase fetcher to behave almost identically to the real couchbase API libraries.

You're right that we could slim this down a bit. We're already extracting the latest "rev" id so it's easier to look up in couchbasesync, and that's all that's really needed to figure out whether we have the latest version or not. We could get rid of the couchbase data blob entirely for standard use to slim the database size waaaaay down. This could be the default option, with users able to opt in to storing the whole shebang if they wish (I personally want to save the whole thing, as I'm often tinkering with dining searches or the live bus times or something, but I absolutely get why it's not always wanted, as it gets pretty huge quickly). Currently, though, the calendar and facilities tables have a "docKey" column, which is used to reference the "full document" in the couchbasesync table, so this would need a re-shuffle to push that data in some structured way into the relevant tables instead of stuffing it into couchbasesync.
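To make the proposal above concrete, here is a minimal sketch of a sync store that keeps only the "rev" per document by default and stores the full document only when asked to. It is not code from the themeparks repository: `SyncStore`, `storeFullDocuments`, `needsUpdate`, and `record` are hypothetical names, and an in-memory `Map` stands in for the couchbasesync table.

```javascript
// Hypothetical sketch: none of these names come from the themeparks codebase.
// A Map stands in for the couchbasesync table.
class SyncStore {
  constructor({ storeFullDocuments = false } = {}) {
    // Opt-in flag: keep the full document body alongside the rev.
    this.storeFullDocuments = storeFullDocuments;
    this.rows = new Map(); // docKey -> { rev, body? }
  }

  // True when the document is unknown or the stored rev differs.
  needsUpdate(docKey, rev) {
    const row = this.rows.get(docKey);
    return row === undefined || row.rev !== rev;
  }

  // Record an incoming document; the body is kept only when opted in.
  record(docKey, rev, body) {
    const row = { rev };
    if (this.storeFullDocuments) {
      row.body = JSON.stringify(body);
    }
    this.rows.set(docKey, row);
  }

  // Returns the parsed body, or undefined if bodies are not being stored.
  getBody(docKey) {
    const row = this.rows.get(docKey);
    return row && row.body !== undefined ? JSON.parse(row.body) : undefined;
  }
}

// Default (slim) store: only the rev is retained.
const slim = new SyncStore();
slim.record('facility;1', '3-abc', { name: 'Example Ride' });

// Opt-in store: the whole document is retained as well.
const full = new SyncStore({ storeFullDocuments: true });
full.record('facility;1', '3-abc', { name: 'Example Ride' });
```

In the slim configuration the structured data would have to be written to the calendar and facilities tables at the moment a document arrives, since there is no stored body to look up later through "docKey".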

The cache table being just JSON blobs is largely an artifact of the old attempt at a caching system, where we were using a generic caching library. I'd like to leave the Cache library doing this generally, but it is a great idea to build something a little more structured for things like caching wait and opening times. The issue is that each park fetches data in various ways, and it often makes sense to cache it, but each park's data comes in a completely different format, so serializing JSON into a string often makes sense.
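The two approaches described above can sit side by side. The sketch below is an illustration under assumptions, not the library's actual Cache implementation: `BlobCache` is a generic cache that serializes any value to a JSON string with an expiry, while `WaitTimeTable` holds wait times in a fixed record shape. Both class names, the record fields, and the injectable `now` clock are hypothetical.

```javascript
// Hypothetical sketch: names and record shapes are assumptions.

// Generic cache: any park-specific payload is stored as a JSON string.
class BlobCache {
  constructor(now = () => Date.now()) {
    this.now = now; // injectable clock, so expiry can be tested
    this.entries = new Map(); // key -> { json, expires }
  }

  set(key, value, ttlMs) {
    this.entries.set(key, {
      json: JSON.stringify(value),
      expires: this.now() + ttlMs,
    });
  }

  // Returns the parsed value, or undefined when missing or expired.
  get(key) {
    const entry = this.entries.get(key);
    if (!entry || entry.expires <= this.now()) {
      this.entries.delete(key);
      return undefined;
    }
    return JSON.parse(entry.json);
  }
}

// Structured store: every park's wait times share one record shape,
// so reads need no JSON parsing and can be filtered by column.
class WaitTimeTable {
  constructor() {
    this.rows = new Map(); // "parkId:rideId" -> record
  }

  upsert(parkId, rideId, waitMinutes, fetchedAt) {
    this.rows.set(`${parkId}:${rideId}`, { parkId, rideId, waitMinutes, fetchedAt });
  }

  forPark(parkId) {
    return [...this.rows.values()].filter((row) => row.parkId === parkId);
  }
}
```

Under this split, each park would keep using the generic cache for its raw responses and write only the normalized wait times into the structured table.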

Curious about any further thoughts you have about this. I feel the 5.0 structure is fairly firmed up now, so this is a good time to go over our caching setup, as with the new live Disney API, it's going to be essential.

mledford commented 6 years ago

Hi Jamie,

Thanks for the information. I knew there were likely reasons why some things are the way they are, so it's nice to get some context around it. My thoughts on this were primarily about memory and performance, since I run some code on a Raspberry Pi. My use case is primarily collection, so slimming down sync data and preferring a small disk cache over a memory cache, with events being pushed to me (#129), might make more sense. But I also understand there are many different ways this library may be used.

Thanks for all the hard work!