Closed Leeingnyo closed 3 years ago
Thanks for the PR! I was trying to review this but got confused about two points. I am probably missing something so it would be great if you could explain it a bit more.
What does @value.size * (instance_sizeof(String) + sizeof(String)) + @value.sum(&.size) + instance_sizeof(SortedEntriesCacheEntry) compute? I don't understand why you are adding instance_sizeof(String) and sizeof(String) together, or why you do @value.sum(&.size).
What are cached_xx and cached_xx_previous? And when clearing the cache, why do we do it in two steps (clear and clean)?

instance_sizeof(SortedEntriesCacheEntry) is sizeof(Time) * 2 + sizeof(String) + sizeof(Array(String)) + sizeof(object id).
Since String and Array are Reference types, they are essentially pointers, so sizeof on them returns the size of a pointer (4 or 8 bytes). I tried to calculate their actual data size. @value, which is of type Array(String), holds @value.size Strings, so I added @value.size * sizeof(String) for the pointers stored in the array. Because String is a reference type, I added instance_sizeof(String) for each String object as well. Each String also points to memory allocated for its character data, so I added @value.sum(&.size) for that allocated memory.
It's not perfectly accurate (I forgot to add the instance_sizeof and data of @key, and there would be other overhead), but I think it comes close to the actual memory consumption. Let me know if Crystal already has a fancy method to do this.
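Putting that estimate into code, it looks roughly like this (a sketch of the idea, not the exact PR code; the field layout is an assumption, and it uses bytesize as suggested later in this thread):

```crystal
class SortedEntriesCacheEntry
  @atime : Time
  @ctime : Time

  def initialize(@key : String, @value : Array(String))
    @atime = @ctime = Time.utc
  end

  # Rough estimate, in bytes, of the memory this entry holds on to.
  def instance_size
    instance_sizeof(SortedEntriesCacheEntry) +                   # the entry object itself
      @value.size * (instance_sizeof(String) + sizeof(String)) + # String headers + pointers in the array
      @value.sum(&.bytesize)                                     # heap data of each string
  end
end
```

As discussed above, this still omits the data behind @key and other overhead; it is a close approximation, not an exact measurement.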
After scanning, entries and titles may have been added or removed. We should invalidate cached_xx for removed entries and titles while preserving the cache for the remaining ones (this is intentional).
At the clear stage, cached_xx is backed up to cached_xx_previous and cached_xx is reinitialized.
During scanning, titles and entries move (that is, preserve) their cached data from cached_xx_previous back to cached_xx.
After scanning, the clean stage reinitializes cached_xx_previous, so the stale data can be GCed.
I think my naming sense is not good 😢
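The two-stage flow could be sketched like this (a hypothetical illustration; the surrounding class and method bodies are assumptions based on the description above):

```crystal
# Sketch of the clear/clean two-stage invalidation.
class ScanCache(T)
  @cached_xx = {} of String => T
  @cached_xx_previous = {} of String => T

  # Stage 1 (clear): before a scan, stash the current cache and start fresh.
  def clear
    @cached_xx_previous = @cached_xx
    @cached_xx = {} of String => T
  end

  # During the scan, each title/entry that still exists moves (= preserves)
  # its cached data from the previous generation into the new one.
  def preserve(key : String)
    if value = @cached_xx_previous[key]?
      @cached_xx[key] = value
    end
  end

  # Stage 2 (clean): after the scan, drop everything that was not preserved,
  # so data for removed titles/entries can be GCed.
  def clean
    @cached_xx_previous = {} of String => T
  end
end
```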
Ah now it makes more sense. Thanks for the clarification!
String#size returns the number of Unicode codepoints in the string, so the value might not equal the actual byte size, but I guess we can assume that we will only be storing ASCII characters.
I think the SortedEntriesCache class can easily be generalized into a general in-memory key-value store to cache other metadata as well. In fact, I think we can implement the same two-stage clearing method you mentioned in it and store the title info data in it too. That way we can have a single size limit that applies to all types of caches. What do you think?
I will test the current implementation and try to improve the naming a bit. I can then start exploring the idea above.
I just noticed there is String#bytesize to get the real byte size!
I think the data cached by InfoCache (i.e. display_name, the sum of deep title progress, and cover_url) is small enough to stay in memory. It is just three or four additional variables on each title and entry, and I actually want it not to be invalidated. But yes, in other words, it is small enough to be kept in an LRU cache system.
I agree that a generalized cache entry would be good. Let me try to do it. It takes time to learn Crystal 😄 I'll ask you for help when I run into problems.
By the way, I checked the total size of the info.json files from my huge library (12000 entries, 500GB, 1 user), which is only 1MB (as text files; the info.json of a title with 1000 entries is about 80KB). It seems fine for the whole set of info.json files to be cached in memory. How about using the generalized LRU cache in TitleInfo#new?
Haha I didn't know about String#bytesize
either but it's definitely a better choice.
The only problem I have with the current implementation is that we have two cache systems, InfoCache and the LRU cache, while the LRU cache can easily be extended to replace InfoCache. I think having a single unified class for caching will make future maintenance and development easier.
Yes, I agree that we can cache the metadata in the generalized LRU cache in TitleInfo#new. Since the generalized LRU cache will be a key-value store, we can use keys like [id]-coverurl and [id]-progress to store and query the data. For SortOption we can serialize it to JSON and cache it.
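A sketch of that key scheme (the LRUCache.get/set calls and SortOptions.from_json are assumptions for illustration, not confirmed API):

```crystal
# Namespace each cached value by ID and field name.
LRUCache.set "#{id}-coverurl", cover_url
LRUCache.set "#{id}-progress", progress.to_s

# SortOption is not a plain string, so serialize it to JSON before caching...
LRUCache.set "#{id}-sort-opt", sort_option.to_json

# ...and parse it back when reading.
if json = LRUCache.get "#{id}-sort-opt"
  sort_option = SortOptions.from_json json if json.is_a? String
end
```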
Let me try to do it. It takes time to learn Crystal 😄 I'll ask you for help when I run into problems.
Sure looking forward to it! As always, thank you for the effort you put into this :D
I fixed my implementation.
LRUCache
This is a general LRU cache that works as a key-value store. Each value is wrapped in a CacheEntry(SaveT, ReturnT) so the cache can track access time.
In CacheEntry(SaveT, ReturnT), SaveT is the type stored in the cache and ReturnT is the type returned by get. For example, the SaveT of the cache entry for sorted_entries is Array(String), while its ReturnT is Array(Entry). If you have a cache entry whose SaveT and ReturnT differ from each other, extend CacheEntry(SaveT, ReturnT) and override self.to_save_t and self.to_return_t. There are two aliases, CacheableType and CacheEntryType; when you want more cacheable types, add them to these aliases.
The generate_cache_entry method helps to make a cache entry easily.
Note that the return value of get should be checked with is_a?.
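Based on the description above, the shape of the API is roughly this (a sketch, not the exact PR code; ids_to_entries is a hypothetical helper):

```crystal
# Generic cache entry: SaveT is what is stored, ReturnT is what `get` returns.
class CacheEntry(SaveT, ReturnT)
  getter key : String
  property atime : Time

  @value : SaveT

  def initialize(@key : String, value : ReturnT)
    @atime = Time.utc
    @value = self.class.to_save_t value
  end

  # Identity by default; override in a subclass when SaveT and ReturnT differ.
  def self.to_save_t(value)
    value
  end

  def self.to_return_t(value)
    value
  end

  def value
    @atime = Time.utc # refresh access time for LRU eviction
    self.class.to_return_t @value
  end
end

# Example subclass: store entry IDs (cheap), return full Entry objects.
class SortedEntriesCacheEntry < CacheEntry(Array(String), Array(Entry))
  def self.to_save_t(value : Array(Entry))
    value.map &.id
  end

  def self.to_return_t(value : Array(String))
    ids_to_entries value # hypothetical helper resolving IDs back to entries
  end
end
```

Since get can return any of the cacheable types, callers narrow the result with is_a?, e.g. return cached if cached.is_a? Array(Entry).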
I've cached
Tuple#instance_size is not implemented well.
auto-generatable CacheableType and CacheEntryType (?)

Thanks! The code looks great overall, and I just have one question: why do we need to cache the sort options and cover URLs separately when we already have the TitleInfo in the cache? You can easily get the JSON from the cache and then retrieve the information you need. This would be a little bit slower, but I think the difference would be marginal. Am I missing something?
auto-generatable CacheableType and CacheEntryType (?)
Perhaps we can somehow achieve this with Crystal's inherited macro. I will look into this.
I agree that caching the sort options and cover URLs separately in the LRU cache is verbose, so I removed them and tested it. It took about 2ms to restore a cached info.json string into a JSON object for a title with about 220 entries, and about 10ms more to load the title page. I think this is fair enough.
I'm going to cache the cover_url in the title instance, like cached_display_name, because getting the cover URLs of its nested titles requires retrieving the info.json of those titles. I think a member variable cached_cover_url is much cheaper.
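That member-variable approach is plain memoization, roughly (a hypothetical sketch; compute_cover_url stands in for the logic that would otherwise read the info.json of nested titles):

```crystal
class Title
  @cached_cover_url : String?

  # Compute once, then reuse the stored value on every later call.
  def cover_url
    @cached_cover_url ||= compute_cover_url
  end
end
```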
Again thanks a lot for the PR! I took the liberty and made some small changes.
Great changes! Thanks for merging
Implementations
To improve page loading, I cached lots of data. The principles are: avoid repeatedly accessing info.json files, plus miscellaneous improvements.
cache entry_cover_url
Like entry_display_name, a cached entry_cover_url improves the rendering of src/views/components/card.html.ecr. This is very effective.
cache display_name of title
This avoids accessing info.json.
change where sort opt are saved
It's okay that sort options are saved when accessing the library or a book with parameters. I removed unnecessary actions. This avoids accessing info.json.
Caching Library
This utility, implemented in cache.cr, helps cached data be preserved after scanning. It caches: the cover_url of each entry, the deep_read_page_count of each title, and info.json files. I think the second one is quite effective.
Sorted Entries Cache
There are lots of calls to sorted_entries, which takes long (222 entries took 80ms on my machine). Besides, it is called three times with the same arguments when accessing a title page. Caching the results is effective since they do not change until the entries are modified. Currently, this option is not enabled by default; you should set the flag to true to enable it. Also, the cache size option doesn't guarantee that the machine consumes memory precisely.
Effectiveness
Test environment
I loaded a title page that has 222 entries (no nested titles, about 11GB) on a SATA3 SSD. I'm sorry that I have tested each case only once.
As is
first access: took 1140ms
second access: 1119ms (not changed)
Both took long, almost the same time.
With only caching entries' cover url
first access: 700ms (x1.6 faster)
second access: 735ms
With full implementation except sorted entries cache
first access: took 233ms (x4.9 faster)
second access: took 238ms
With full implementation
first access: took 115ms (x10 faster)
second access: took 50ms (even faster)
This partly resolves #186.