BurntSushi / imdb-rename

A command line tool to rename media files based on titles from IMDb.
The Unlicense
224 stars 20 forks source link

Is this even helping? #5

Closed yoursvivek closed 5 years ago

yoursvivek commented 5 years ago

I was going through source code and found that rating data (rating and votes) is appended into id and inserted into fst::Set. My question is why not stick to fst::Map instead?

https://github.com/BurntSushi/imdb-rename/blob/f2a40bfcc39ac01201b896044f31318c0b82db82/imdb-index/src/index/rating.rs#L98

To my understanding, this new set element with 18 bytes (first nine for _imdbid, followed by 0x00, and 4 bytes each for rating and votes) will bloat the fst as last 9 byte tail will always have it's own separate state transformation chain. Instead, if map is used, u64 space of value can store both rating and votes without bothering about extra space wasted in storing fst transformations. Help me understand, if I'm thinking the right direction?

I came to this observations after I did a similar thing on Have I Been Pwned dataset.

-rw-r--r--   1 vivek  admin  18648381081 Oct 18 17:49 pwned-passwords-ordered-by-hash.set.fst
-rw-r--r--   1 vivek  admin  18679934062 Oct 17 23:46 pwned-passwords-ordered-by-hash.map.fst
-rw-r--r--   1 vivek  admin  22793984105 Jul 11 06:07 pwned-passwords-ordered-by-hash.txt

.txt file is sha1 hash and count data like following:

000000005AD76BD555C1D6D771DE417A4B87E4B4:4
00000000A8DAE4228F821FB418F59826079BF368:2
00000000DD7F2A1C68A35673713783CA390C9E93:630

.map.fst uses 20 bytes sha1 as key, count as u64 value. In .set.fst 20 bytes of sha1 are appended with u32 as big-endian bytes. My original intent was saving it as sets would save some space which would reduce file size and therefore would incur lesser page faults when reading mmaped fst data. Instead I ended up without and practical saving (0.17%?).

If there is something else that I'm not seeing, let me know.

BurntSushi commented 5 years ago

@yoursvivek Sorry, I'm having trouble parsing your comment here. It sounds like you're saying an fst map will use less space, but your example shows the exact opposite? That is, your fst set is using 18,648 MB but your fst map is using 18,679 MB. Honestly, that's not much difference at all.

Or... maybe you're asking a conceptual question here? Are you saying that you expected the map to use less space, and you're wondering why it isn't? If so, then yeah, that is fairly easy to explain. The issue with a map is that maps store the output values on the transitions themselves, which means there is less opportunity for reusing states. Remember, a state can only be reused if all of its outgoing transitions are the same. In the case of a map, the output values on those transitions are part of determining equality between transitions. So yeah, the set does need to encode more states naively, but you're not accounting for the limits on state reuse in maps. It's a bit hard to reason about without going and looking directly at the finite state machine.

Also, you might be overestimating how much extra space is used by the set in this case. I can't remember all the specifics, but in the very best case, a state in an fst is represented by a single byte. I think that case is for states with exactly one transition, which would probably correspond to the tailing states in the set.

yoursvivek commented 5 years ago

Let me try again. Preface: I tried to save space in my pwned dataset by moving from map to set but using 4 extra bytes in keys, instead of 8 byte value of map. I forgot about bytes required in transition states. Also given that all sha1 hashes in my file were unique, the byte representation of count will be a tail with no chance of reuse of state. I learned my lesson.

Now over to imdb-rename code, I see it as a similar data set. Unique keys (imdb id), and same sized value payload (4+4 bytes). Since keys are unique, there are no chances of reuse of state when appending value payload to keys themselves. Set representation will also incur cost of saving state transitions for each byte of value payload appended to key. then there is a nul byte too.

That begs the question: Why did you use sets instead of maps in saving ratings fst in imdb-rename? I'm just curious. Is there some other gain?

BurntSushi commented 5 years ago

I don't know. Maybe I started with maps and then switched to sets because it uses less space? That's unlikely though, since ratings are I think the smallest fst in the index, so I doubt I would care.

Another possibility is that I thought it might be possible for one IMDb title to be associated with multiple ratings. The type signatures don't encode that possibility, but I wasn't sure if the data guaranteed it.

Another possible explanation is that I just copied the code from representing the TV show index, which does require this sort of representation. For example:

https://github.com/BurntSushi/imdb-rename/blob/f2a40bfcc39ac01201b896044f31318c0b82db82/imdb-index/src/index/episode.rs#L100-L113

yoursvivek commented 5 years ago

Hmm, Ok. So in general, when should one use fst sets and when should one use fst maps?

PS: I really need to finish reading your Automata article.

BurntSushi commented 5 years ago

So in general, when should one use fst sets and when should one use fst maps?

I don't think there exists a general answer. There is probably also an argument about access time here. It's probably faster to find a value in a map using get than it is to execute a range query and decode the tail of your key to get a value.