TypesettingTools / SubInspector

Low-level subtitle inspection library.
MIT License
5 stars 3 forks source link

Events adding #7

Open Youka opened 10 years ago

Youka commented 10 years ago

On the way making everything safer, calling with invalid event indices shouldn't cause crashes. Checking the index and returning an error code on failure would be a fix, but i would prefer a stack for events storing, frees users from initialization, a final number of events and passing indices.

torque commented 10 years ago

Data structures are kind of a weak point for me. How would you implement a stack? I was thinking maybe a linked list, but this is simplistic and probably suboptimal.

I definitely agree that the init method should be done away with. I would like to hear your thoughts on the backing structure.

Youka commented 10 years ago

Stack implementation

torque commented 10 years ago

So that's a yes to the linked list, then. I was initially skeptical because seeking through a linked list isn't exactly fast, and there are scripts out there with thousands of events. That said, there's theoretically no good reason to do random seeking through events, so I'm fine with it.

I think a FIFO queue should probably be used because it's likely users will want to access events in the order they're used.

Youka commented 10 years ago

FIFO wouldn't be better, because data aren't stored together like in arrays, so there's no advantage in caching or read pointer movement. Additionally FIFO would require a pointer to the end for fast element appending - all in all, a stack is the choice.

An alternative would be a vector: a wrapped array which double his memory size if a new element doesn't fit. This could lead to slower building and some memory wasting, but has the fastest memory access.

torque commented 10 years ago

FIFO wouldn't be better from an implementation standpoint, but it'd almost certainly be better from a usage standpoint. As I said, I expect users would probably want to access events in the order that they're added, for which a FILO queue is not useful.

Using a vector is ok, but we'd need to decide on a reasonable default size.

Youka commented 10 years ago

I don't see the problem with my stack implementation, there's a getter function for overall access. A FIFO would just search in another direction for the wished element, nothing else. You think of destroying the list after building it, popping one element out after another, but that's not what we want.

A vector starts with size 0, allocates for the first element 1, then doubles and doubles and so on. Allocating large memory chunks / finding big gaps in process heap is slow and with a lot of data, a lot of memory could be wasted (f.e. with 1025 elements used, 1023 would be garbage).

torque commented 10 years ago

I've given you commit access, as I don't have the time or energy to spend on this right now.

A few comments: your stack implementation is fine. But if you have a script with 100k events (not impossible with karaoke and lots of frame-by-frame typesetting), in order to access the first event, you have to loop through 99.999k stack nodes just to get to the first one. Since, as I am now saying for the 3rd time, I expect users will want to access their data in the order they add it, this seems very suboptimal.

By simply keeping separate read and write pointers (read pointing to the beginning of the list, write pointing to the end of the list) and changing each node to link to the next rather than the previous, the user would be able to iterate through the list in add order without incurring the overhead of having to seek all the way back through it for early reads.

But maybe I'm just overthinking it and the read order doesn't matter that much.

Also, re: vectors: there's no reason it has to start with a size of zero. There are guaranteed to be some events, otherwise there's no point in even using the library. Wasting time reallocating very small amounts of memory is silly. Off the top of my head I'd suggest maybe 500 or so (which I would expect to fit the majority of scripts with dialogue and moderate typesetting).

Additionally, the growth rate does not have to be by factors of two. It's been suggested a growth factor of 1.5 can be more efficient because it allows reuse of previously freed memory (stack overflow).

As a final note, I find it curious you use pastebin.com when github has a (far superior, in my opinion) pastebin of its own.

Hopefully my feedback has been helpful.

Youka commented 10 years ago

The growth factor of 2 was just an example, but the article is right that with 1.5 the memory hole in front the used memory can get reused (which can lead to less memory reservation for a process)... however this includes there's just this one memory chunk and no other allocations which could use the hole too and 1.5 needs more reallocations for a lot of elements than 2 - advantages and disadvantages.

I didn't consider huge scripts and you're right that linked lists are too slow in access, so i'll go with a vector implementation. But vectors should be flexible and not waste a lot of memory when there're tiny ones.