Closed switchupcb closed 3 years ago
Issues The _test_srtremove.py needs to be able to import the functions from _srtremove.py to work in this library. The functions themselves are tested using the test file and the CLI.
Future Development Once merged, I'd like to add an argument in combination with 'fixed-timeshift' to specify whether removals are "inserted" (where removing adjusts caption times). Currently, all removals are "replaced" (where the time of each caption isn't effected). This would allow us to remove and adjust times at the same time.
Perhaps, the split function could use it's own file and certain methods (contains, etc) could be placed in a util for use in basic-tools (add, copy, etc).
Hey! A few general thoughts from a cursory glance, since this will need some overall massaging.
--t
makes for a confusing API, both internally and externally (eg. get_timestamp
knows way too much). Use separate arguments for separate data.--t
should be -t
, double dashes are for long options.isinstance
sprinkling around -- it's necessary in some edge cases, but by and large the code shouldn't be type-generic.split
operates the way it does. Why do you do a double append with index 0?I have more questions but I suspect it's not super helpful to list them all, since this will need some reworking anyway.
My general reaction is that there are a lot of premature micro-optimisations here, and I'm not entirely convinced that all the functionality is justified. Could you please go more into your motivations for each piece of functionality and the profiling you did to justify the tradeoff of some of these micro-optimisations? :-)
This pull request introduces 5 alerts when merging 07a29c115651f637d9761a9edcd85f368b774542 into 2d9f734ff52ac4b162cd0ef2ee84c793530e10f0 - view on LGTM.com
new alerts:
Thanks for reviewing the pull request so fast! =D
Here are some answers to the questions you had. Be sure to advise me on the next changes we need to make.
I use captioning with music. Automated tools aren't good enough to caption music (even without the instrumental). What I'll do is place the vocals and lyrics in an automated tool such as YouTube's and then adjust the times it provides. This means I must remove captions, copy choruses, etc.
With Adobe Premiere's new update you can remove mass captions but you can't copy them. You can't remove by "insert". Soon, they will have an accurate speech to text tool (better than YouTube's). However, no program has the other features I need (remove by insert, copy, etc).
Implementation
This seems significantly overcomplicated to me. For example, what's the use case for "remove all to the last N subs" where you don't already know the absolute value in question? I get the value of "remove subs from {time,index} N to M", but the rest don't strike me as something that there's much demand for. A lot of the conditions and internal logic is also more complex than it needs to be because of that.
It's actually an unintended result of the simple implementation provided below. I've kept it to help lazy people. For some captions, we want to remove FROM a specific timestamp without having to check the file. A lot of time is wasted opening the file and checking the last index over time. Also, the last index in the file could be wrong: You can't tell where every index ends easily using eyes/code because captions aren't sorted by end times.
Implementation (Algorithm)
This results in 5 edge cases (4 sequential, 1 non-sequential). With the new changes, only 2 are directly covered to simplify things (but some may be indirectly covered). When the user doesn't specify a second timestamp, we remove from timestamp one to end (because there's no timestamp two).
See RCI and RCT Optimization below for more information.
Parser
Overloading --t makes for a confusing API, both internally and externally (eg. get_timestamp knows way too much). Use separate arguments for separate data. --t should be -t, double dashes are for long options.
Thanks. I changed "--t" back to "-t". Do you have any way of simplifying the argument (via nargs or an alternative) so the user doesn't enter "-x 1 2 -t 00:00:05,00 00:00:10,00"? What should we do in this case? How would a user remove from {time,index} N to M without an implementation like this?
split() Optimization
Binary search is almost certainly gratuitous. Did you profile it against simple walking? No matter how many times I look at it, I can't work out why split operates the way it does. Why do you do a double append with index 0?
I've committed a new method that performs a single walk through.
Single Walkthrough Explained You can't calculate where the caption being split's index would go on the fly unless you hold the captions in an iterable. Here's an example of why:
The new method walks through the generator once. In order to ensure captions are sorted, we created a list of the new captions and keep track of the index. Once we arrive at the split index, we are able to sort them and add the remaining captions.
remove_caption_index() and remove_caption_timestamp() Optimization As far as I'm aware, you can't turn a negative index into a positive one without knowing the length of the iterable. You also can't check whether index_one and index_two are valid (unless you remove stuff anyways).
You could create a remove by index method that uses a single walkthrough (provided in the profile), but then it will lose the checking and negative index functionality. It guarantees no index will be in between specified numbers but you may be mislead in thinking index 26 exist(ed) and it never did.
The whole idea behind RCT was getting the timestamps then removing by index. However I've created a new method which removes by timestamps. Looking at the profile statistics below, they are pretty much the same. Sometimes the bsearch method wins and sometimes the walk does. I went with the walk because it eliminates the need for util which means simpler and easier to read code.
All code used to profile can be found here: https://pastebin.com/u/switchupcb/1/eL4dBf1u where n = number of subtitles in a generator, s = number of subtitles effected, run = number of times function is ran
Keep in mind that a new generator needs to be created for every run which inflates the times.
isinstance()
There's way too much isinstance sprinkling around -- it's necessary in some edge cases, but by and large the code shouldn't be type-generic.
There is only two cases of isinstance() now. With that being said, here were the results:
n = 100,000, list(), run = 10000
Range: 26.6974238
List: 2.7328538999999985
Generator: 0.009768199999999894
n = 100,000, isistance() else list(), run = 10000
Range: 2.740842200000003
List: 0.0012198999999952775
Generator: 0.010537000000006458
split()
n = 5, s = 1, run = 1000000
list, bsearch, timsort: 1.0983209
walk, timsort: 0.3698865
walk: 0.28525639999999997
n = 100,000, s = 1, run = 1000000
list, bsearch, timsort: 1.9422774
walk, timsort: 0.3613230000000005
walk: 0.32073970000000074
n = 100,000, s = 50,000, run = 1000000
list, bsearch, timsort: 2.5298762999999997
walk, timsort: 0.34685589999999955
walk: 0.28242230000000035
remove_caption_index()
n = 5, s = 1, run = 1000000
deque, sort: 0.5850082999999999
list, walk: 0.5416839000000001
walk (no index check or negative index): 0.5663668
n = 100,000, s = 1, run = 1000000
deque, sort: 0.6403641
list, walk: 0.5941318999999998
walk (no index check or negative index): 0.6067177999999998
n = 100,000, s = 50,000, run = 1000000
deque, sort: 0.6180734000000001
list, walk: 0.5740768999999997
walk (no index check or negative index): 0.6000827000000006
remove_caption_timestamp
## run 1
n = 5, s = 1, run = 1000000
list bsearch(x2) rci : 0.6162214
walk: 0.6071775999999999
n = 100,000, s = 1, run = 1000000
list bsearch(x2) rci : 0.6469964000000001
walk: 0.6807417999999998
n = 100,000, s = 50,000, run = 1000000
list bsearch(x2) rci : 0.6708639999999999
walk: 0.6476633
## run 2
n = 5, s = 1, run = 1000000
list bsearch(x2) rci : 0.6819531999999999
walk: 0.6077996
n = 100,000, s = 1, run = 1000000
list bsearch(x2) rci : 0.6091927000000001
walk: 0.6199763
n = 100,000, s = 50,000, run = 1000000
list bsearch(x2) rci : 0.6019388999999999
walk: 0.6211809000000001
Tests How do we fix the unit test import?
LGTM
I can fix the 5 alerts with the next commit. 4 alerts are due to me calling next() in a generator. I could change those I can add a try except to each.for i in range(n)
loops to for subtitle in subs
if necessary. Do you see a better way of doing this?
The other fix is simple. Make the 'timestamp_two' parameter default to datetime.timedelta(0)
.
I'm likely to be quite busy for a while, so it's unlikely this will be reviewed fully in a timely manner, but the code quality looks better, thanks.
More general suggestions:
srt lines-matching
with a timestamp helper where you can do (for example) srt subs-matching -f 'sub.start < ts("00:08,100") and sub.index < 4'
.Parser
- Don't use a single argument which then relies on consecutive positionals. Use --t-start and --t-end, or similar, and ditto for the index. The current calling convention is really unlike what's standard in Unix and isn't very intuitive.
Using --start, --t1 and --end,--t2 like srt linear-timeshift.
- Don't overcomplicate a single program in order to have it do everything. It's much more generally useful to have something like srt lines-matching with a timestamp helper where you can do (for example) srt subs-matching -f 'sub.start < ts("00:08,100") and sub.index < 4'.
srt-lines-matching doesn't do this and there's no example of this. Plus, it seems more complicated. You should commit that change as you have a better grasp of Unix Conventions.
Unanswered Questions
Future Once reviewed, we can add the insert option before/after the pull. Please prioritize the review within 3 weeks for me to ideally add a srt-copy once pulled.
srt-lines-matching doesn't do this and there's no example of this.
Well, that's the point: having a generic utility to do this kind of filtering at the subtitle level rather than the line level would achieve what you want without having to encode all possible combinations of logic ahead of time. All of the plumbing is there in srt-lines-matching, it just now has to be done at the Subtitle object level.
You should commit that change as you have a better grasp of Unix Conventions.
Please prioritize the review within 3 weeks for me to ideally add a srt-copy once pulled.
I appreciate you contributing to srt, but speaking as an open source maintainer who does this in his spare time, it's not up to you to dictate my schedule. :-)
The commit as it is still needs a lot of work, and as well as you creating it, that realistically requires a lot of time commitment from my side to provide all the feedback. I wouldn't count on this being merged in that time period unless there are changes to a less complex approach similar to lines-matching
as documented above.
How do we fix the unit test import
You need to use imp.load_source
(or whatever the portable 2/3 equivalent is), since this has no extension.
In the current commit, the user simply enters an index and/or timestamp they desire to remove from/to. This takes one action and does not require any checking of files. There is no "encoding of all possible combinations of logic". This implementation is no different from srt_timestamp_to_milliseconds
in srt-linear-timeshift
.
The design is pretty simple:
Don't overcomplicate a single program in order to have it do everything. It's much more generally useful to have something like srt lines-matching with a timestamp helper where you can do (for example) srt subs-matching -f 'sub.start < ts("00:08,100") and sub.index < 4'.
I must be misunderstanding this suggestion because it seems more complicated in any of 3 ways I'm interpreting it.
1. A timestamp helper In this suggestion, the user is using another command with long syntax in order to find a timestamp or index for a subtitle. This doesn't seem simpler than the current method.
2. An equivalent functionality to lines-matching Lines matching is ambiguous. You would think the emphasis on 'lines' means that the function edits the whole srt file (which has its uses) until you realize it yields every subtitle. The '-s' argument is actually supposed to be '--per-content' (of each subtitle). Nonetheless, the idea of lines-matching is running every subtitle through a function, and setting all the non-matching subtitles content empty (rather than returning matching ones).
Therefore, a sort-of-equivalent 'subtitle' matcher should take a function, and return the matching subtitles (ideally removing the non-matching ones). So in your example, you'd end up with all the subtitles that start before "00:08,100" and before index 4. This is great because you wanted to remove all subtitles after "00:08,100"; except you now have subtitles which still overlap "00:08,100" when you wanted none to be there.
You weren't supposed to find matching subtitles, you were supposed to remove the ones at a certain point(s). So all of this for a functionality that doesn't do what we actually want it to do.
3. A search/query Otherwise, you're asking me to implement a whole search/query functionality. That pull request would take like a year. Both of us can't prioritize it (I've only got 3 weeks). Not to mention, lines_matching is already supposed to be this functionality, but it's only for subtitle content. Wouldn't it be better to upgrade it?
While your previous suggestions assisted in simplifying the code, this one does not. The current suggestion adds complexity; otherwise, warrants a better explanation. I'm not adding more complexity or functionalities just to end up at the same position we're currently in. I'm asking you to review because we're starting to assume complexity where there isn't.
I do understand your concerns for readability vs. usability. I'm not convinced the code isn't readable in its current state. Please guide me into the decision you want me to make regarding this dilemma.
I'm on holiday for a few weeks from the end of this one, so I won't have much time to comment, but just to come back:
Therefore, a sort-of-equivalent 'subtitle' matcher should take a function, and return the matching subtitles (ideally removing the non-matching ones). So in your example, you'd end up with all the subtitles that start before "00:08,100" and before index 4. This is great because you wanted to remove all subtitles after "00:08,100"; except you now have subtitles which still overlap "00:08,100" when you wanted none to be there.
I asked before but I don't feel like there was much of an answer: why should that matter? Who are the tangible people that need that?
Not to mention, lines_matching is already supposed to be this functionality, but it's only for subtitle content.
Huh? lines-matching
is not supposed to have this functionality at all. Let me just point you at this. :-)
I asked before but I don't feel like there was much of an answer: why should that matter? Who are the tangible people that need that?
Specific use case from comment 6.
Context I use captioning with music. Automated tools aren't good enough to caption music (even without the instrumental). What I'll do is place the vocals and lyrics in an automated tool such as YouTube's and then adjust the
timescaptions it provides. This means I must remove captions, copy choruses, etc.With Adobe Premiere's new update you can remove mass captions but you can't copy them. You can't remove by "insert". ~Soon, they will have an~ They have accurate speech to text tool (better than YouTube's). However, no program has the other features (remove by insert, copy, etc).
Also, this is called srt-remove, not srt-select. There are many reasons you would want to remove a caption or to remove ALL captions within a time period (without keeping the others).
lines-matching is not supposed to have this functionality at all. Let me just point you at this. :-)
https://man7.org/linux/man-pages/man1/grep.1.html lines-matching
doesn't match lines, it matches (subtitle) content.
A curious question I had while using the importlib: Why are there no extensions on each functionality? I can see why the files have '-' so the programmer knows they can't be used as modules (thus not importable). When would we do this in other programs?
Hey mate, I know your intentions are good, but I'm afraid I've basically run out of the limited supply of patience and energy I have to review this. The questions and statements made here and in other issue tickets are just so fundamentally at odds with how srt is designed, and at this point we're just going around in circles.
I mean, just look at this example in the readme that shows what "modularity" really means:
$ srt lines-matching -m hanzidentifier -f hanzidentifier.has_chinese -i chs+eng.srt |
> srt fixed-timeshift --seconds -5 |
> srt mux --input - --input fra.srt
Ultimately, like the vast majority of open-source projects, this is a passion project that follows its author's own preferences, and we're going around in circles here. I wish you all the best in your future endeavours, though. :-)
Am I supposed to follow UNIX or your standards? I've already copied your code style for this method. I'm not recreating a method that doesn't do what this is supposed to do. There's obvious reasons why you'd remove a subtitle...
What I don't understand though, is why this is only modular to UNIX (23% of people) and not python (100%). For example, numpy doesn't require you to run shell commands to manipulate numbers in your program. Perhaps, that's where we disagree: I have much to say about Linux...
As I've said, your standards are followed pretty clearly in it. If you are misinterpreting a line, you should probably mark it in the review (so I can fix it). If the project isn't actually being maintained (due to a lack of "patience and energy"), you should say that in the project description or README. If you need me to DoorDash you "patience and energy", please let me know.
With that being said...
Parsing simplified. No Index. No timestamp method. Does literally one thing. UNIX. Issue #72 updated. The file has 2 methods not including the main function. Follows all patterns of every other functionality. There's only 120 lines to actually review; it still follows your template plus many lines are comments.
I'm not recreating a method that doesn't do what this is supposed to do. There's obvious reasons why you'd remove a subtitle...
This isn't a discussion about why you'd want to remove a subtitle. Read the above again:
I asked before but I don't feel like there was much of an answer: why should that matter [that subtitles overlap the boundary]? Who are the tangible people that need that?
To which you replied look at comment 6, but I've read that three times now and there's still nothing to be found.
What I don't understand though, is why this is only modular to UNIX (23% of people) and not python (100%).
I have no interest in producing a confused library which doesn't know what its intentions or design goals. It's really that simple. srt is tested on other platforms to support library users, but that's as far as it's gonna go. I have no time or interest in spending the ongoing time to support anything more than that.
There's no requirement to contribute to this or any codebase.
For example, numpy doesn't require you to run shell commands to manipulate numbers in your program.
This isn't numpy, whether in terms of design, intent, or in terms of number of contributors. The comparison just doesn't make any sense.
If the project isn't actually being maintained (due to a lack of "patience and energy"), you should say that in the project description or README.
You really won't get far contributing to open source with an attitude where you confuse "my pull request met with scrutiny" with "this project is unmaintained". Here's how you can tell the difference: on an unmaintained project, you don't get scrutiny: you get absolutely nothing. :-)
If you need me to DoorDash you "patience and energy", please let me know.
Mate, let me offer some final, sincere, advice: being cheeky to people who are being straight with you is a really poor strategy. See ya.
Added the srt-remove functionality. Unit tests included but commented.
srt-remove
Caption tools lack a mass-removal function. This commit will allow us to remove multiple captions by index/timestamp.
Future Development I'd want to make sure the function is merged by your standards prior to further development (such that I can get a better grasp of how to organize each function). I know that methods such as binary_search, contains_timestamp, captions_containing_timestamp may serve better in a util file. Perhaps, the split function could use it's own file.
After this, we could add an argument in combination with 'fixed-timeshift' to specify whether removals are in-place. Currently, all removals are in place. This would allow us to remove and adjust times at the same time.
Input Originally, I wanted the input for t1 and t2 to be interchange-able by index - timestamp. For example, an input could be '10' and '00:00:50,00' which would remove from index 10 to :50. This is possible by using get_timestamp to convert 10 to a time. However, I'm still figuring out argparse.
Unit Tests The functions are tested and confirmed to work. I could not get the module 'srt-remove' to import in the project so perhaps you can fix that?
Remove Method Explanation
Edge Cases An explanation of each edge case for remove_timestamps(). Keep in mind that reverse removals (where t2 > t1) are allowed, but during the calculation (code that determines the indexes) t1 and t2 are solely sequential.
Sequential Edge Cases
Captions are only removed from t1 - t2. If t2 occurs before or at the FIRST CAPTION, t1 must also occur before any captions. Therefore, no captions will be removed. As captions are sorted by start time, this edge case can be checked instantly.
Captions are only removed from t1 - t2. If t1 occurs after or at the LAST CAPTION, t2 must also occur after any captions. Therefore, no captions will be removed. Captions are sorted by start times (and NOT end times). Therefore, we can't instantly evaluate this edge case without knowing the LAST CAPTION.
Since t1 < t2 and captions are only removed from t1 - t2, if t1 occurs after or at the end of the LAST CAPTION TO BE REMOVED, no captions will be removed. This case can be checked if the LAST CAPTION INDEX is found.
All caption are removed from t1 - t2, when t1 < all captions and all captions are < t2. Again, captions aren't sorted by end times. Therefore, we have no way of confirming that the LAST CAPTION TO BE REMOVED is actually the very LAST CAPTION by index. However, we can by knowing the nearest captions.
Non-Sequential Edge Cases
Captions are only removed from 0 - t2 and t1 - end. This means 2 conditions must be met for no captions to be removed:
All caption are removed from 0 to t2, and t1 to the end. Since t2 < t1, all captions will be removed.
What about the other inverses of the non-sequential cases? In a non-sequential case, any captions that occur after t1 ARE removed. Even if t2 occurs directly before t1, there must be no captions after t1. As you can see, this means t1 and t2 are not dependent on each other. Even if t2 occurs at 0, t1 may still have captions to remove. Even if t1 occurs at the end, t2 may have captions to move. It's only if each have no captions in their direction, that none will be removed.
FIRST CAPTION TO BE REMOVED Since captions are sorted by start time, we can ALWAYS determine 'if t2 <= subtitles[0].start'. This is important because we must determine the FIRST CAPTION INDEX for the first bound of our range.
We find the FIRST CAPTION TO BE REMOVED by finding the [nearest contained captions] after t1. Specifically, the right-most nearest caption. Therefore, the FIRST CAPTION TO BE REMOVED is always at/after t1. Likewise, t1 always occurs at/before the FIRST CAPTION TO BE REMOVED.
We can determine 'if subtitles[-1] <= t1' if there is NO nearest-caption to the right of t1. We can determine (non-sequentially) 'if t2 <= subtitles[0].start and subtitles[-1].end <= t1' by the same reasoning.
If t1 is after the last caption, the following edge case is TRUE: 'if subtitles[t2].end <= t1'
LAST CAPTION TO BE REMOVED Since captions aren't sorted by end times, we can't simply check if the LAST CAPTION <= t2 by index (i.e subtitles[-1]). For the same reason, we also can't use the next caption (subtitles[t2 + 1]) as frame of reference. Instead, the [nearest captions] provide us a frame of reference.
We find the LAST CAPTION TO BE REMOVED by finding the [nearest captions] before t2. Specifically, the left-most nearest caption. Therefore, the LAST CAPTION TO BE REMOVED is always at/before t2. Likewise, t2 always occurs at/after the LAST CAPTION TO BE REMOVED.
We can determine 'if t1 <= subtitles[0].start and subtitles[-1].end <= t2' (remove all) if there is NO nearest-caption to the right of t2.
Nearest Caption (Not Included)
I wrote a nearest caption function but realized I could simply use the binary_search to find the nearest values. The nearest caption function simply takes the subtitles and a timestamp to find the nearest caption before/after (to the left/right of the) provided timestamp. These are returned in a 2D array such that you can determine the nearest caption on the left and right-hand side, and whether or not the timestamp even has captions before/after them (by empty array). I also finished testing the method. Where would I commit this?