Add support for "human format"

Ethkuil commented 1 year ago

close #7

fireattack commented 1 year ago

Thanks.

I can merge that if it only changes the regex to support "human format" up to https://github.com/fireattack/chapter_converter/pull/8/commits/9dff4f62a34bfc47738fa0cb7d54aeb81b5a87f4.

But I don't think it's a good idea to refractor it dramatically, especially when you introduce breaking change(s).

And some syntax/style choices are intentional (e.g. keep all the functions in one file; use one-liner for parser.add_argument(); not use walrus operator to be compatible with Py3.7 etc.).

fireattack commented 1 year ago

Anyway, my philosophy for a public-facing script is that, if it works fine, I will not refactor for the sake of it (it's super short either way).

But if you introduce major feature which benefits from refactoring, feel free to bring it in!

Ethkuil commented 1 year ago

Oh, I'm sorry for making many refactors prior to communicating with you. In fact the main feature I would like to add doesn't rely on the refactor🤣.

You can revert some unsuitable refactors, or I will do that if you would like.

fireattack commented 1 year ago

Thanks!

I made some changes:

I don't want to mess with clipboard if an output is specified. So I reverted that part.
The "human format" regex is too relex IMO. Particularly, it shouldn't allow string that has no clear delimiter between timestamp and chapter name (e.g. 43:0001.chapter1). It just asks for trouble. So I changed it to only accept string that are clearly separated by \s+ or \s*,\s*.

Ethkuil commented 1 year ago

2. The "human format" regex is too relex IMO. Particularly, it shouldn't allow string that has no clear delimiter between timestamp and chapter name (e.g. 43:0001.chapter1). It just asks for trouble. So I changed it to only accept string that are clearly separated by \s+ or \s*,\s*.

I don't think it just asks for trouble. Just as its name indicates, it tries to deal with input manually inputted by human(for example copied from the profile/a comment of a video). And indeed I find several times that people write line like 23:10MC, and certainly other people can understand.

fireattack commented 1 year ago

I get that. I'm saying that's something the user needs to fix themselves. Because it could be ambiguous (43:00.121 chapter1 could be 43:00.12, 1 chapter 1 or 43:00.121, chapter 1). I prefer throw error in such case rather than parse it in a way that may surprise the user. I don't want to encourage such usage, anyway.

Also Just in case it's not clear. "Asks for trouble" is a figure of speech which means it is likely to incur problems in future. I'm not saying this feature itself is a "trouble".

Ethkuil commented 1 year ago

Because it could be ambiguous (43:00.121 chapter1 could be 43:00.12, 1 chapter 1 or 43:00.121, chapter 1).

Is this case, the parsing result will be undoubtedly the same as 43:00.121, chapter 1.

HUMAN_RE = r"(?P<time>[0-9]+:[0-9]{1,2}[0-9:.]*)([\s,]+(?P<name1>.+)|(?P<name2>[^\d:.\s,].+))" # only one of name1 and name2 will be matched

It'll first look for [\s,]+ as delimiters, and if found then it won't be ambiguous. If not found, then it'll try to take another character as the beginning of name part, and this character won't appear in time part or be common delimiters. It's [^\d:.\s,]. Maybe it's not ambiguous because there is always only one way to understand that makes sense for human.

Ethkuil commented 1 year ago

And certainly I agree that it should not be encouraged to use "no specific delimiter". But I think people won't write 43:00.121 chapter 1 and think it means 43:00.12,1 chapter 1. So currently I don't worry about that.

If anything has been overlooked in error, please point it out.

fireattack commented 1 year ago

Any sane csv/tsv parser won't even allow more than one consistent delimiter throughout the file, and that's for a reason.

For example, with the existing code, the following two:

17:02\t,I'm a title started with a comma

17:02,,I'm a title started with a comma

will be correctly recognized as name = ,I'm a title started with a comma. But the regex you provided will dismiss the leading comma.

A more tricky case is names start with a space. The old SIMPLE_RE indeed can't deal with it either (since I use , *), but if the user feed in a file that is properly delimited with tab, it can have leading space in names, too.

Of course, one can always argue which use case is more "common"; but the bottom line is, people who use the more "standard" format shouldn't be punished.

Anyway, here is my final compromise:

I will allow no delimiter.
I will sacrifice the support for having leading space in tab-delimited format.
But if you want to delimit the input with comma, it must be immediately after the timestamp, just like before. This way, things like
```
17:02,,I'm a title started with a comma
17:02\t,I'm a title started with a comma
```
will still be recognized as name = ,I'm a title started with a comma.

The new regex is

(?P<time>\d+:\d{1,2}[0-9:.]*)(,?\s*)(?P<name>.+)

You don't really need the name2 part. Because the front/left parts are all greedy, they will match up to the delimiter just fine. [^\d:.\s,] isn't needed.

Please let me know what you think before we merge.

Ethkuil commented 1 year ago

It's a perfect solution for me😄. Thank you for your patience.

I think the version with new restriction is closer to my thought compared to my previous version.

fireattack / chapter_converter

Add support for "human format" #8