Closed Ethkuil closed 1 year ago
Thanks.
I can merge that if it only changes the regex to support "human format" up to https://github.com/fireattack/chapter_converter/pull/8/commits/9dff4f62a34bfc47738fa0cb7d54aeb81b5a87f4.
But I don't think it's a good idea to refractor it dramatically, especially when you introduce breaking change(s).
And some syntax/style choices are intentional (e.g. keep all the functions in one file; use one-liner for parser.add_argument()
; not use walrus operator to be compatible with Py3.7 etc.).
Anyway, my philosophy for a public-facing script is that, if it works fine, I will not refactor for the sake of it (it's super short either way).
But if you introduce major feature which benefits from refactoring, feel free to bring it in!
Oh, I'm sorry for making many refactors prior to communicating with you. In fact the main feature I would like to add doesn't rely on the refactor🤣.
You can revert some unsuitable refactors, or I will do that if you would like.
Thanks!
I made some changes:
43:0001.chapter1
). It just asks for trouble. So I changed it to only accept string that are clearly separated by \s+
or \s*,\s*
.2. The "human format" regex is too relex IMO. Particularly, it shouldn't allow string that has no clear delimiter between timestamp and chapter name (e.g.
43:0001.chapter1
). It just asks for trouble. So I changed it to only accept string that are clearly separated by\s+
or\s*,\s*
.
I don't think it just asks for trouble. Just as its name indicates, it tries to deal with input manually inputted by human(for example copied from the profile/a comment of a video). And indeed I find several times that people write line like 23:10MC
, and certainly other people can understand.
I get that. I'm saying that's something the user needs to fix themselves. Because it could be ambiguous (43:00.121 chapter1 could be 43:00.12, 1 chapter 1
or 43:00.121, chapter 1
). I prefer throw error in such case rather than parse it in a way that may surprise the user. I don't want to encourage such usage, anyway.
Also Just in case it's not clear. "Asks for trouble" is a figure of speech which means it is likely to incur problems in future. I'm not saying this feature itself is a "trouble".
Because it could be ambiguous (43:00.121 chapter1 could be 43:00.12, 1 chapter 1 or 43:00.121, chapter 1).
Is this case, the parsing result will be undoubtedly the same as 43:00.121, chapter 1
.
HUMAN_RE = r"(?P<time>[0-9]+:[0-9]{1,2}[0-9:.]*)([\s,]+(?P<name1>.+)|(?P<name2>[^\d:.\s,].+))" # only one of name1 and name2 will be matched
It'll first look for [\s,]+
as delimiters, and if found then it won't be ambiguous. If not found, then it'll try to take another character as the beginning of name
part, and this character won't appear in time
part or be common delimiters. It's [^\d:.\s,]
. Maybe it's not ambiguous because there is always only one way to understand that makes sense for human.
And certainly I agree that it should not be encouraged to use "no specific delimiter". But I think people won't write 43:00.121 chapter 1
and think it means 43:00.12,1 chapter 1
. So currently I don't worry about that.
If anything has been overlooked in error, please point it out.
Any sane csv/tsv parser won't even allow more than one consistent delimiter throughout the file, and that's for a reason.
For example, with the existing code, the following two:
17:02\t,I'm a title started with a comma
17:02,,I'm a title started with a comma
will be correctly recognized as name = ,I'm a title started with a comma
. But the regex you provided will dismiss the leading comma.
A more tricky case is names start with a space. The old SIMPLE_RE indeed can't deal with it either (since I use , *
), but if the user feed in a file that is properly delimited with tab, it can have leading space in names, too.
Of course, one can always argue which use case is more "common"; but the bottom line is, people who use the more "standard" format shouldn't be punished.
Anyway, here is my final compromise:
17:02,,I'm a title started with a comma
17:02\t,I'm a title started with a comma
will still be recognized as name = ,I'm a title started with a comma
.
The new regex is
(?P<time>\d+:\d{1,2}[0-9:.]*)(,?\s*)(?P<name>.+)
You don't really need the name2 part. Because the front/left parts are all greedy, they will match up to the delimiter just fine. [^\d:.\s,]
isn't needed.
Please let me know what you think before we merge.
It's a perfect solution for me😄. Thank you for your patience.
I think the version with new restriction is closer to my thought compared to my previous version.
close #7