Closed Masboes closed 1 year ago
Hi, the PR looks great however, in this case, I would opt for 2 more return keys "clusters_char" and "cluster_heads_chars" or even better, replace the old keys to deprecate this behaviour and go for a more tokenization-agnostic implementation.
Thanks for taking a look. Returning the chars in separate return keys makes sense to me. Wouldn't your second suggestion just be my PR but with char_indices
forced to be on? I imagine that's quite the breaking change. Maybe I misunderstand?
Yes it will be a breaking change but also allow for a more uniform integration with non spacy tokenization. I prefer to have one good working default option that can be used for both instead of having everything separate and endlessly configurable. On 16 Mar 2023, at 10:03, Mathijs B @.***> wrote: Thanks for taking a look. Returning the chars separately makes sense to me. Wouldn't your second suggestion just be my PR but with char_indices forced to be on? I imagine that's quite the breaking change. Maybe I misunderstand?
—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were assigned.Message ID: @.***>
Sounds fine to me. Shall I then just remove the option and make it so it always returns char indices? I can clean the code up a bit doing that. I'll also update the readme to reflect this change.
@davidberenstein1957 I've implemented your suggestions, let me know if this is what you had in mind. If so, I'll update the readme.
@davidberenstein1957 any chance you can take a look at this pr?
Yes, I will take a look on Wednesday.On 3 Apr 2023, at 08:41, Mathijs B @.***> wrote: @davidberenstein1957 any chance you can take a look at this pr?
—Reply to this email directly, view it on GitHub, or unsubscribe.You are receiving this because you were mentioned.Message ID: @.***>
char_indices
parameter, default=falseImplements #17