Edit distance is well-defined over arrays of any kind of categorical values (like integers standing in for characters or strings), not just strings. However, the implementation of edit distance is hard-coded to only accept strings or lists of strings.
Motivation
First, I'd like to use the edit distance operators in this library but converting tensors back into strings is a quite expensive operation.
Second, in my implementation, the tables needed to do said conversion are not currently local to the module that would hold the torchmetric metric objects either, so it would add additional bookkeeping as well.
Third and finally, I don't even want to do comparison based on the number of edits as computed via Unicode strings but rather based on arbitrary symbols which may be multi-character. For instance, if we treat strings as nominal attributes, then the edit distance for:
gold: ["this", "is", "fun"]
hypothesis: ["this", "isn't", "fun"]
is 1, but we get a different result (3) if we join the strings together and use EditDistance:
gold: "this is fun"
hypothesis: "this isn't fun"
I would like the ability to get the former measure and keep the computations on the right device.
Pitch
Currently the update function is typed to accept Union[str, List[str]] and raises an error if this is not the case. I would remove this restriction so that pairs of 1d and 2d integral tensors are also accepted, and would be interpreted the same way as a pair of strings and a pair of lists of strings would be, respectively. Other than that preprocessing, the checks on type would be loosened there and in the lower-level implementation here.
I would be willing to take a stab at this for EditDistance; I am not volunteering to do it for all the other text metrics however.
Alternatives
Of course I could always just roll my own metric and keep it to myself.
Additional context
I wonder if this is going to be miserable on CUDA since there's nothing CUDAish about the edit distance algorithm's very iterative implementation.
🚀 Feature
Edit distance is well-defined over arrays of any kind of categorical values (like integers standing in for characters or strings), not just strings. However, the implementation of edit distance is hard-coded to only accept strings or lists of strings.
Motivation
First, I'd like to use the edit distance operators in this library but converting tensors back into strings is a quite expensive operation.
Second, in my implementation, the tables needed to do said conversion are not currently local to the module that would hold the
torchmetric
metric objects either, so it would add additional bookkeeping as well.Third and finally, I don't even want to do comparison based on the number of edits as computed via Unicode strings but rather based on arbitrary symbols which may be multi-character. For instance, if we treat strings as nominal attributes, then the edit distance for:
["this", "is", "fun"]
["this", "isn't", "fun"]
is 1, but we get a different result (3) if we join the strings together and use
EditDistance
:"this is fun"
"this isn't fun"
I would like the ability to get the former measure and keep the computations on the right device.
Pitch
Currently the update function is typed to accept
Union[str, List[str]]
and raises an error if this is not the case. I would remove this restriction so that pairs of 1d and 2d integral tensors are also accepted, and would be interpreted the same way as a pair of strings and a pair of lists of strings would be, respectively. Other than that preprocessing, the checks on type would be loosened there and in the lower-level implementation here.I would be willing to take a stab at this for
EditDistance
; I am not volunteering to do it for all the other text metrics however.Alternatives
Of course I could always just roll my own metric and keep it to myself.
Additional context
I wonder if this is going to be miserable on CUDA since there's nothing CUDAish about the edit distance algorithm's very iterative implementation.