How do we get character ranges of the clusters?

sudarshansivakumar commented 1 year ago

Right now, when we call predictor.predict(), we get the clusters as a list of lists, and the cluster heads along with their token indices. Is it possible to:

Get the cluster heads as character ranges? For example, given 'cluster_heads': {'Momofuku Ando': [4, 5], 'Osaka': [12, 12], 'instant noodles': [9, 10], 'Many students': [22, 23], 'Nissin': [18, 18]}, can we get character ranges instead of token/word indices like [4, 5], [12, 12], etc.?
Alternatively, can we get a separate variable that maps the token indices to tokens? Something like ['Do', 'not', 'forget', 'about', ...]. I tried looking at how the text is tokenized but couldn't work that out from the code. For my application I need to check whether a coreference appears in a particular character range, and the most accurate way to do that would be to compare character ranges directly.

davidberenstein1957 commented 1 year ago

That would be possible. @Masboes, this would be a good first issue to pick up.

shmouelsamares commented 1 year ago

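In spaCy you could use the token.idx property to get the index of the token's first character. Then token.idx + len(token) gives the end offset, i.e. the index just past the token's last character, so text[token.idx:token.idx + len(token)] is the token text. Is that useful?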
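To illustrate the idea, here is a minimal sketch (not the library's implementation) of mapping an inclusive token span such as [4, 5] to a character range and then checking for overlap with another range. The helper names are hypothetical, and the whitespace tokenizer is only a stand-in: the indices will match the predictor's output only if the offsets come from the same tokenization it uses. With a spaCy Doc, the offsets could presumably be built as [(t.idx, t.idx + len(t)) for t in doc] instead.

```python
import re


def token_offsets(text):
    """Return (start, end) character offsets for each token.

    Assumption: tokens are whitespace-delimited. This stands in for the
    real tokenizer; with spaCy, use [(t.idx, t.idx + len(t)) for t in doc].
    """
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def token_span_to_char_span(offsets, first, last):
    """Map an inclusive token span [first, last] to a half-open
    character span (start, end), so that text[start:end] is the mention."""
    return offsets[first][0], offsets[last][1]


def overlaps(span, start, end):
    """True if the half-open span shares at least one character
    with the half-open range [start, end)."""
    return span[0] < end and start < span[1]


text = "Do not forget about Momofuku Ando"
offsets = token_offsets(text)

# 'Momofuku Ando' is reported as token indices [4, 5] in the issue.
start, end = token_span_to_char_span(offsets, 4, 5)
mention = text[start:end]
```

Treating the end offset as exclusive keeps it consistent with Python slicing and with token.idx + len(token), which avoids off-by-one errors when comparing ranges.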

davidberenstein1957 commented 1 year ago

@sudarshansivakumar @shmouelsamares you can see a working example here.

davidberenstein1957 / crosslingual-coreference

How do we get character ranges of the clusters? #17