Open tom-p-reichel opened 1 year ago
Nice idea, but I do not believe that every identifier has a corresponding `loc` structure, and if it does exist, its relative location varies depending on the syntax.
I could be misremembering, though. I do know that we intended to use such location information from what was available in the cache, though I think it relied upon infilling the Sexps and inferring missing locations.
The cached sentences I looked at are a pretty small sample, but I never saw this produce something that didn't make sense. I can add a runtime assertion that the substring of the original text identified by the span is a suffix of the fully qualified name, then run extraction on many files. I think that would be a pretty good indicator of whether this works in general (however, I am currently only working with Coq 8.13).
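A minimal sketch of what such a runtime check could look like, in Python. The function name `span_matches_qualid` and its arguments are hypothetical and not taken from the diff; the span is assumed to be a `(start, end)` pair with `start` inclusive and `end` exclusive.

```python
from typing import Tuple


def span_matches_qualid(
    sentence: str, span: Tuple[int, int], fully_qualified: str
) -> bool:
    """Return True if the text covered by `span` is a suffix of the name.

    Hypothetical helper: `span` is assumed to be (start, end) with start
    inclusive and end exclusive, measured in characters of `sentence`.
    """
    start, end = span
    # Reject empty or out-of-range spans instead of silently slicing.
    if not (0 <= start < end <= len(sentence)):
        return False
    written = sentence[start:end]
    # The identifier as written may be partially qualified, so it only
    # needs to match the tail of the fully qualified name.
    return fully_qualified.endswith(written)


# Illustrative values only, not taken from real extraction output.
sentence = "Check Nat.add 1 1."
assert span_matches_qualid(sentence, (6, 13), "Coq.Init.Nat.add")
assert not span_matches_qualid(sentence, (0, 5), "Coq.Init.Nat.add")
```

A plain `endswith` would also accept a span that starts partway through a name component (for example `t.add`), so a stricter variant could compare dot-separated components instead.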
This doesn't seem to interact with any notations, which I expect would have very confusing locations; it only seems to catch written idents.
For special behavior while training a model, it helps to know where a qualified identifier is actually located in the sentence it is associated with.
I suggest the following diff, which extracts the associated `loc` structure for each qualified identifier from the sexp and adds a `(start,end)` left-inclusive tuple to the qualified identifier structure, where `start` and `end` are character counts within the corresponding sentence.
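As a rough illustration of the shape this could take, here is a Python sketch. The names `LocatedIdent` and `relative_span` are hypothetical stand-ins, not the structures touched by the diff. It also assumes that the location offsets are counted from the start of the document and that the sentence's own starting offset is known.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LocatedIdent:
    """Hypothetical stand-in for the qualified identifier structure."""

    qualified_name: str
    # (start, end) within the sentence: start inclusive, end exclusive.
    # None when no location could be recovered for this identifier.
    span: Optional[Tuple[int, int]] = None


def relative_span(
    loc_start: int, loc_end: int, sentence_start: int
) -> Tuple[int, int]:
    """Convert document-level offsets into sentence-relative offsets."""
    if not (sentence_start <= loc_start <= loc_end):
        raise ValueError("location does not fall inside the sentence")
    return (loc_start - sentence_start, loc_end - sentence_start)


# Illustrative values only: a sentence assumed to begin at offset 100.
sentence = "Check Nat.add 1 1."
ident = LocatedIdent("Coq.Init.Nat.add", relative_span(106, 113, 100))
start, end = ident.span
written = sentence[start:end]  # the identifier as written in the sentence
```

Keeping `span` optional leaves room for identifiers that have no usable location.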