Determine where all qualified identifiers are in the source text.

tom-p-reichel commented 1 year ago

For special behavior while training a model, it helps to know where a qualified identifier is actually located in the sentence it is associated with.

I suggest the following diff, which extracts the associated loc structure for each qualified identifier from the sexp and adds a (start,end) left-inclusive tuple to the qualified identifier structure, where start and end are character counts within the corresponding sentence.

diff --git a/prism/interface/coq/ident.py b/prism/interface/coq/ident.py
index 715d973..6607e20 100644
--- a/prism/interface/coq/ident.py
+++ b/prism/interface/coq/ident.py
@@ -54,24 +54,20 @@ _serqualid_contexts = [
 _serqualid_contexts = regex_from_options(_serqualid_contexts, False, False)
 serqualid_re = re.compile(
     rf"{_serqualid_contexts.pattern}?\(\s*Ser_Qualid\s*{dirpath_re.pattern}\s*"
-    rf"{id_re('str_of_qualid').pattern}\s*\)")
+    rf"{id_re('str_of_qualid').pattern}\s*\).*?") #TODO: properly finish this regex.
+# as of now it just scans until we find the closest loc in the AST.
 loc_re = re.compile(
-    r"\(\s*loc\s*\(\(\(\s*fname\s+ToplevelInput\)\s*\(\s*line_nb\s+\d+\)\s*"
-    r"\(\s*bol_pos\s+\d+\)\s*\(\s*line_nb_last\s+\d+\)\s*\(\s*bol_pos_last\s+\d+\)\s*"
-    r"\(\s*bp\s+\d+\)\s*\(\s*ep\s+\d+\)\)\)\)")
+    r"\(\s*loc\s*\(\(\(\s*fname\s+ToplevelInput\)\s*\(\s*line_nb\s+(?P<line_nb>\d+)\)\s*"
+    r"\(\s*bol_pos\s+(?P<bol_pos>\d+)\)\s*\(\s*line_nb_last\s+(?P<line_nb_last>\d+)\)\s*\(\s*bol_pos_last\s+(?P<bol_pos_last>\d+)\)\s*"
+    r"\(\s*bp\s+(?P<bp>\d+)\)\s*\(\s*ep\s+(?P<ep>\d+)\)\)\)\)")
 lident_re = re.compile(
-    rf"\(\s*v\s*{id_re('str_of_lident').pattern}\s*\)\s*"
-    rf"(?={loc_re.pattern})")
+    rf"\(\s*v\s*{id_re('str_of_lident').pattern}\s*\)\s*")
 lname_re = re.compile(
-    rf"\(\s*v\s*\(\s*Name\s*{id_re('str_of_lname').pattern}\s*\)\s*\)\s*"
-    rf"(?={loc_re.pattern})")
-ident_re = regex_from_options(
-    [serqualid_re.pattern,
-     lident_re.pattern,
-     lname_re.pattern],
+    rf"\(\s*v\s*\(\s*Name\s*{id_re('str_of_lname').pattern}\s*\)\s*\)\s*")
+ident_re = re.compile(regex_from_options(
+    [lname_re.pattern,lident_re.pattern,serqualid_re.pattern],
     False,
-    False)
-
+    False).pattern+loc_re.pattern)

 class IdentType(enum.Enum):
     """
@@ -138,6 +134,10 @@ class Identifier(NamedTuple):
     """
     The identifier itself.
     """
+    position: tuple[int,int] = None
+    """
+    The position of the identifier within the source text.
+    """

 def ident_of_match(ident: re.Match) -> Identifier:
@@ -179,7 +179,8 @@ def ident_of_match(ident: re.Match) -> Identifier:
         assert lname is not None
         ident_str = unquote(lname)
         ident_type = IdentType.lname
-    return Identifier(ident_type, ident_str)
+    position = (int(ident["bp"]),int(ident["ep"]))
+    return Identifier(ident_type, ident_str, position=position)

 def id_of_str(txt: str) -> str:
@@ -293,7 +294,7 @@ def qualify_ident(
                 if fully_qualified.startswith(top_prefix):
                     fully_qualified = modpath + fully_qualified[len(toppath):]
                 global_id_cache[ident_str] = fully_qualified
-    return Identifier(ident_type, fully_qualified)
+    return Identifier(ident_type, fully_qualified, position=ident.position)

 def _get_all_idents(

a-gardner1 commented 1 year ago

Nice idea, but I do not believe that every identifier has a corresponding loc structure, and if it does exist, its relative location varies depending on the syntax.

I could be misremembering, though. I do know that we intended to use such location information from what was available in the cache, though I think it relied upon infilling the Sexps and inferring missing locations.

tom-p-reichel commented 1 year ago

The cached sentences I looked at are a pretty small sample size, but I never saw this produce something that didn't make sense. I can add a runtime assertion that the substring of the original text this identifies is a suffix of the fully qualified name and run extraction on many files, which I think would be a pretty good indicator if this works in general (however I am currently only working with coq8.13).

This doesn't seem to interact with any notations, which I expect would have very confusing locations, it only seems to catch written idents.

Radiance-Technologies / prism

Determine where all qualified identifiers are in the source text. #60