ahrefs / atd

Static types for JSON APIs

atdgen: menhir error #391

Open aryx opened 1 year ago

aryx commented 1 year ago

With this file:

(* New Semgrep syntax (hence the v2) specified using ATD instead of jsonschema.
 *
 * For more information on the new syntax, see:
 *  - Brandon's community Slack post announcing the new syntax
 *    https://semgrep.slack.com/archives/C018NJRRCJ0/p1698430726062769?thread_ts=1698350734.415849&cid=C018NJRRCJ0
 *  - Brandon's slides
 *    https://docs.google.com/presentation/d/1zzmyFbfNlJqweyzuuFlo4zpSs3Gqhfi6FiNRONSEQ0E/edit#slide=id.g1eee710cdbf_0_26
 *  - Pieter's video
 *    https://www.youtube.com/watch?v=dZUPjFvknnI
 *  - Parsia's blog post
 *    https://parsiya.net/blog/2023-10-28-semgreps-experimental-rule-syntax/
 *
 * Note that even if most Semgrep users use YAML to write a rule, and not JSON,
 * we still use a JSON tool (here ATD, but also jsonschema) to specify
 * the rule schema because YAML is a superset of JSON and can be
 * mechanically translated into JSON; there is no yamlschema
 * (see https://json-schema-everywhere.github.io/yaml). To add even more
 * confusion, a jsonschema can actually be specified using YAML (like in
 * rule_shema_v1.yml), and so one can use YAML syntax to specify the
 * JSON schema of files actually written in YAML (hmmm).
 *
 * Jsonschema is powerful but also arguably complicated and so it
 * might be simpler for many Semgrep developers (and also some Semgrep
 * users) to use ATD to specify and understand the schema of a rule.
 * It could provide a better basis to think about future syntax extensions.
 *
 * This file is now also used for some rule validation in
 * `semgrep --validate --develop`.
 *
 * Note that this file does not replace Parse_rule.ml nor Rule.ml. We still
 * want to accept the old syntax in Parse_rule.ml and also parse with
 * position information and error recovery which ATD does not provide.
 * This file does not (yet) replace rule_schema_v1.yml either, which is
 * more complete.
 *
 * TODO:
 *  - taint
 *  - extract
 *  - r2c-internal-project-depends-on-content
 *  - secrets
 *  - steps (and join?)
 *  - generalized taint
 *  - new metavariable types
 *  - new 'anywhere:'
 *)

(*****************************************************************************)
(* Basic types and string aliases *)
(*****************************************************************************)

(* escape hatch *)
type raw_json <ocaml module="Yojson.Basic" t="t"> = abstract

(* ex: "*.c" *)
type glob = string

(* ex: "[a-zA-Z_]*\\.c" *)
type regex = string

(*****************************************************************************)
(* The rule *)
(*****************************************************************************)

type rule = {
     id: rule_id;

     message: string;
     severity: severity;

     (* TODO: selector vs analyzer *)
     languages: language list;

     (* CHECK: exactly one of those fields must be set *)
     ?match_ <json name="match">: formula option;
     ?taint: taint_spec option;
     ?extract: extract option;
     (* TODO: join, steps, secrets, sca *)

     ~mode <ocaml default="`Search">: mode;
     (* TODO: product: product *)

     (* TODO? could be replaced by a pattern-filename: *)
     ?paths: paths option;

     ?fix: string option;
     ?fix_regex: fix_regex option;

     ?metadata: raw_json option;
     ?options: rule_options option;

     ?version: version option;
     ?min_version: version option;
     ?max_version: version option;

     (* later: equivalences: ... *)
}

(* Rule_ID.t, "^[a-zA-Z0-9._-]*$" *)
type rule_id = string wrap <ocaml module="Rule_ID">

(* Version_info.t *)
type version = string (* TODO  wrap <ocaml module="ATDStringWrap.Version"> *)

type mode = [
  | Search <json name="search">
  | Taint <json name="taint">
  | Join <json name="join">
  | Extract <json name="extract">
  | SemgrepInternalPostprocessor <json name="semgrep_internal_postprocessor">
  (* TODO: Steps, SCA? *)
]

(*****************************************************************************)
(* Types of rule fields *)
(*****************************************************************************)

(* coupling: semgrep_output_v1.atd with match_severity *)
type severity = [
  | Error <json name="ERROR">
  | Warning <json name="WARNING">
  | Info <json name="INFO">
  (* should not be used *)
  | Experiment <json name="EXPERIMENT">
  | Inventory <json name="INVENTORY">
]

(* coupling: language.ml *)
type language = [
  (* programming (and configuration) languages *)
  | Apex <json name="apex">
  | Bash <json name="bash">
  | Sh <json name="sh">
  | C <json name="c">
  | Clojure <json name="clojure">
  | Cpp <json name="cpp">
  | CppSymbol <json name="c++">
  | Csharp <json name="csharp">
  | CsharpSymbol <json name="c#">
  | Dart <json name="dart">
  | Dockerfile <json name="dockerfile">
  | Docker <json name="docker">
  | Ex <json name="ex">
  | Elixir <json name="elixir">
  | Generic <json name="generic">
  | Go <json name="go">
  | Golang <json name="golang">
  | Hack <json name="hack">
  | Html <json name="html">
  | Java <json name="java">
  | Js <json name="js">
  | Javascript <json name="javascript">
  | Json <json name="json">
  | Jsonnet <json name="jsonnet">
  | Julia <json name="julia">
  | Kt <json name="kt">
  | Kotlin <json name="kotlin">
  | Lisp <json name="lisp">
  | Lua <json name="lua">
  | Ocaml <json name="ocaml">
  | Php <json name="php">
  | Python2 <json name="python2">
  | Python3 <json name="python3">
  | Py <json name="py">
  | Python <json name="python">
  | R <json name="r">
  | Ruby <json name="ruby">
  | Rust <json name="rust">
  | Scala <json name="scala">
  | Scheme <json name="scheme">
  | Solidity <json name="solidity">
  | Sol <json name="sol">
  | Swift <json name="swift">
  | Tf <json name="tf">
  | Hcl <json name="hcl">
  | Terraform <json name="terraform">
  | Ts <json name="ts">
  | Typescript <json name="typescript">
  | Vue <json name="vue">
  | Yaml <json name="yaml">

  (* not regular programming languages *)
  | Regex <json name="regex">
  | None <json name="none">
]

type paths = {
  ~include_ <json name="include">: glob list;
  ~exclude: glob list;
}

type fix_regex = {
  regex: regex;
  replacement: string;
  ?count: int option;
}

type rule_options <ocaml from="Rule_options" t="t"> = abstract

(*****************************************************************************)
(* Search mode (default) and formula *)
(*****************************************************************************)

(* 'formula' below is handled by a <json adapter.ocaml=...> because there is no
 * way to encode directly using ATD the way we chose to represent formulas
 * in YAML/JSON.
 *
 * old: this type was called new-pattern in rule_schema_v1.yaml
 *)

type formula = {
  (* CHECK: exactly one of those fields must be set *)
  (* either directly a string or pattern: string in the JSON *)
  ?pattern: string option;
  ?regex: regex option;
  ?all: formula list option;
  ?any: formula list option;
  (* check: not/inside/anywhere can appear only inside an all: *)
  ?not: formula option;
  ?inside: formula option;
  ?anywhere: formula option;
  (* TODO? Taint of taint_spec *)

  (* alt: we could instead do '?all: formula list option * condition list'
   * above, but syntactically we also allow 'where' with pattern:, regex:,
   * etc. as in
   *    { pattern: ..., where: ..., }
   * In fact that's the main reason we sometimes have to use pattern: string
   * instead of a string because where: could not be attached to it.
   *)
  ~where: condition list;
}
<json adapter.ocaml="Rule_schema_v2_adapter.Formula">

(* Just like for formula, we're using an adapter to transform
 * conditions in YAML like:
 *
 *  where:
 *   - metavariable: $X
 *     regex: $Z
 *
 * which when turned into JSON gives:
 *
 *  { where: [
 *     { metavariable: $X,
 *       regex: $Z
 *     }
 *   ] }
 * 
 * which we must transform in an ATD-compliant:
 *
 *  [ ["M", [{ metavariable: $X,
 *             regex: $Z
 *           }]
 *    ]]
 *)
type condition = [
  | Focus <json name="F"> of focus
  | Comparison <json name="C"> of comparison
  | Metavariable <json name="M"> of metavariable_cond
  ]
<json adapter.ocaml="Rule_schema_v2_adapter.Condition">

type focus = {
  (* either a single string or an array in JSON, that is
   * {focus: "$FOO"}, but also {focus: ["$FOO", "$BAR"]}
   *)
  focus: mvar list;
}

type mvar = string

type comparison = {
    comparison: string; (* expr *)
    ?base: int option;
    ~strip: bool;
  }

type metavariable_cond = {
  metavariable: mvar;
  (* CHECK: exactly one of those fields must be set *)
  ?type: string option;
  ?types: string list option;
  (* this covers regex:, pattern:, but also any formula.
   * TODO: for metavariable-regex, can also enable constant_propagation
   * TODO: we should also accept language: string
   *)
  ?analyzer: analyzer option;
}  

type analyzer = [
  | Entropy <json name="entropy">
  | Redos <json name="redos">
]

(*****************************************************************************)
(* Taint mode *)
(*****************************************************************************)

type taint_spec = raw_json

(*****************************************************************************)
(* Extract mode *)
(*****************************************************************************)

type extract = raw_json

(*****************************************************************************)
(* Toplevel *)
(*****************************************************************************)

type rules = {
  rules: rule list;

  (* Missed count of pro rules when not logged-in.
   * Sent by the registry to the CLI since 1.48.
   * See https://github.com/semgrep/semgrep-app/pull/11142
   *)
  ?missed: int option;
}

Running atdgen rule_schema_v2.atd produces:

Fatal error: exception Atd.Parser.MenhirBasics.Error
Raised at Atd__Parser.MenhirBasics._eRR in file "atd/src/parser.ml" (inlined), line 8, characters 6-17
Called from Atd__Parser._menhir_run_043 in file "atd/src/parser.ml", line 2517, characters 10-17
Called from Atd__Parser.full_module in file "atd/src/parser.ml" (inlined), line 3593, characters 34-92
Called from Atd__Util.read_lexbuf in file "atd/src/util.ml", line 14, characters 19-56
Called from Atd__Util.load_file in file "atd/src/util.ml", line 64, characters 6-148
Re-raised at Atd__Util.load_file in file "atd/src/util.ml", line 72, characters 4-11
Called from Atdgen_emit__Ob_emit.make_ocaml_files in file "atdgen/src/ob_emit.ml", line 1364, characters 8-164
Called from Dune__exe__Ag_main in file "atdgen/bin/ag_main.ml", line 428, characters 6-13
Re-raised at Dune__exe__Ag_main in file "atdgen/bin/ag_main.ml", line 435, characters 11-18

instead of a clear parsing error.

aryx commented 1 year ago

cc @mjambon

aryx commented 1 year ago

It was because of the ?type: string option field (in metavariable_cond).

aryx commented 1 year ago

Still, a better error message would be nice.

aryx commented 1 year ago

Low priority.

mjambon commented 1 year ago

This happens because type is a keyword. I think it's a matter of adding a case for handling the error in the menhir file. Here's a minimal atd file with this error:

$ cat bug.atd
type t = { type: string }
$ atdgen bug.atd
Fatal error: exception Atd.Parser.MenhirBasics.Error

(btw, I don't know why I'm not getting a stack trace. I'm using atdgen 2.11.0 as shipped by opam 2.1.0 with ocaml 4.14.0)
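
A possible workaround, following the pattern the schema above already uses for match_ and include_: rename the field so it no longer collides with the keyword, and map it back to the JSON key with a <json name="..."> annotation. This is only a sketch. The name type_ is an arbitrary choice, and it has not been checked against a particular atdgen version.

```atd
(* bug.atd, reworked: the ATD field is named type_ (hypothetical name),
 * while the JSON key stays "type". *)
type t = { type_ <json name="type">: string }
```

Applied to the original file, the field in metavariable_cond would become ?type_ <json name="type">: string option;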

tlavoie commented 1 month ago

Even if not closed, the existence of this issue is sufficient to be able to search for, and find, this particular error message. (I just did, having started experimenting with atdgen, so thanks for this!)