Create new test report document schema

preaction commented 7 years ago

The existing test report document that testers send in to the API is a serialized Metabase::Fact object. Since we're moving away from Metabase, we should make a simpler document format to contain all the data from the tester's report. This document will contain all of the same information, but structured differently to be easier to read/use as plain JSON. This document will be described as an OpenAPI / JSON Schema object to ensure the CPAN Testers API only gets valid data. Some sections of the document will allow for custom data to enable future innovation and development.

The document should contain, at least:

The distribution that was tested ( "dist" )
- Name
- Version
- Author
The result of the test ( "report" )
- The full text of the report
- The status of the report (pass, fail, na, unknown)
- Additional structured metadata from the test runner
The language and interpreter ( "lang" )
- Language: Perl 5
- Version: 5.24.0
- Maturity: Stable or Development
- Architecture: x86_64-linux-thread-multi
- Commit ID: For testing pre-release interpreters
The machine, OS, and environment ( "env" )
- OS type
- OS version
- Environment variables

This document must be supported by this API before any testers can begin using it.

garu commented 7 years ago

As for iterations and future compatibility, my suggestion would be to simply version the API. Since we are currently in "CPAN Testers 2.0", we could create an entirely new 3.0 endpoint and work from there, thus allowing people to migrate their clients. Once there are no reports being sent to the metabase backend we can shut it down. This would also allow us to make all clients "forward compatible".

Sending a report would be a simple case of submitting a POST to (e.g.) https://cpantesters.org/api/v3.0/reports and getting a success. Failures would contain an explanation in JSON, so if we update the format in the future, we bump the endpoint to v3.1 (or v4.0) and provide a deprecation policy (3m, 6m, 1y, etc). Once the endpoint reaches end of life, we return an error message like (e.g.) {"error": "deprecated API endpoint. Please upgrade your client to use version XXX" }.

This is my initial schema proposition, in OpenAPI format:

{
  "swagger": "2.0",
  "info": {
    "version": "3.0",
    "title": "CPAN Testers 3.0",
    "description": "API to submit and manipulate test reports",
    "termsOfService": "http://cpantesters.org/api/license",
    "contact": {
      "name": "CPAN Testers core team",
      "email": "admin@cpantesters.org",
      "url": "cpantesters.org"
    },
    "license": {
      "name": "MIT",
      "url": "..."
    }
  },
  "host": "cpantesters.org",
  "basePath": "/api/v3.0",
  "schemes": [ "https" ],
  "consumes": [ "applicatoin/json" ],
  "produces": [ "application/json" ],
  "paths": {
    "/reports": {
      "post": {
        "description": "Submit a new report to CPAN Testers",
        "openationId": "addReport",
        "parameters": [{
          "in": "body",
          "required": true,
          "schema": { "$ref": "#/definitions/NewReport" }
        }],
        "responses": {
          "202": {
            "description": "accepted reports",
            "schema": { "$ref": "#/definitions/AcceptedReports" }
          },
          "401": {
            "description": "invalid credentials",
            "schema": { "$ref": "#/definitions/Error" }
          },
          "400": {
            "description": "report contains errors",
            "schema": { "$ref": "#/definitions/Error" }
          },
          "default": {
            "description": "unexpected error",
            "schema": { "$ref": "#/definitions/Error" }
          }
        }
      }
    }
  },
  "definitions": {
    "NewReport": {
      "type": "object",
      "required": [ "sender", "environment", "distributions" ],
      "properties": {
        "sender": {
          "type": "object",
          "schema": { "$ref": "#/definitions/Sender" }
        },
        "environment": {
          "type": "object",
          "schema": { "$ref": "#/definitions/Environment" }
        },
        "distributions": {
          "type": "object",
          "schema": { "$ref": "#/definitions/Distributions" }
        }
      }
    },
    "Sender": {
      "type": "object",
      "required": ["name", "token"],
      "properties": {
        "name":  { "type": "string" },
        "token": { "type": "string" }
      }
    },
    "Environment": {
      "type": "object",
      "required": ["language", "system"],
      "properties": {
        "language": {
          "type": "object",
          "schema": { "$ref": "#/definitions/Language" }
        },
        "system": {
          "type": "object",
          "schema": { "$ref": "#/definitions/System" }
        },
        "reporter": { "type": "string" },
        "toolchain": {
          "type": "object",
          "additionalProperties": { "type": "string" }
        }
      }
    },
    "Language": {
      "type": "object",
      "required": ["name", "version"],
      "properties": {
        "name":           { "type": "string" },
        "version":        { "type": "string" },
        "implementation": { "type": "string" },
        "vm":             { "type": "object" },
        "build":          { "type": "string" },
        "commit_id":      { "type": "string" },
        "variables": {
          "type": "object",
          "additionalProperties": { "type": "string" }
        }
      }
    },
    "System": {
      "type": "object",
      "required": ["osname"],
      "properties": {
        "osname":    { "type": "string" },
        "osversion": { "type": "string" },
        "archname":  { "type": "string" },
        "hostname":  { "type": "string" },
        "cpu_count": { "type": "string" },
        "cpu_type":  { "type": "string" },
        "cpu_description": { "type": "string" },
        "variables": {
          "type": "object",
          "additionalProperties": { "type": "string" }
        }
      }
    },
    "Distributions": {
      "type": "array",
      "items": {
        "type": "object",
        "schema": { "$ref": "#/definitions/Distribution" }
      }
    },
    "Distribution": {
      "type": "object",
      "required": ["name", "version", "grade", "output"],
      "properties": {
        "name": { "type": "string" },
        "version": { "type": "string" },
        "grade": {
          "type": "string",
          "enum": ["PASS", "FAIL", "NA", "UNKNOWN"]
        },
        "tests":    { "type": "integer", "minimum": 0 },
        "failures": { "type": "integer", "minimum": 0 },
        "skipped":  { "type": "integer", "minimum": 0 },
        "todo":     { "type": "integer", "minimum": 0 },
        "warnings": { "type": "integer", "minimum": 0 },
        "duration": { "type": "integer", "minimum": 0 },
        "comments": { "type": "string" },
        "output": {
          "type": "object",
          "schema": { "$ref": "#/definitions/TestOutput" }
        },
        "prerequisites": {
          "type": "object",
          "schema": { "$ref": "#/definitions/Prerequisites" }
        }
      }
    },
    "TestOutput": {
      "type": "object",
      "properties": {
        "uncategorized": { "type": "string" },
        "configure":     { "type": "string" },
        "build":         { "type": "string" },
        "test":          { "type": "string" },
        "install":       { "type": "string" }
      }
    },
    "Prerequisites": {
      "type": "object",
      "properties": {
        "requires": {
          "type": "object",
          "additionalProperties": { "$ref": "#!/definitions/NeedHave" }
        },
        "recommends": {
          "type": "object",
          "additionalProperties": { "$ref": "#!/definitions/NeedHave" }
        },
        "build_requires": {
          "type": "object",
          "additionalProperties": { "$ref": "#!/definitions/NeedHave" }
        },
        "configure_requires": {
          "type": "object",
          "additionalProperties": { "$ref": "#!/definitions/NeedHave" }
        },
        "test_requires": {
          "type": "object",
          "additionalProperties": { "$ref": "#!/definitions/NeedHave" }
        }
      }
    },
    "NeedHave": {
      "type": "object",
      "required": ["need"],
      "properties": {
        "need": { "type": "string" },
        "have": { "type": "string" }
      }
    },
    "AcceptedReports": {
      "type": "object",
      "additionalProperties": {
        "type": "object",
        "required": ["status"],
        "properties": {
          "status": { "type": "string" },
          "id": { "type": "string" }
        }
      }
    },
    "Error": {
      "type": "object",
      "required": ["code", "message"],
      "properties": {
        "code":    { "type": "integer", "format": "int32" },
        "message": { "type": "string" }
      }
    }
  }
}

Since I think the spec is a bit difficult to grasp, this is my "minimal report" sample, with only required fields:

  {
    "sender": {
      "name": "YOUR-CPANTESTERS-USERNAME",
      "token": "YOUR-CPANTESTERS-TOKEN"
    },
    "environment": {
      "language": {
        "name": "Perl 5",
        "version": "5.25.11"
      },
      "system": {
        "osname": "GNU/Linux"
      }
    },
    "distributions": [{
      "name": "Some-Module",
      "version": "0.37",
      "grade": "PASS",
      "output": {
        "uncategorized": "OUTPUT OF THE ENTIRE BUILD PROCESS"
      }
    }]
  }

And a complete report:

  {
    "sender": {
      "name": "YOUR-CPANTESTERS-USERNAME",
      "token": "YOUR-CPANTESTERS-TOKEN",
    },
    "environment": {
      "reporter": "CPAN-Reporter 2.34",
      "language": {
        "name": "Perl 5",
        "version": "5.25.11",
        "implementation": "perl",
        "commit_id": "5fc389563644287b3e5f448616ce62dd0ce4e7a6",
        "variables": {
          "$^X": "/home/cpan4/install/bin/perl"
        },
        "build": "OUTPUT OF 'perl -V'"
      },
      "system": {
        "osname": "GNU/Linux",
        "osversion": "3.16.0-4-amd64",
        "archname": "x86_64-linux-thread-multi",
        "hostname": "localhost",
        "cpu_count": 1,
        "cpu_type": "Intel Core i5",
        "cpu_description": "MacBook Air (1.3 GHz)",
        "variables": {
          "PATH": "/usr/local/bin:/usr/bin:/bin:/usr/local/games:/usr/games",
          "PERL5LIB": ""
        }
      },
      "toolchain": {
        "CPAN": "2.16",
        "ExtUtils::MakeMaker": "7.24"
      },
    },
    "distributions": [{
      "name": "Some-Module",
      "version": "0.37",
      "grade": "PASS",
      "tests": 39,
      "failures": 0,
      "skipped": 2,
      "todo": 0,
      "warnings": 1,
      "duration": 32,
      "output": {
        "uncategorized":   "ENTIRE BUILD. MAY BE OMITTED",
        "configure": "OUTPUT OF 'perl Makefile.PL' OR SIMILAR",
        "build":     "OUTPUT OF 'make' OR SIMILAR",
        "test":      "OUTPUT OF 'make test' OR SIMILAR",
        "install":   "OUTPUT OF 'make install' OR SIMILAR"
      },
      "prerequisites": {
        "requires": {
          "Capture::Tiny": { "need": "0", "have": "0.46" }
        },
        "recommends": {
          "YAML": { "need": "1.33" }
        },
        "build_requires": {
          "Archive::Tar": { "need": "1.54", "have": "2.24" }
        },
        "configure_requires": {
          "ExtUtils::MakeMaker": { "need": "6.17", "have": "7.24" }
        },
        "test_requires": {
          "CPAN::Meta": { "need": "2.120900", "have": "2.150010" }
        },
      },
      "comments": "This is a computer generated report..."
    }],
  }

Server replies would be like:

  {
    "Some-Module 0.23": {
      "status": "queued",
      "id": "d67f29d0-18a4-11e7-a074-e1beba07c9dd"
    },
    "Other-Module 0.23": {
      "status": "duplicate report",
      "id": "d67f29d0-18a4-11e7-a074-e1beba07c9dd"
    },
    "Cant-Find-ME 0.00": {
      "status": "not found"
    }
  }

Goes without saying that the above formats and definitions are not meant to be final, it's just to get the conversation started during the PTS :)

I already have a sample web application with the /api/v3.0 route that is able to validate and parse reports in that format and write them to a database. I made it so it will be super easy for us to update the schema and run as many prototypes as we need until we decide which direction to go.

Initially I had put a "perl" field, but your "language" construction felt better so I made the adjustments that became the above format. I'd love it for us to support different perls and, in case of Perl 6, there must be some extra data like language version (e.g. "6.c"), compiler version (e.g. for "rakudo" or "perlito"), virtual machine ("moarvm", "jvm") and others. Hopefully we'll figure out what the important ones are during the PTS.

preaction commented 7 years ago

Yes, the API is currently versioned, but I want to keep new versions to a minimum. There is likely no way for us to get away with even deprecating an API, much less removing it (especially for incoming test reports), as there is effectively no way to communicate deprecations via the HTTP protocol. I can scream on blogs.perl.org, blog.cpantesters.org, e-mail, and Twitter as much as I want, but if they aren't listening and an API goes dead, incoming test reports are going to be dropped. The shim API (#3) will likely live forever.

Also, since the backend uses the current format, we'll need to be able to translate back and forth until the backend can be updated to use the new format (see cpan-testers/cpantesters-schema#5 and cpan-testers/cpantesters-backend#3). The real benefits of being able to run queries on this structure will be a ways off.

But otherwise, this report structure looks great! I have a couple minor things I think could be improved:

"sender" should probably be named "reporter" or "author"
- After the report is being retrieved by others later as a GET request, "sender" has a more ambiguous meaning
When listing prerequisites, we should not duplicate what MetaCPAN already has
- We don't really care about what was recommended, we care about what the tester installed and tested against
- I could see keeping this just to help declare that a prereq exists when used with dynamic prereqs
Prerequisites could be a flat array of objects with a "name", "need_version", "have_version" and an optional "phase" field containing "build", "test", "configure"
- This makes it easier to query for prerequisites later, since they'll always be in the same place. No need to search all of "requires", "test_requires", "recommends", "build_requires", "configure_requires" hash keys to find the distributions / test reports that have a certain prereq distribution/version.
Only one distribution should be sent in a report
- This conforms to REST API ideals more: One request transfers the state of one object
- The response will contain the ID of the report created
- It removes the ambiguity of the HTTP status code if only some of the incoming reports were successful
- It allows a POST to send the report and the subsequent GET to be the exact same data structure

garu commented 7 years ago

I'm not going to pretend I understand the complexities of the current backend, specially when it comes to assessing the required effort to change it, so I would love to work on them with you during the PTS in order to get as close to the ideal environment as possible :)

As for the issues you mentioned:

I named it "sender" only because I used it for validation. I have prototyped the GET /api/v3.0/reports/:id endpoint returning a similar but not identical structure: we would skip the sender token (for obvious reasons), and we might augment it with data from consumers, like distribution author and any extra correlations, statistical regressions or anything else from andk's scripts. If you want to make it closer to the actual retrievable document no problem: we just move the authentication token to the HTTP header.

1.5. Note I have omitted some fields on purpose: dist author can be easily queried on metacpan, or at least better queried than on the client side (iirc that code relies on parsing the full path for the dist, which may not be available); whether the dist is on cpan or backpan, for the same reason, specially since they move all the time, making that data not really usefull; and version maturity, which I think is better controlled on the server side based on the provided version, if for example a stable release is deemed broken.

The more we can skip on the POST, the simpler it is for the clients, so I'm all for it. But we'd still have to figure prereqs out on the client side to submit what was satisfied and what wasn't, so the effort would already be there anyway. Also, like you said, the author might be interested in seeing what the installation thought was required on that perl/os/arch.
If you want the prereqs format to be a flat array of objects, no problem - I'm just not sure I understand the reasoning. I mean, when we talk about prereqs, shouldn't the phase make a difference? i.e. if a dist only recommends X, does X really count as a prereq (when searching)?

On a related note: how do you think we should tackle cases where the same distribution appears in more than one phase? E.g. the author mentions Carp under test_requires and build_requires? What if they have different versions, too? Should we signal this somehow? An array of objects would make this easier too, I guess.

No problem! I have struggled with this, the reasoning behind being someone might be interested in sending reports in bulk, and I didn't want to stress the server side with too many queries. But I fully agree, one dist per POST is better (makes it easier on the client side, too). If we ever feel a need to support reports in bulk, we can add it as a different route.

This way the GET would have the exact same data structure, but augmented by a few fields (like id, status, author, language maturity, and anything else we want). OpenAPI supports this just fine, too.

cpan-testers / cpantesters-api

Create new test report document schema #5