Define public API - GitHub Issues

See discussion and possible endpoints here

Below are some thoughts on the structure of different endpoints; I encourage comments at any level of detail. I was parsimonious in the proposed responses, but in general attempted to keep them flexible enough to accommodate change. For example, the /edits/rows endpoint returns an object with a single rows key rather than just the rows array because we may want to return metadata on every request (timestamp, etc.).

We may want to create separate issues for each endpoint for discussion, but for now, enjoy this lengthy comment.

GET requests

Endpoint

/institutions/<id>

Response

{
  id: <id>,
  name: <institution name>,
  submissions: [
    {
      id: <incremental submission id>,
      status: <current status>,
      timestamp: <timestamp>
    }
  ]
}

Clarification/Discussion

<id> A unique id generated for the institution in question (when/where this is generated is TBD).

<incremental submission id> A unique id per institution, incrementing with resubmissions.

Submissions as an array: there are several options here:

an array of submission objects (shown above). This requires walking all the statuses in the array to find the latest submission, but feels more idiomatic. If order can be guaranteed in the array, the downside is low.
an object with incremental ids as the object's keys and submission objects as values. This is a little more information dense and allows faster lookup by id (if, for example, a currentSubmission key was included in the response, see below).
an array of submission objects with the position of a submission object equal to its submission id (assuming ids go from 0-n or 1-n). Maybe a little too cute.

currentSubmission: Should we provide a currentSubmission key so consumers can more easily get the latest submission (which is usually what people will be interacting with)? Alternatively, guaranteeing the order of submissions (e.g. sorting either descending or ascending on id) would provide just as much ease of use.
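To make the trade-off concrete, here is a minimal sketch of what a consumer would do with the array option when no currentSubmission key is provided. The sample payload, the status strings, and both function names are made up for illustration; nothing here is settled API.

```python
from typing import Optional

# Hypothetical institution payload using the array option (values are invented).
institution = {
    "id": "12345678",
    "name": "Bank of Bankerton",
    "submissions": [
        {"id": 1, "status": "signed", "timestamp": 1455000000},
        {"id": 3, "status": "uploaded", "timestamp": 1457000000},
        {"id": 2, "status": "failed", "timestamp": 1456000000},
    ],
}


def latest_submission(inst: dict) -> Optional[dict]:
    """Return the submission with the highest incremental id, or None if empty."""
    submissions = inst.get("submissions", [])
    return max(submissions, key=lambda s: s["id"], default=None)


def submissions_by_id(inst: dict) -> dict:
    """Re-key the array by submission id (the object-keyed option)."""
    return {s["id"]: s for s in inst.get("submissions", [])}
```

If the server guaranteed ordering, latest_submission would reduce to taking the first or last element, which is the ease-of-use argument made above.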

Guarantees on incremental submission id: should we use a counting service/CRDT? In practice, there shouldn't really be an issue, as new submissions will mostly be infrequent and spaced apart

Lack of year info: I think we should have different endpoints for each year, if not every year, cf.gov/hmda/2018/..., so including it in the response would be redundant (though I have the id in the response, which is also redundant by this logic).

Edit Counts? The UI has been mocking up some components that assume the submission object will have edit counts. This isn't really necessary imo, as the status will give all the actionable info (resubmit/verify/sign, etc.) and a click through to the page where one can view edits will display edit counts.

Should /submissions/<id> be its own endpoint? I don't think so, but I could see an argument for it

Endpoint

/institutions

Response

{
  institutions: [
    {institution object 1},
    ...,
    {institution object n}
  ]
}

Clarification/Discussion

Returns all the institutions a user is authorized to file for (usually 1, but more for vendors and in some cases holding/parent companies). Could alternatively return an object keyed by the institution objects' ids, which would be a similar pattern to option 2 above for how the submissions object could be structured.

Endpoint

/edits/rows?institution=<institution id>&submission=<submission id>&page=<page>

Response

{
  rows: [
    {
      row: <row>,
      edits: [
        {
          id: <edit id>,
          type: <edit type>,  
          message: <edit failure message>,
          verification: <verification if quality>
        },
        ...
      ]
    },
    ...
  ]
}

Clarification/Discussion

<row> is the LAR in question.

<edit id> is the weird edit id, e.g. Q044. This can be used to either link to the edit text or embed the edit text in the row without having to send it over the wire for each row. This also applies to classification of edits (e.g. HOEPA edits, loan amount, etc.). A map of edit categories can be loaded separately if needed.

<edit type> is syntactical/validity/quality (macro cannot be displayed by row).

<verification> will only be present for quality edits; the key will not be part of the syntactical and validity responses. These probably shouldn't be separated into different endpoints, because the purpose of the /rows/ endpoint is to collect the different edit types together so that API consumers don't have to reorganize the data.

Possible valueSubmitted field. Left out because it may not be needed if the <edit failure message> is explicit enough. I'd prefer to rely on the message, since the edit check code has more knowledge about the context of the submitted value and can couch it properly. Also, I imagine valueSubmitted and the message would be very similar most of the time, e.g. "Submitted 2 in the foo field when it must be bar or baz."
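As a rough illustration of how a consumer might use the combined /edits/rows shape, the sketch below walks the response and collects quality edits that still need verification. It assumes an unverified quality edit carries an empty or null verification value, which is not decided anywhere above, and the function name is made up.

```python
def unverified_quality_edits(response: dict) -> list:
    """Collect (row, edit id) pairs for quality edits with no verification yet.

    Assumption: syntactical and validity edits omit the verification key
    entirely, and an unverified quality edit has an empty or null value.
    """
    pending = []
    for entry in response.get("rows", []):
        for edit in entry.get("edits", []):
            if edit.get("type") == "quality" and not edit.get("verification"):
                pending.append((entry["row"], edit["id"]))
    return pending
```

Because every edit for a row arrives in one object, the consumer never has to join separate per-type responses, which is the point of keeping the types together in this endpoint.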

Endpoint

/edits/ids?type=<syntactical|validity|quality|macro>&institution=<institution id>&submission=<submission id>&page=<page>

Response

For syntactical/validity

{
  type: <edit type>,
  ids: [
    {
      id: <edit id>,
      rows: [
        {
          row: <row>,
          message: <edit failure message>
        },
        ...
      ]
    },
    ...
  ]
}

For quality:

{
  type: <edit type>,
  ids: [
    {
      id: <edit id>,
      rows: [
        {
          row: <row>,
          message: <edit failure message>,
          verification: <verification>
        },
        ...
      ]
    },
    ...
  ]
}

For macro:

{
  type: <edit type>,
  ids: [
    {
      id: <edit id>,
      message: <edit failure message>,
      verification: <verification>
    },
    ...
  ]
}

Clarification/Discussion

The different edit types are not returned in a single call because it often makes sense to query them separately (as they have different signatures).

Endpoint

/progress?institution=<institution id>&submission=<submission id>

Response

{
  status: <current status>,
  editCounts: {
    syntactical: <count>,
    validity: <count>,
    quality: <count>,
    macro: <count>
  }
}

Clarification/Discussion

Returns the status and editCounts for a submission of an institution. This endpoint should support long-polling
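For discussion, here is a small client-side sketch of consuming /progress. It is a plain polling loop with the HTTP call injected as a function; with real long-polling, each fetch call would simply block until the server has something new. The status names and the function signature are assumptions, not part of the proposal.

```python
import time
from typing import Callable


def poll_progress(fetch: Callable[[], dict], done_statuses: set,
                  max_attempts: int = 10, delay_seconds: float = 0.0) -> dict:
    """Call fetch() until its status is in done_statuses or attempts run out.

    fetch is expected to return the /progress response as a dict, e.g.
    {"status": ..., "editCounts": {...}}. Returns the last response seen.
    """
    result = {}
    for _ in range(max_attempts):
        result = fetch()
        if result.get("status") in done_statuses:
            break
        time.sleep(delay_seconds)
    return result
```

Injecting fetch keeps the loop independent of whichever HTTP client the UI ends up using.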

POST requests

Endpoint

/upload

Post data

The LAR file, encoded via multipart/form-data

Response

HTTP 201

Clarification/Discussion

Upload progress can be handled client-side, so this post can be fairly simple

Endpoint

/verify?institution=<institution id>

Post data

{
  id: <edit id>,
  rows: <an array of 1-n rows to verify with the verification text>,
  verification: <verification text>
}

Response

HTTP 201

Clarification/Discussion

Always/only works on the latest submission. The institution COULD be part of the post data, but as listed it is more in line with the function of the GET endpoints.

Endpoint

/sign?institution=<institution id>

Post data

none/TBD auth stuff

Response

HTTP 201

Clarification/Discussion

Actual form of the signature/final submission is TBD, but whatever it may be it will need an endpoint similar to the above.

@wpears, thanks for putting this together. Lots of good stuff here. It's given me lots of ideas. For now, I'll start with URL patterns.

I propose (almost) all resources are based on which institution (/institutions) it falls under. Users will be associated to institutions, and it will make authorization much simpler if we can simply check all requests against that root of the URL.

I also feel like the resources in general could be more hierarchical. For similar reasons, it'd be much easier to define rules at each hierarchical level rather than on a per-endpoint basis as we'd have to do with a flattened model. What are your thoughts on something like:

/institutions
- GET - List of all institutions available to a given account.
- POST - We could use this resource for adding new FIs to the system. This would not be exposed for public use.
/institutions/12345678
- GET - Detailed attributes for a given FI.
- PUT/PATCH - Again, we could update individual FIs here, but would not be available to the public.
...and if we want to make it even friendlier, we could incorporate a slugified version of the FI's name with the id.
- /institutions/12345678-bank-of-bankerton
/institutions/12345678/users
- GET - List of all users (and perhaps their status) for a given FI.
- POST - Associate a given user with an FI.
- This one is iffy. This may take place in a separate system, and/or may not be available to the public.
Note: There is likely a separate URL structure related to users rooted at /users. More thoughts on this coming soon...
/institutions/12345678/filings
- GET - List of filing periods with the status of each.
Note: "Filing" may not be the right term. Open to something better.
/institutions/12345678/filings/2018/q1/
- GET - List all submissions for a given filing period. I threw in the q1 since we'll need to support quarterly filings for large FIs sometime soon. If this feels too granular, we could have a .../2018-q1/, or something similar.
- POST - This seems like the logical place for file upload to go.
/institutions/12345678/filings/2018/q1/1
- GET - Details for an individual submission.
/institutions/12345678/filings/2018/q1/1/progress
- GET - Long-poll-y type endpoint that returns the progress of each edit type for a given submission.
- This may not be necessary if the parent URL is lightweight enough. Not sure. I haven't worked with long-polling much to know if this is necessary.
/institutions/12345678/filings/2018/q1/1/edits
- GET - List of all edits for a given submission.
- As @wpears suggested above, we'd likely have additional query parameters here for pagination and to filter edits by type, status, etc.
- POST (PUT?) - If we want to support bulk update of multiple edits at once, this seems like a logical place.
/institutions/12345678/filings/2018/q1/1/edits/1234
- GET - Details for a given edit.
- Perhaps this is not necessary if the parent resource has all data for a given edit.
- PUT/PATCH - Update a given edit, most likely providing justification on quality and macro edits.
- Also may not be necessary if we always used the above-mentioned bulk update for all changes to edits.
- An alternative could be a POST to /institutions/12345678/filings/2018/q1/1/edits/1234/verify
/institutions/12345678/filings/2018/q1/1/sign
- POST - Once all edits are complete, "sign" a given submission.
- PUT/DELETE - We may need a way to "unsign" a given submission.
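The authorization argument from the start of this comment (check every request against the /institutions root) could look roughly like the sketch below. It assumes institution ids are numeric and that an optional lowercase slug may follow the id; both function names are hypothetical.

```python
import re
from typing import Optional

# Matches /institutions/<numeric id>, optionally followed by "-<slug>",
# then either the end of the path or a deeper resource.
INSTITUTION_PATH = re.compile(r"^/institutions/(\d+)(?:-[a-z0-9-]+)?(?:/|$)")


def institution_id_from_path(path: str) -> Optional[str]:
    """Extract the institution id from a hierarchical path, or None."""
    match = INSTITUTION_PATH.match(path)
    return match.group(1) if match else None


def is_authorized(path: str, user_institutions: set) -> bool:
    """True if the path is rooted at an institution the user may file for.

    The bare /institutions listing has no id and would be handled separately.
    """
    inst_id = institution_id_from_path(path)
    return inst_id is not None and inst_id in user_institutions
```

One rule at the root then covers filings, edits, and sign without per-endpoint checks.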

Good stuff Hans.

Originally I had the structure a bit more hierarchical but changed to more stuff in the querystring because I thought it made the endpoints clearer at a glance (/edits/row?... is clearly most concerned with row-aggregated edits). That said, I am completely okay with the hierarchical approach, especially if it makes defining certain rules easier.

Some other thoughts:

filings could just be year if the quarter is in its own path separator (i.e. year/2018/q1/...). Makes less sense if the quarter is incorporated in the name (2018-q1). Also, it almost seems quarter should have its own collection descriptor in the URI (/year/2018/quarter/q1). This seems a shame because the quarters will always be the same and there are only four, but if the quarter is going in the path of the URI, I think it may be necessary. Also, we'll have to make sure that the quarter part of the URI can be ignored by FIs that don't file quarterly, e.g. /institutions/12345678/filings/2018/quarter/q1/1/sign and /institutions/12345678/filings/2018/1/sign are both valid. The quarter could also go in the querystring ^^;
POSTing to /institutions/12345678/filings/2018/ for the file submission is pretty cool, :+1:
I think a per-edit query is fine/could be useful/makes sense in a hierarchical URI structure. Do note that many rows may be returned that all fail this edit.
We'll need a row-based way to query edits, maybe /institutions/12345678/filings/2018/rows and /institutions/12345678/filings/2018/rows/1. Is it clear that this URI points to edits (of all types) at a given row/LAR? Should we prefer /institutions/12345678/filings/2018/lars?
Do we want a way to isolate a given row/LAR for a given edit? Doesn't really fit the /edits/ /rows paradigm as they seem like separate structures. If we don't need this, though, then no worries.
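To show that the optional quarter segment from the first point is workable, here is a sketch of a route pattern that accepts both example sign URIs. The pattern, the group names, and the assumption that ids are numeric and quarters are q1-q4 are mine, for illustration only.

```python
import re
from typing import Optional

# Accepts .../filings/<year>/<submission>/sign and
# .../filings/<year>/quarter/<q1-q4>/<submission>/sign
SIGN_PATH = re.compile(
    r"^/institutions/(?P<institution>\d+)/filings/(?P<year>\d{4})"
    r"(?:/quarter/(?P<quarter>q[1-4]))?"
    r"/(?P<submission>\d+)/sign$"
)


def parse_sign_path(path: str) -> Optional[dict]:
    """Return the path components as a dict, or None if the path is not valid.

    quarter is None for institutions that do not file quarterly.
    """
    match = SIGN_PATH.match(path)
    return match.groupdict() if match else None
```

The /quarter/ collection descriptor is what keeps the two forms unambiguous; without it, q1 and a submission id would compete for the same path position.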

@wpears @hkeeler I'm not a big fan of the quarter in the URL, given that it only covers some institutions, and it's far into the future (i.e. not what we are building right now, and not for 2018 either). The quarter can be derived from the response if it includes a timestamp too, or as a query string as @wpears is suggesting.

I have a question about the URL for the row-based edits query, /institutions/12345678/filings/2018/rows/1. What is the identifier for the row (last digit)? The loan id? What is the identifier after "/institutions"? Are we certain this FI identifier is unique?

The row identifier is line number for the given submission, which I think would be fairly useful for FIs (oh line 34 has a syntax edit.. I'll look at line 34).

I think the institution identifier is something WE generate so we can guarantee it is unique. When I was talking about "creating" institutions in our system, this is what I meant. When someone registers to file on behalf of an FI, we create a unique ID to represent that institution and then associate that unique, internal identifier to whatever the institution is they claim to be filing for. Outside of guaranteeing uniqueness, this also allows us a way to differentiate people if, e.g., two employees both start filing for the same institution (which wouldn't be possible if we just associate a user to some concatenation of institution/panel data).

@wpears I don't think line number is a good identifier. How can we guarantee that every row is kept in the same place on every file submission? Makes it hard to compare possible errors for the same loan across different submissions. Also, the backend is asynchronous and cannot guarantee ordering of processing. There are ways to keep line numbers around but it is cumbersome as things may not be processed serially as they appear on the file.

The second question gets into panel generation and identity management. We can of course generate our own unique id, but we will have to maintain it and it will be different than anything else that everyone else uses. So I'm curious to see if we can define how institutions are identified today, and use that, keeping in mind that all of this changes when we get LEI.

On the first point, there is an edit for the loan id to be unique within a submission (S040), so that can be used as an identifier instead of the row number

Yeah, I'd thought of the cross-submission problem for rows, but didn't know if that type of comparison would be run (but it makes sense). Using the loan id field sounds better though, so let's move forward with that.

I think this means the rows endpoint I listed in my response to Hans, if we're using loan id, makes more sense as lars (so /institutions/12345678/year/2018/lars/1234).

Also for the /edits/ endpoint the structure will be something like (note loan id):

{
  type: <edit type>,
  ids: [
    {
      id: <edit id>,
      lars: [
        {
          id: <loan id>,
          message: <edit failure message>
        },
        ...
      ]
    },
    ...
  ]
}
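Keying on loan id is what makes the cross-submission comparison possible. A rough sketch, assuming two responses of the shape above (one per submission); the function names are invented and the ids used in any sample data would be placeholders:

```python
def failures_by_loan(response: dict) -> dict:
    """Map loan id -> set of edit ids that the loan failed in one submission."""
    result = {}
    for edit in response.get("ids", []):
        for lar in edit.get("lars", []):
            result.setdefault(lar["id"], set()).add(edit["id"])
    return result


def still_failing(previous: dict, current: dict) -> dict:
    """Edits that the same loan failed in both the previous and current submission."""
    before = failures_by_loan(previous)
    after = failures_by_loan(current)
    repeated = {}
    for loan in before.keys() & after.keys():
        common = before[loan] & after[loan]
        if common:
            repeated[loan] = common
    return repeated
```

With line numbers as the identifier, the same comparison would break as soon as rows were reordered between uploads.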

re: institution identifier, I think making our own identifier makes the most sense particularly if we are allowing self-service. AFAIK, there isn't a universal, non-public identifier we can use to prove a bank is who they say they are if we've never encountered them before (filing off panel, etc). Of course, we could use public bank info if we also require some key/sessionID that gets sent to banks where we have a known contact, but that essentially rules out self-service. By creating our own identifier that is associated with public, unique bank info, we can allow self-service and also have an easy way to manage conflicts (these two users are filing for the same FI, we need to call them).

Just as a recap to help me put this all together, and to add some of my thoughts, suggestions, and questions.

This comment ignores responses for now, I just want to make sure we all agree on the API endpoints. I think, to avoid clutter in this issue, as we agree on an endpoint we should create a new issue and have that issue focus on the response.

GET /institutions

list all institutions for an account

GET /institutions/{id}

details about an institution
possible slug version with /{id}-{bank-name}

POST /institutions/{id}

update details about an institution
possible slug version with /{id}-{bank-name}

GET /institutions/{id}/users

list of users for an institution

GET /institutions/{id}/users/{id}

details of a specific user

GET /institutions/{id}/progress (long polling)

return the ongoing progress of the latest upload
if the progress is complete, meaning all edits have been checked (there could be errors), this would essentially return GET /institutions/{id}/filings/latest (see below)

GET /institutions/{id}/filings

list of filings for an institution for the current year
adding ?year={year} would list the history for a given year
- ?year=all would provide the full history
adding ?quarter={quarter} would list the history for a given quarter (?year={year} would be required)
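The dependency between ?quarter and ?year could be enforced with something as small as the sketch below. The function name and the choice to treat a missing year as "current year" are assumptions drawn from the bullets above, not an agreed design.

```python
def validate_filing_query(params: dict) -> dict:
    """Check the ?year / ?quarter rules for the filings endpoint.

    - no year: the current year is implied (represented here as None)
    - year may be a specific year or "all"
    - quarter is only allowed when year is also given
    """
    year = params.get("year")
    quarter = params.get("quarter")
    if quarter is not None and year is None:
        raise ValueError("?quarter requires ?year")
    return {"year": year, "quarter": quarter}
```

Whether ?quarter combined with ?year=all should be accepted is left open here, since the summary does not say.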

POST /institutions/{id}/filings

submission of data for the current year
adding ?year={year} would allow re-submission of data
adding ?quarter={quarter} would allow re-submission of data for a given quarter (?year={year} would be required)

GET /institutions/{id}/filings/{id}

details of a filing
- /latest would always give us the current/most recent filing details

GET /institutions/{id}/filings/{id}/edits

list of edits for a given submission
this might be only the first n edits from each type (think paging)
adding ?type={type} would list only the edits of a specific type
adding ?page={page} would allow for paging based on the n default we choose
adding ?per_page={number} would allow for changing the n default we choose
- we may not need (or want) ?per_page={number}
don't think we need a GET /institutions/{id}/filings/{id}/edits/{id}, the above call should provide the details?
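For the paging parameters, a minimal sketch of the slicing the server might do. The default of 20 per page and the response keys (edits, page, per_page, total) are placeholders I picked, since the n default has not been chosen.

```python
def paginate(items: list, page: int = 1, per_page: int = 20) -> dict:
    """Return one page of items plus the paging metadata a client would need.

    Pages are 1-indexed; a page past the end yields an empty list.
    """
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive integers")
    start = (page - 1) * per_page
    return {
        "edits": items[start:start + per_page],
        "page": page,
        "per_page": per_page,
        "total": len(items),
    }
```

Returning total alongside the page is one way for the UI to render paging controls without a second request.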

POST /institutions/{id}/filings/{id}/edits/{id}

allow for verification for quality, macro, and institution register summary (IRS)
no need for GET?
- the current status of an edit, from GET /institutions/{id}/filings/{id}/edits, would be enough?

GET /institutions/{id}/filings/{id}/irs

return summary data for the IRS report (viewing requires all quality and macro verifications to be complete)

POST /institutions/{id}/filings/{id}/sign

allow for signature (final submission)

During discussion yesterday I think having the ability to have a /year breakdown of these endpoints is also good.

But I do think the most logical breakdown starts with our institutions. While the date/time part of all this is also important, starting with institutions seems to allow our app, and our API endpoints, the ability to focus on the current year (and later the quarter) filings. Current filings seem to be the most important part of this and will be the bulk of the work, I think, for every year. Using query string parameters for the year and quarter still gives us the ability to allow for re-submissions and the ability to pull past data.

This pattern also seems to prevent our endpoint paths from becoming too unwieldy.

Thoughts? Did I miss some?

@awolfe76 Thanks for the summary, I think this helps. I do have some questions about specific endpoints, but they are more questions due to unclear (unknown) data models at this point (i.e. we are centering everything on institutions, but most of the backend work to this point has been on edits in isolation). I think what makes sense is to divide these into individual tickets (or maybe grouped as appropriate) and start implementing them, or at least drive the discussion deeper where it is needed. Overall though, I think this is a good point of departure.

So this differs from how the API is implemented at https://github.com/cfpb/hmda-platform-ui/pull/99 and I think needs to take into account some of the year vs. institution stuff we spoke about in standup (which I'll have coming in another PR on the platform-ui).

So before we split these things out, I'd say we should probably wait for the next two PRs in the platform-ui repo which should help prove out some of the ideas here.

cfpb / hmda-platform

Define public API #261
