cfpb / hmda-platform

The HMDA Submission backend applications.
Creative Commons Zero v1.0 Universal
102 stars 93 forks source link

Define public API #261

Closed jmarin closed 8 years ago

jmarin commented 8 years ago

See discussion and possible endpoints here

wpears commented 8 years ago

Below are some thoughts on the structure of different endpoints, I encourage comments at any level of detail. I was parsimonious in the proposed responses, but in general attempted to allow responses to be flexible/accommodate change. For example, the /edits/rows endpoint returns an object with a single rows key rather than just the rows array because we may want to return metadata on every request (timestamp, etc)).

We may want to create separate issues for each endpoint for discussion, but for now, enjoy this lengthy comment.

GET requests

Endpoint

/institutions/<id>

Response

{
  id: <id>,
  name: <instituion name>,
  submissions: [
    {
      id: <incremental submission id>,
      status: <current status>,
      timestamp: <timestamp>
    }
  ]
}

Clarification/Discussion

<id> A unique id generated for the institution in question (when/where this is generated tbd) <incremental submission id> unique id per institution, incrementing with resubmissions

Submissions as an array: there are several options here:

currentSubmission Do we/should we provide a currentSubmission key so consumers can more easily get the latest submission (which is usually what people will be interacting with). Also, guaranteeing the order of submissions (eg sorting either descending or ascending on id) would provide just as much ease of use

Guarantees on incremental submission id: should we use a counting service/CRDT? In practice, there shouldn't really be an issue, as new submissions will mostly be infrequent and spaced apart

Lack of year info: I think we should have different endpoints for each year, if not every year, cf.gov/hmda/2018/..., so including it in the response would be redunant (though I have the id in the response which is also redundant by this logic)

Edit Counts? The UI has been mocking up some components that assume the the submission object will have edit counts. This isn't really necessary imo, as the status will give all the actionable info (resubmit/verify/sign, etc) and a click through to the page where one can view edits will display edit counts

Should /submissions/<id> be its own endpoint? I don't think so, but I could see an argument for it

Endpoint

/institutions

Response

{
  institutions: [
    {institution object 1},
    ...,
    {institution object n}
  ]
}

Clarification/Discussion

Returns all the institutions a user is authorized to file for (usually 1, but more for vendors and in some cases holding/parent companies) Could alternatively return an object keyed by the institution objects' ids, which would be a similiar pattern to option 2 above for how the submissions object could be structured

Endpoint

/edits/rows?institution=<institution id>&submission=<submission id>&page=<page>

Response

{
  rows: [
    {
      row: <row>,
      edits: [
        {
          id: <edit id>,
          type: <edit type>,  
          message: <edit failure message>,
          verification: <verification if quality>
        },
        ...
      ]
    },
    ...
  ]
}

Clarification/Discussion

<row> is the LAR in question <edit id> is the weird edit id, eg Q044 This can be used to either link to the edit text or embed the edit text in the row without having to send it over the wire for each row. This also applies to classification of edits (eg HOEPA edits, loan amount, etc). A map of edit categories can be loaded separately if needed <edit type> is syntactical/validity/quality (macro cannot be displayed by row)

<verification will only be present for quality edits, the key will not be part of the syntactical and validity responses. These probably shouldn't be separated in different endpoints because the purpose of the /rows/ endpoint is to collect the different edit type of endpoint together and not to require data reorganization for api consumers

Possible valueSubmitted field. Left out because may not be needed if the <edit failure message> is explicit enough. I'd prefer to rely on the message, since the edit check code has more knowledge about the context of the submittedValue and can couch it properly. Also, I imagine valueSubmitted and the message would be very similar most of the time, eg. "Submitted 2 in the foo field when it must be bar or baz."

Endpoint

/edits/ids?type=<syntactical|validity|quality|macro>&institution=<institution id>&submission=<submission id>&page=<page>

Response

For syntactical/validity

{
  type: <edit type>,
  ids: [
    {
      id: <edit id>,
      rows: [
        {
          row: <row>,
          message: <edit failure message>
        },
        ...
      ]
    },
    ...
  ]
}

For quality:

{
  type: <edit type>,
  ids: [
    {
      id: <edit id>,
      rows: [
        {
          row: <row>,
          message: <edit failure message>,
          verification: <verification>
        },
        ...
      ]
    },
    ...
  ]
}

For macro:

{
  type: <edit type>,
  ids: [
    {
      id: <edit id>,
      message: <edit failure message>,
      verification: <verification>
    },
    ...
  ]
}

Clarification/Discussion

The different edit types not returned in a single call because it often makes sense to query them separately (as they have different signatures)

Endpoint

/progress?institution=<institution id>&submission=<submission id>

Response

{
  status: <current status>,
  editCounts: {
    syntactical: <count>,
    validity: <count>,
    quality: <count>,
    macro: <count>,
  }
}

Clarification/Discussion

Returns the status and editCounts for a submission of an institution. This endpoint should support long-polling

POST requests

Endpoint

/upload

Post data

The LAR file, encoded via multipart/form-data

Response

HTTP 201

Clarification/Discussion

Upload progress can be handled client-side, so this post can be fairly simple

Endpoint

/verify?institution=<institution id>

Post data

{
  id: <edit id>,
  rows: <an array of 1-n rows to verify with the verification text>,
  verification: <verification text>
}

Response

HTTP 201

Clarification/Discussion

Always/only works on the latest submission The institution COULD be part of the post data, but as listed it is more in line with the function of the GET endpoints

Endpoint

/sign?institution=<institution id>

Post data

none/TBD auth stuff

Response

HTTP 201

Clarification/Discussion

Actual form of the signature/final submission is TDB, but whatever it may be it will need an endpoint similar to the above

hkeeler commented 8 years ago

@wpears, thanks for putting this together. Lots of good stuff here. It's given me lots of ideas. For now, I'll start with URL patterns.

I propose (almost) all resources are based on which institution (/institutions) it falls under. Users will be associated to institutions, and it will make authorization much simpler if we can simply check all requests against that root of the URL.

I also feel like the resources in general could be more hierarchical. For similar reasons, it'd be much easier to define rules at each hierarchical level rather than on a per-endpoint basis as we'd have to do with a flattened model. What are you thoughts on something like:

wpears commented 8 years ago

Good stuff Hans.

Originally I had the structure a bit more hierarchical but changed to more stuff in the querystring because I thought it make the endpoints clearer at a glance (/edits/row?... is clearly most concerned with row-aggregated edits). That said, I am completely okay with the hierarchical approach, especially if it makes defining certain rules easier.

Some other thoughts:

jmarin commented 8 years ago

@wpears @hkeeler I'm not a big fan of the quarter in the URL, given that it only covers some institutions, and it's far into the future (i.e. not what we are building right now, and not for 2018 either). The quarter can be derived from the response if it includes a timestamp too, or as a query string as @wpears is suggesting.

I have a question about the URL for row-based edits query ,/institutions/12345678/filings/2018/rows/1. What is the identifier for the row (last digit)? The loan id? What is the identifier after "/institutions"?. Are we certain this FI identifier is unique?

wpears commented 8 years ago

The row identifier is line number for the given submission, which I think would be fairly useful for FIs (oh line 34 has a syntax edit.. I'll look at line 34).

I think the institution identifier is something WE generate so we can guarantee it is unique. When I was talking about "creating" institutions in our system, this is what I meant. When somewhen registers to file on behalf of an FI, we create a unique ID to represent that institution and then associate that unique, internal identifier to whatever the institution is they claim to be filing for. Outside of guaranteeing uniqueness, this also allows us a way to differentiate people if, eg, two employees both start filing for the same institution (which wouldn't be possible if we just associate a user to some concatenation of institution/panel data).

jmarin commented 8 years ago

@wpears I don't think line number is a good identifier. How can we guarantee that every row is kept in the same place on every file submission? Makes it hard to compare possible errors for the same loan across different submissions. Also, the backend is asynchronous and cannot guarantee ordering of processing. There are ways to keep line numbers around but it is cumbersome as things may not be processed serially as they appear on the file.

The second question gets into panel generation and identity management, we can of course generate our own unique id, but we will have to maintain it and it will be different than anything else that everyone else uses. So I'm curious to see if we can define how institutions are identified today, and use that. Keeping in mind that all of this changes when we get LEI

jmarin commented 8 years ago

On the first point, there is an edit for the loan id to be unique within a submission (S040), so that can be used as an identifier instead of the row number

wpears commented 8 years ago

Yeah, I'd thought of the cross-submission problem for rows, but didn't know if that type of comparison would be run (but it makes sense). Using the loan id field sounds better though, so let's move forward with that.

I think this means the rows endpoint I listed in my response to Hans, if we're using loan id, makes more sense as lars... so /institutions/12345678/year/2018/lars/1234).

Also for the /edits/ endpoint the structure will be something like (note loan id):

{
  type: <edit type>,
  ids: [
    {
      id: <edit id>,
      lars: [
        {
          id: <loan id>,
          message: <edit failure message>
        },
        ...
      ]
    },
    ...
  ]
}

re: institution identifier, I think making our own identifier makes the most sense particularly if we are allowing self-service. AFAIK, there isn't a universal, non-public identifier we can use to prove a bank is who they say they are if we've never encountered them before (filing off panel, etc). Of course, we could use public bank info if we also require some key/sessionID that gets sent to banks where we have a known contact, but that essentially rules out self-service. By creating our own identifier that is associated with public, unique bank info, we can allow self-service and also have an easy way to manage conflicts (these two users are filing for the same FI, we need to call them).

awolfe76 commented 8 years ago

Just as a recap to help me put this all together, and to add some of my thoughts, suggestions, and questions.

This comment ignores responses for now, I just want to make sure we all agree on the API endpoints. I think, to avoid clutter in this issue, as we agree on an endpoint we should create a new issue and have that issue focus on the response.

GET /institutions

GET /institutions/{id}

POST /institutions/{id}

GET /institutions/{id}/users

GET /institutions/{id}/users/{id}

GET /institutions/{id}/progress (long polling)

GET /institutions/{id}/filings

POST /institutions/{id}/filings

GET /institutions/{id}/filings/{id}

GET /institutions/{id}/filings/{id}/edits

POST /institutions/{id}/filings/{id}/edits/{id}

GET /institutions/{id}/filings/{id}/irs

POST /institutions/{id}/filings/{id}/sign

During discussion yesterday I think having the ability to have a /year breakdown of these endpoints is also good.

But I do think the most logical breakdown starts with our institutions. While the date/time part of all this is also important, starting with institutions seems to allow our app, and our API endpoints, the ability to focus on the current year (and later the quarter) filings. Current filings seem to be the most important part of this and will be the bulk of the work, I think, for every year. Using query string parameters for the year and quarter still gives us the ability to allow for re-submissions and the ability to pull past data.

This pattern also seems to prevent our endpoint paths from becoming too unwieldy.

Thoughts? Did I miss some?

jmarin commented 8 years ago

@awolfe76 Thanks for the summary, I think this helps. I do have some questions about specific endpoints, but they are more questions due to unclear (unknown) data models at this point (i.e. we are centering everything on institutions, but most of the backend work to this point has been on edits in isolation). I think what makes sense is to divide these into individual tickets (or maybe grouped as appropriate) and start implementing them, or at least drive the discussion deeper where it is needed. Overall though, I think this is a good point of departure.

wpears commented 8 years ago

So this differs from how the API is implemented at https://github.com/cfpb/hmda-platform-ui/pull/99 and I think needs to take in to account some of the year vs. institution stuff we spoke about in standup (which I'll have coming in another PR on the platform-ui).

So before we split these things out, I'd say we should probably wait for the next two PRs in the platform-ui repo which should help prove out some of the ideas here.