GALAglobal / TAPICC-API-implementation

TAPICC API implementation using node.js framework sails.js

Proposal to alter the data model #58

Closed: jcompton-moravia closed this issue 5 years ago

jcompton-moravia commented 5 years ago

In our WG#4 discussion on December 5th, we discussed an alteration to the data model based on the idea of Inputs and Outputs connected via a Task. We're calling the outputs created by a Task "Deliverables" because that is, by definition, what a deliverable is. Inputs can be instructions, reference material, glossaries, TMs, or "files" that need to be transformed (which includes JSON-serialized XLIFF, of course).

A Task can have any number of Inputs (maybe even zero, if the "directive" can be defined by the Task's metadata) and can produce any number of Deliverables. Depending on the Task type, those Deliverables may or may not be a direct transform of an Input. Where a Deliverable has a direct, transformative relationship with an Input file (e.g. "Charmander" turns into "Charmeleon"), that Deliverable is associated with an Input ID. Either way, a Deliverable is always associated with the Inputs used to create it, because it is a child of a Task (which is associated with Inputs).

The idea of decoupling Inputs from Tasks, that is, not dictating a strict parent/child taxonomy between tasks and files (or vice versa), is designed to make it easy for any CMS, TMS, or other system that wants to be part of a TAPICC ecosystem to "map" its worldview onto this data model. It is a direct response to feedback that previous data models were too rigid and in conflict with the data models of systems that TAPICC needs to connect to.

Please see a sketch of the data model below, and share your questions, concerns, and ideas. Thx!

[Image: new_tapicc_data_structure_idea]

terales commented 5 years ago

I love your idea; here is my attempt at a more precise data model based on it (webhooks and all other data types are omitted for now):

[Image: data model diagram] Source file for the visualization in Visio 2016 format

Concerns:

Andrew Gibbons @assembledStarDust: how well would arrays scale?

Let's put a 65,535-item restriction on the size of identifier arrays. In the edge case of 4-byte UTF-8 characters used in IDs that are 1024 characters long, that would take up to 270 MB of data transfer, or 540 MB if two references were filled to the maximum, but in 99.9% of cases it would be much less.
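
A quick back-of-the-envelope check of that worst case (65,535 entries, 1024 characters per ID, 4 bytes per character):

```js
// Worst-case payload size for one array of identifiers
const maxItems = 65535;      // proposed cap on array length
const idLength = 1024;       // characters per identifier
const bytesPerChar = 4;      // 4-byte UTF-8 edge case

const bytes = maxItems * idLength * bytesPerChar;
console.log(bytes);               // 268435456 bytes, i.e. roughly 270 MB
console.log(bytes * 2 / 1e6);     // ~537 MB if two references are filled to the maximum
```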

Also, we would introduce endpoints for quick polling, which would exclude all arrays, as well as special ones that would supply these arrays only with REST-based pagination:

/jobs/{Job.id}/meta (omits all references, only metadata)

/jobs/{Job.id}/inputs (respects REST pagination)
/jobs/{Job.id}/inputs/meta (gets all inputs with internal files only)

/jobs/{Job.id}/tasks/{Task.id}
/jobs/{Job.id}/tasks/{Task.id}/meta (omits all references, only metadata)
/jobs/{Job.id}/tasks/{Task.id}/inputs (respects REST pagination)
/jobs/{Job.id}/tasks/{Task.id}/inputs/meta (gets all inputs with internal files only)
/jobs/{Job.id}/tasks/{Task.id}/deliverables (respects REST pagination)
/jobs/{Job.id}/tasks/{Task.id}/deliverables/meta (gets all deliverables with internal files only)

…
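
To make the pagination idea concrete, here is a minimal sketch of what such a paginated listing could return; the field names (items, total, limit, offset) and the values are illustrative assumptions, not part of the spec:

```js
// Hypothetical response of GET /jobs/{Job.id}/tasks/{Task.id}/inputs?limit=50&offset=0
const pageOfInputs = {
  total: 1200,   // total number of Inputs attached to this Task
  limit: 50,     // page size requested by the client
  offset: 0,     // position of the first item in this page
  items: [
    { id: 'in-0001', name: 'chapter-01.xlf', fileType: 'bitext' },
    { id: 'in-0002', name: 'styleguide.pdf', fileType: 'reference' }
    // ...the next pages are fetched with offset=50, offset=100, and so on
  ]
};
```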

Jim Compton @jcompton-moravia: what if we have only instructions or links as a Task's input or deliverable? Or even just a date as a Task deliverable?

what if we have only instructions or links as a Task's input or deliverable?

We can just pass it as a text file, and the underlying HTTP protocol would handle all of the encoding, compression, and type-detection work.

Or even just a date as a Task deliverable?

Then we can just change the Task status and set deliveredDate to the date we need.
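
For illustration, both answers as rough HTTP calls (the endpoints and payload shapes are assumptions; only status and deliveredDate come from the model being discussed):

```js
// Inside an async function.
// 1. Instructions passed as a plain text file attached to an Input (hypothetical endpoint)
await fetch('/jobs/42/tasks/7/inputs/13/files', {
  method: 'POST',
  headers: { 'Content-Type': 'text/plain; charset=utf-8' },
  body: 'Please keep all product names in English.'
});

// 2. A "date-only" deliverable expressed as a Task status change (hypothetical endpoint)
await fetch('/jobs/42/tasks/7', {
  method: 'PATCH',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ status: 'delivered', deliveredDate: '2019-02-01' })
});
```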

jcompton-moravia commented 5 years ago

Quick point: "Job" remains the loose binding that can associate Tasks together if they're related, for example, some translation activity that needs to be performed in multiple languages. A Task can be thought of as a discrete unit of work, executed by a single entity, as far as the Task creator is concerned. In a real-world workflow that Task can be broken up into child tasks and distributed to different executors by any system that can process it.

terales commented 5 years ago

In a real-world workflow that Task can be broken up into child tasks and distributed to different executors by any system that can process it.

TAPICC API doesn't care: one TAPICC Job is one transaction between two parties.

If the Customer has a job inside their TMS that has TaskA, TaskB, and TaskC for three different vendors, they would send three separate TAPICC Jobs, one per vendor, where each TAPICC Job contains only the tasks available to that Vendor. The TAPICC API doesn't know anything about, or deal in any way with, the internal structure of a job inside their TMS.
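
A sketch of that mapping (all names and fields here are illustrative, not part of the model):

```js
// One job in the Customer's TMS...
const tmsJob = {
  name: 'Spring release',
  tasks: [
    { name: 'TaskA', vendor: 'vendor-1' },
    { name: 'TaskB', vendor: 'vendor-2' },
    { name: 'TaskC', vendor: 'vendor-3' }
  ]
};

// ...becomes three independent TAPICC Jobs, one transaction per vendor.
const tapiccJobs = tmsJob.tasks.map(task => ({
  name: `${tmsJob.name} / ${task.name}`,
  vendor: task.vendor,            // hypothetical field: which party this Job is sent to
  tasks: [{ name: task.name }]    // only the tasks visible to that vendor
}));
```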

Also, the Customer does not learn anything from the TAPICC API about the Vendor's internal processing of the assigned tasks. For example, a Translation task type is usually a single task for the Customer but a whole workflow for the Vendor.

Jim, does this make sense?

Alino commented 5 years ago

I prepared a few questions during the meeting and then asked Jim for answers:

  1. Are Deliverables created at the time a Task is created (by a human or a computer), or when the Deliverable is about to be uploaded by a human?

    Deliverables are not created when the Task is created. They are created by the executor of the Task.

  2. When Deliverables are uploaded, are new Inputs for them automatically created as well, so that you can associate Tasks with them?

    I suppose we could have some method that deliberately creates an Input from a Deliverable, but I also think we could just make this the responsibility of the workflow system. That is, it's TAPICC's job to retrieve that file, but it is the backend system's responsibility to version-control it. What are your thoughts about turning outputs into inputs for different Tasks? Should that be a function of TAPICC?

Well, both make sense; it's hard for me to guess which is more convenient for the implementers or users.

  1. What if the user needs to create a chain of Tasks that are sequential? Does the user have to wait until the Deliverables are finished before creating the next Tasks?

    When there's a chain of sequential tasks, where that sequence is managed by a separate workflow system, and where the output of one task is required as the input of another, I would expect that either the next task wouldn't be created until the input for it was available, or it could be created but wouldn't be assigned until that input was available...

Andrew, are you comfortable with the answer to question number 1? I thought this might somehow conflict with your idea about direct discovery of Assets.

assembledStarDust commented 5 years ago

Reviewing @terales's model, and considering the use case of a client asking for source files to be translated:

  1. A Job object is created.
  2. A Task object is created.
  3. Inputs are created.
  4. Files are attached to the Inputs.

While some TMS/CMS systems do not have a specific "create job" task, this could be a kind of "mapping" that the underlying TMS/CMS system needs to do to conform to the spec. I'm comfortable with that.

Wrt scalability, I'd like to propose dropping all references to arrays of objects and using only endpoints.

I don't really understand the following endpoints:

/jobs/{Job.id}/inputs (respects REST pagination)
/jobs/{Job.id}/inputs/meta (gets all inputs with internal files only)

These don't seem to conform to @terales's model. Are they shortcuts?

/jobs/{Job.id}/tasks/{Task.id}/meta (omits all references, only metadata)
/jobs/{Job.id}/tasks/{Task.id}

There are a number of endpoints that return only metadata. What would that look like? How would they differ from the non-meta endpoints?

One other point - once a task is finished, how does the next task know which deliverable to pick up from which previous task as an input?

terales commented 5 years ago

@assembledStarDust

Wrt scalability, I'd like to propose dropping all references to arrays of objects and using only endpoints. These don't seem to conform to @terales's model. Are they shortcuts? There are a number of endpoints that return only metadata. What would that look like? How would they differ from the non-meta endpoints?

Hm, seems OK to me. If we drop all references, then to start working on a job a TAPICC vendor will need to make 5 requests:

/jobs/{Job.id}/ (omits all references, only metadata)
/jobs/{Job.id}/tasks/
/jobs/{Job.id}/tasks/{Task.id}/inputs
/jobs/{Job.id}/tasks/{Task.id}/inputs/{inputId}/files
/jobs/{Job.id}/tasks/{Task.id}/inputs/{inputId}/files/{fileId}/downloadfile

It has some overhead for small jobs with a single file, but it looks very maintainable and durable enough to handle different directions of evolution without breaking backward compatibility.
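
For illustration, the 5-request chain for a single-file job could look roughly like this (client-side sketch; response field names such as items are assumptions):

```js
// Inside an async function; walks from a Job down to a downloadable file
const jobId = 42;

const job   = await (await fetch(`/jobs/${jobId}/`)).json();             // 1. job metadata
const tasks = await (await fetch(`/jobs/${jobId}/tasks/`)).json();       // 2. tasks in the job
const taskId = tasks.items[0].id;

const inputs = await (await fetch(`/jobs/${jobId}/tasks/${taskId}/inputs`)).json();   // 3. inputs of the task
const inputId = inputs.items[0].id;

const files = await (await fetch(`/jobs/${jobId}/tasks/${taskId}/inputs/${inputId}/files`)).json();   // 4. files of the input
const fileId = files.items[0].id;

const file = await fetch(`/jobs/${jobId}/tasks/${taskId}/inputs/${inputId}/files/${fileId}/downloadfile`);   // 5. the file itself
```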

Here is an updated data model (updated to be consistent with the draft in Swagger): [Image: updated data model diagram] Source file for the visualization in Visio 2016 format

assembledStarDust commented 5 years ago

@terales looks good to me.


One other point - once a task is finished, how does the next task know which deliverable to pick up from which previous task as an input?

Considering that the inputs to any particular task would be defined by the underlying system.

Edit: actually, this was already asked and answered as question 2 in @Alino's comment on Dec 5, 2018.

Alino commented 5 years ago

Here is a list of things that are not yet clear to me:

  1. if an Input can consist of multiple "files", why don't we just ZIP those files? In that case we don't need the File model, an Input and a Deliverable can each have one file themselves, and we will have fewer API endpoints.
  2. what is the externalId on Input and Deliverable?
  3. should Deliverable also have a name attribute?
  4. how are Inputs directly associated with Deliverables? I guess that when you are uploading a Deliverable, you will need to specify Task.id and Input.id, correct?
  5. should Task have a deliverables attribute? (would show all deliverables)
  6. should Input and Deliverable have an encoding attribute?
  7. should Input and Deliverable have a languageCode attribute?

terales commented 5 years ago

Alex, thanks for the detailed review! I'm happy to see that we are making progress there.

  1. if an Input can consist of multiple "files", why don't we just ZIP those files? In that case we don't need the File model, an Input and a Deliverable can each have one file themselves, and we will have fewer API endpoints.

Agree, changed.

  • what is the externalId on Input and Deliverable?

An ID that can be used by Customer and Vendor systems. Agreed, this one is redundant; name is enough.

  • should Deliverable also have name attribute?

Yep, added.

  • how are Inputs directly associated with Deliverables? I guess that when you are uploading a Deliverable, you will need to specify Task.id and Input.id, correct?

Yes, correct. Changed it to a "one-to-many" relation.
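
A tiny sketch of what that could look like when uploading a Deliverable (the endpoint and field names besides Task.id and Input.id are assumptions):

```js
// Hypothetical payload for creating a Deliverable under a Task,
// pointing back at the Input it was derived from (one Input -> many Deliverables)
await fetch('/jobs/42/tasks/7/deliverables', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    name: 'chapter-01.de-DE.xlf',
    taskId: 7,    // the Task that produced this Deliverable
    inputId: 13   // the Input it is a direct transform of
  })
});
```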

  • should Task have deliverables attribute? (would show all deliverables)

No; by convention, these kinds of lists are retrievable only via a REST endpoint. https://github.com/GALAglobal/TAPICC-API-implementation/issues/58#issuecomment-458406387

  • should Input and Deliverable have encoding attribute?

The ZIP-with-included-files format that you've suggested should handle it; I added a note about encoding filenames in UTF-8 inside the ZIP.

  • should Input and Deliverable have languageCode attribute?

There could be one or many languages inside. Let's keep language info inside the Task (I added a sourceLang attribute there).


Updated model: [Image: updated data model diagram] Source file for the visualization in Visio 2016 format

Alino commented 5 years ago

Thank you, Sasha, for your answers. I am trying to implement this model in the codebase. During that process, a few more questions have arisen:

  1. should an Input have deliverables attribute? (would contain ids of associated Deliverables)
  2. is this a complete list of possible fileType values for an Input? ['bitext', 'source', 'instructions', 'glossary', 'TM', 'reference', 'other']
  3. does Deliverable reuse the exact same list of values for fileType as an Input does? (on Jim's screenshot there is also report; should we include report for Input as well?)
  4. should Input's fileType default to 'source' or be required?
  5. should Deliverable's fileType be required or use any default value? (I think it should be required)
  6. why would we support downloadLink if we are already about to support the endpoints GET /inputs/{inputId}/file and GET /deliverable/{deliverableId}/file? Isn't downloadLink unnecessarily redundant? Or would the value of downloadLink be that API endpoint to get the file?
  7. should Task have deliverables attribute? (array of ids) (a similar question was asked before, but this one would populate the array only with ids, not with data)
  8. shouldn't a Task have inputs attribute? (array of ids)
  9. shouldn't an Input have tasks attribute? (array of ids) (this would be a required attribute when creating an Input, because we can't create an orphan Input without a Task, or can we?)

terales commented 5 years ago

I'm answering with reference to the sequence diagram at the end of the "TAPICC system to system integrations" section.

should an Input have deliverables attribute? (would contain ids of associated Deliverables)

No, we're trying to go with a consistent approach of having all lists as REST endpoints to reuse paging from there. Here is a base comment about it: https://github.com/GALAglobal/TAPICC-API-implementation/issues/58#issuecomment-458406387

is this a complete list of possible fileType values for an Input? ['bitext', 'source', 'instructions', 'glossary', 'TM', 'reference', 'other']

This is not clear to me: if we deliver all files as a ZIP archive, then all the file types could easily be mixed together.

does Deliverable reuse the exact same list of values for fileType as an Input does? (on Jim's screenshot there is also report; should we include report for Input as well?)

(same as no. 2) This is not clear to me: if we deliver all files as a ZIP archive, then all the file types could easily be mixed together.

should Input's fileType default to 'source' or be required?

(same as no. 2) This is not clear to me: if we deliver all files as a ZIP archive, then all the file types could easily be mixed together.

should Deliverable's fileType be required or use any default value? (I think it should be required)

(same as no. 2) This is not clear to me: if we deliver all files as a ZIP archive, then all the file types could easily be mixed together.

why would we support downloadLink if we are already about to support the endpoints GET /inputs/{inputId}/file and GET /deliverable/{deliverableId}/file? Isn't downloadLink unnecessarily redundant? Or would the value of downloadLink be that API endpoint to get the file?

I would like to download all the input necessary for a task within one request, so GET /tasks/{taskId}/downloadInputs and GET /tasks/{taskId}/downloadDeliverables seem natural to me.

should Task have deliverables attribute? (array of ids) (a similar question was asked before, but this one would populate the array only with ids, not with data)

(same as no. 1) No, we're trying to go with a consistent approach of having all lists as REST endpoints to reuse paging from there. Here is a base comment about it: https://github.com/GALAglobal/TAPICC-API-implementation/issues/58#issuecomment-458406387

shouldn't a Task have inputs attribute? (array of ids)

(same as no. 1) No, we're trying to go with a consistent approach of having all lists as REST endpoints to reuse paging from there. Here is a base comment about it: https://github.com/GALAglobal/TAPICC-API-implementation/issues/58#issuecomment-458406387

shouldn't an Input have tasks attribute? (array of ids) (this would be a required attribute when creating an Input, because we can't create an orphan Input without a Task, or can we?)

(same as no. 1) No, we're trying to go with a consistent approach of having all lists as REST endpoints to reuse paging from there. Here is a base comment about it: https://github.com/GALAglobal/TAPICC-API-implementation/issues/58#issuecomment-458406387


So there are two main questions:

How do we handle input downloads while attributing an input type to particular items (files or folders)?

If we allow requests like GET /tasks/{taskId}/downloadInputs and GET /tasks/{taskId}/downloadDeliverables, then these ZIP archives should have folders inside based on input types.

Also, let's just drop the other option, because if you have something else you can always put it into the instructions or reference category.

Also, let's add a report category and have the same list for Inputs and Deliverables.

Should we have the IDs of linked entities inline in responses, or always use a separate endpoint?

Knowing Andrew Gibbons's concern about an overflow of links on the one side, and ease of use and understanding on the other side, I see this option:

Introduce a parameter to disable arrays of IDs for linked resources in responses while still having the separate endpoints, so concerned developers could implement an integration without receiving arrays of IDs at all, and developers with a higher risk tolerance could work with the arrays. Our API would support both, with the higher-risk-tolerance behavior as the default.
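
To illustrate (the parameter name and response shapes are purely hypothetical):

```js
// Default behaviour: arrays of linked IDs are inlined
// GET /jobs/42/tasks/7
const defaultResponse = {
  id: 7,
  status: 'pending',
  inputs: ['in-0001', 'in-0002'],   // linked Input IDs
  deliverables: []                  // linked Deliverable IDs
};

// Opt-out for integrators worried about very large arrays
// GET /jobs/42/tasks/7?linkedIds=false   (hypothetical parameter)
const minimalResponse = {
  id: 7,
  status: 'pending'
  // inputs and deliverables would be fetched from their own endpoints instead
};
```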

Alino commented 5 years ago

  1. is this a complete list of possible fileType values for an Input? ['bitext', 'source', 'instructions', 'glossary', 'TM', 'reference', 'other']

This is not clear to me: if we deliver all files as a ZIP archive, then all the file types could easily be mixed together.

Perhaps we can use these fileTypes and add "zip" as one of the possible values if the content is mixed.

I would like to download all the input necessary for a task within one request, so GET /tasks/{taskId}/downloadInputs and GET /tasks/{taskId}/downloadDeliverables seem natural to me.

I think each HTTP call can download only one file, so I would like to raise awareness here that the TAPICC server would have to ZIP all Inputs or Deliverables associated with that Task and use that archive as the response to this request.
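
A rough sketch of what such a handler could look like in Node.js, assuming the archiver package and a hypothetical Input model; the folder names follow the per-fileType layout suggested above:

```js
const archiver = require('archiver');

// Hypothetical action for GET /tasks/{taskId}/downloadInputs
async function downloadInputs(req, res) {
  // Hypothetical lookup of all Inputs (with their file contents) for this Task
  const inputs = await Input.find({ task: req.params.taskId });

  res.attachment(`task-${req.params.taskId}-inputs.zip`);

  const archive = archiver('zip', { zlib: { level: 9 } });
  archive.pipe(res);   // stream the ZIP straight into the HTTP response

  for (const input of inputs) {
    // one folder per input type, e.g. source/, reference/, glossary/
    archive.append(input.fileContent, { name: `${input.fileType}/${input.name}` });
  }

  await archive.finalize();   // closes the archive, which ends the response
}
```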

Also, let's just drop the other option

I agree

Also, let's add a report category and have the same list for Inputs and Deliverables.

I agree

Should we have the IDs of linked entities inline in responses, or always use a separate endpoint?

My opinion is that we should display the IDs of linked entities by default, because that makes it possible to see the relationships between the objects. We can omit any data attribute using query parameters. So if implementers are worried about huge numbers of IDs inside arrays (in my opinion that is unlikely), they can easily omit them, for example: GET /inputs/42/?omit=deliverables
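
As a sketch of how a server could honor that parameter (a generic handler, not tied to any particular framework's built-in support for omit):

```js
// Hypothetical handler for GET /inputs/:id that strips attributes listed in ?omit=
async function findInput(req, res) {
  const input = await Input.findOne({ id: req.params.id });   // hypothetical model lookup
  if (!input) return res.status(404).end();

  // ?omit=deliverables,tasks -> ['deliverables', 'tasks']
  const omitted = (req.query.omit || '').split(',').filter(Boolean);

  const body = { ...input };
  for (const attribute of omitted) {
    delete body[attribute];
  }

  return res.json(body);
}
```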

Alino commented 5 years ago

I have updated the Swagger definition, trying to match this model. Can you please review it and let me know what else needs to be changed? Here is the diff: https://github.com/GALAglobal/TAPICC-API-implementation/commit/0e61cc44da450119cbe2746a3bcc20d0ace1295b and here is the Swagger online: https://app.swaggerhub.com/apis/Alino/tapicc-api/0.0.7