Closed jcompton-moravia closed 5 years ago
I love your idea, here is my attempt to create a more precise data model based on your idea (webhooks and all other data types are omitted for now):
Source file for the visualization in Visio 2016 format
Let's put a 65,535-item restriction on the size of identifier arrays. In the edge case of ids that use 4-byte UTF-8 characters for all 1,024 characters, it would take up to 270 MB of data transfer, or 540 MB if two reference arrays were filled to the maximum, but in 99.9% of cases it would be much less.
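A quick back-of-envelope check of that estimate (a sketch; the 65,535-item and 1,024-character limits are the figures from this proposal, and MB here means 10^6 bytes):

```python
# Worst case from the proposal: 65,535 array items, each id 1,024
# characters long, 4 bytes per character (longest UTF-8 encoding).
MAX_ITEMS = 65_535
MAX_ID_CHARS = 1_024
BYTES_PER_CHAR = 4

one_array_bytes = MAX_ITEMS * MAX_ID_CHARS * BYTES_PER_CHAR
print(one_array_bytes / 1_000_000)      # ~268 MB for one full reference array
print(2 * one_array_bytes / 1_000_000)  # ~537 MB if two arrays are full
```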
Also, we would introduce endpoints for quick polling, which would exclude all arrays, as well as special ones that would supply these arrays only with REST-based pagination:
/jobs/{Job.id}/meta (omits all references, only metadata)
/jobs/{Job.id}/inputs (respects REST pagination)
/jobs/{Job.id}/inputs/meta (gets all inputs with internal files only)
/jobs/{Job.id}/tasks/{Task.id}
/jobs/{Job.id}/tasks/{Task.id}/meta (omits all references, only metadata)
/jobs/{Job.id}/tasks/{Task.id}/inputs (respects REST pagination)
/jobs/{Job.id}/tasks/{Task.id}/inputs/meta (gets all inputs with internal files only)
/jobs/{Job.id}/tasks/{Task.id}/deliverables (respects REST pagination)
/jobs/{Job.id}/tasks/{Task.id}/deliverables/meta (gets all deliverables with internal files only)
…
What if we have only instructions or links as a Task's input or deliverable?
We can just pass it as a text file, and the underlying HTTP protocol would handle all the encoding, compression, and type-detection work.
Or even just a date as a Task deliverable?
Then we can change the Task status and set deliveredDate to the date we need.
Quick point: "Job" remains the loose binding that can associate Tasks together if they're related, for example, some translation activity that needs to be performed in multiple languages. A Task can be thought of as a discrete unit of work, executed by a single entity, as far as the Task creator is concerned. In a real-world workflow that Task can be broken up into child tasks and distributed to different executors by any system that can process it.
TAPICC API doesn't care: one TAPICC Job is one transaction between two parties. If a Customer has a job inside their TMS that has TaskA, TaskB, and TaskC for three different vendors, they would send three separate TAPICC Jobs, one per vendor, where each TAPICC Job would contain only those tasks that are available to the appropriate Vendor. The TAPICC API doesn't know anything about, or deal in any way with, the internal structure of a job inside their TMS.
Likewise, the Customer doesn't learn anything through the TAPICC API about the internal processing of the assigned tasks performed by the Vendor. For example, a Translation task type is usually one task for the Customer and a whole workflow for the Vendor.
Jim, does it make sense?
I had prepared a few questions during the meeting, and then asked Jim for answers:
Deliverables are not created when the Task is created. They are created by the executor of the Task.
I suppose that we could have some method that deliberately creates an Input from a Deliverable, but I also think that we could just make this the responsibility of the workflow system. That is, it's TAPICC's job to retrieve that file, but it is the backend system's responsibility to version-control it. What are your thoughts about turning outputs into inputs for different Tasks? Should that be a function of TAPICC?
Well, both make sense; it's hard for me to guess what is more convenient for the implementers or users.
- What if the user needs to create a chain of Tasks that are sequential? Does he have to wait until the Deliverables are finished before he can create the next Tasks? When there's a chain of tasks that are sequential, where that sequence is managed by a separate workflow system, and where the output of one task is required as the input of another, I would expect that either the next task wouldn't be created until the input for it was available, or it could be created but wouldn't be assigned until that input was available...
Andrew, are you comfortable with the answer to question number 1? I thought this might somehow conflict with your idea about direct discovery of Assets.
Reviewing @terales model, and considering a use case of a client asking for source files to be translated.
While some TMS/CMS systems do not have a specific "create job" task, this could be a kind of "mapping" that the underlying TMS/CMS system needs to do to conform to the spec. I'm comfortable with that.
wrt scalability, I'd like to propose to drop all references to arrays of objects, and use only endpoints.
I'm not really understanding the following endpoints
/jobs/{Job.id}/inputs (respects REST pagination)
/jobs/{Job.id}/inputs/meta (gets all inputs with internal files only)
As these don't seem to conform to @terales model. Are they shortcuts?
/jobs/{Job.id}/tasks/{Task.id}/meta (omits all references, only metadata)
/jobs/{Job.id}/tasks/{Task.id}
There are a number of endpoints that return only metadata. What would that look like? How would they be different from the non-meta endpoints?
One other point - once a task is finished, how does the next task know which deliverable to pick up from which previous task as an input?
@assembledStarDust
wrt scalability, I'd like to propose to drop all references to arrays of objects, and use only endpoints. As these don't seem to conform to @terales model. Are they shortcuts? There are a number of endpoints that return only metadata. What would that look like? How would they be different from the non-meta endpoints?
Hm, seems OK to me. If we drop all references, then to start working on a job a TAPICC vendor will need to make 5 requests:
/jobs/{Job.id}/ (omits all references, only metadata)
/jobs/{Job.id}/tasks/
/jobs/{Job.id}/tasks/{Task.id}/inputs
/jobs/{Job.id}/tasks/{Task.id}/inputs/{inputId}/files
/jobs/{Job.id}/tasks/{Task.id}/inputs/{inputId}/files/{fileId}/downloadfile
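A minimal sketch of that five-request sequence. `fetch` is any callable that GETs a path and returns parsed JSON (swap in a real HTTP client in practice); the record shapes below are illustrative, not from the spec:

```python
def collect_job_files(fetch, job_id):
    """Walk job -> tasks -> inputs -> files -> download in five request types."""
    job = fetch(f"/jobs/{job_id}/")  # metadata only, no reference arrays
    files = []
    for task in fetch(f"/jobs/{job_id}/tasks/"):
        base = f"/jobs/{job_id}/tasks/{task['id']}"
        for inp in fetch(f"{base}/inputs"):
            for f in fetch(f"{base}/inputs/{inp['id']}/files"):
                files.append(
                    fetch(f"{base}/inputs/{inp['id']}/files/{f['id']}/downloadfile")
                )
    return job, files

# Stub transport returning canned responses, standing in for a real HTTP client.
canned = {
    "/jobs/1/": {"id": 1, "status": "created"},
    "/jobs/1/tasks/": [{"id": 10}],
    "/jobs/1/tasks/10/inputs": [{"id": 100}],
    "/jobs/1/tasks/10/inputs/100/files": [{"id": 1000}],
    "/jobs/1/tasks/10/inputs/100/files/1000/downloadfile": b"<xliff/>",
}
job, files = collect_job_files(canned.get, 1)
print(job["status"], len(files))  # created 1
```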
It has some overhead for small jobs with a single file, but it looks very maintainable and robust enough to handle different directions of evolution without breaking backward compatibility.
Here is an updated data model (updated to be consistent with draft in swagger): Source file for the visualization in Visio 2016 format
@terales looks good to me.
One other point - once a task is finished, how does the next task know which deliverable to pick up from which previous task as an input?
The inputs to any particular task would be defined by the underlying system.
edit: actually already asked and answered as question 2
@Alino commented on Dec 5, 2018
Here is a list of things that are not yet clear to me:
Alex, thanks for the detailed review! I'm happy to see that we are making progress there.
- if the Input is a type with multiple "files", then why don't we just ZIP those files? In that case we don't need the File model, and the Input and Deliverable can each have one file themselves. And we will have fewer API endpoints.
Agree, changed.
- what is the externalId on Input and Deliverable?
An ID that can be used by the Customer and Vendor systems. Agree, this one is redundant; name is enough.
- should Deliverable also have name attribute?
Yep, added.
- how are Inputs directly associated with Deliverables? I guess when you are uploading the Deliverable, you will need to specify Task.id and Input.id, correct?
Yes, correct. Changed to a "one-to-many" relation.
- should Task have deliverables attribute? (would show all deliverables)
No, by convention lists of this kind are retrievable only via a REST endpoint. https://github.com/GALAglobal/TAPICC-API-implementation/issues/58#issuecomment-458406387
- should Input and Deliverable have encoding attribute?
The ZIP packaging and included file formats that you've suggested should handle it; added a note about encoding filenames as UTF-8 in the ZIP.
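As a sanity check on the UTF-8 filename note: Python's zipfile module, for example, stores non-ASCII member names as UTF-8 and sets the language-encoding flag, so they round-trip intact. A small sketch:

```python
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    # Non-ASCII member name: CPython encodes it as UTF-8 and sets the
    # "language encoding" flag (bit 11) on the ZIP entry.
    zf.writestr("инструкции.txt", b"translate carefully")

names = zipfile.ZipFile(io.BytesIO(buf.getvalue())).namelist()
print(names)  # ['инструкции.txt']
```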
- should Input and Deliverable have languageCode attribute?
There could be one or many languages inside. Let's keep the language info inside the Task (added a sourceLang attribute there).
Updated model: Source file for the visualization in Visio 2016 format
Thank you Sasha for your answers. I am trying to implement this model in the codebase. During that process a few more questions have arisen:
I'm answering with reference to the sequence diagram at the end of the "TAPICC system to system integrations" section.
should an Input have deliverables attribute? (would contain ids of associated Deliverables)
No, we're trying to go with a consistent approach of having all lists as REST endpoints to reuse paging from there. Here is a base comment about it: https://github.com/GALAglobal/TAPICC-API-implementation/issues/58#issuecomment-458406387
is this a complete list of possible fileType values for an Input?
['bitext', 'source', 'instructions', 'glossary', 'TM', 'reference', 'other']
This is not clear to me: if we deliver all files as a ZIP archive, then all file types could easily be mixed together.
does Deliverable reuse the exact same list of values for fileType as an Input does? (on Jim's screenshot there is also report; should we include report also for Input?)
(same as no. 2) This is not clear to me: if we deliver all files as a ZIP archive, then all file types could easily be mixed together.
should Input's fileType default to 'source' or be required?
(same as no. 2) This is not clear to me: if we deliver all files as a ZIP archive, then all file types could easily be mixed together.
should Deliverable's fileType be required or use any default value? (I think it should be required)
(same as no. 2) This is not clear to me: if we deliver all files as a ZIP archive, then all file types could easily be mixed together.
why would we support downloadLink if we are already about to support the endpoints GET /inputs/{inputId}/file and GET /deliverable/{deliverableId}/file? Isn't downloadLink unnecessarily redundant? Or would the value of downloadLink be that API endpoint to get the file?
I would like to download all inputs necessary for a task within one request, so GET /tasks/{taskId}/downloadInputs and GET /tasks/{taskId}/downloadDeliverables seem natural to me.
should Task have a deliverables attribute? (array of ids) (a similar question was asked before, but this one would not populate the array with data, only ids)
(same as no. 1) No, we're trying to go with a consistent approach of having all lists as REST endpoints to reuse paging from there. Here is a base comment about it: https://github.com/GALAglobal/TAPICC-API-implementation/issues/58#issuecomment-458406387
shouldn't a Task have an inputs attribute? (array of ids)
(same as no. 1) No, we're trying to go with a consistent approach of having all lists as REST endpoints to reuse paging from there. Here is a base comment about it: https://github.com/GALAglobal/TAPICC-API-implementation/issues/58#issuecomment-458406387
shouldn't an Input have a tasks attribute? (array of ids) (this would be a required attribute when creating an Input, because we can't create an orphan Input without a Task, or can we?)
(same as no. 1) No, we're trying to go with a consistent approach of having all lists as REST endpoints to reuse paging from there. Here is a base comment about it: https://github.com/GALAglobal/TAPICC-API-implementation/issues/58#issuecomment-458406387
So there are two main questions:
If we allow requests like GET /tasks/{taskId}/downloadInputs and GET /tasks/{taskId}/downloadDeliverables, then these ZIP archives should have folders inside based on input types.
Also, let's just drop the other option, because if you have something else you can always put it into the instructions or reference category.
Also, let's add a report category and have the same list for Inputs and Deliverables.
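For illustration, the list as it would stand after these two changes ('other' dropped, 'report' added, shared by Inputs and Deliverables) could be enforced with a tiny validator. The function name is hypothetical, not from the swagger:

```python
# Hypothetical shared enum after dropping 'other' and adding 'report'.
FILE_TYPES = {"bitext", "source", "instructions", "glossary",
              "TM", "reference", "report"}

def validate_file_type(value):
    """Reject fileType values outside the agreed shared list."""
    if value not in FILE_TYPES:
        raise ValueError(f"unknown fileType {value!r}")
    return value

print(validate_file_type("report"))  # report
```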
Knowing Andrew Gibbons' concern about an overflow of links on the one side, and ease of use and understanding on the other, I see this option:
Introduce a parameter to disable arrays of ids for linked resources in responses while keeping the separate endpoints, so concerned developers could implement the integration without receiving arrays of IDs at all, and developers with higher risk tolerance could work with the arrays. Our API would support both and default to the higher risk tolerance.
- is this a complete list of possible fileType values for an Input?
['bitext', 'source', 'instructions', 'glossary', 'TM', 'reference', 'other']
This is not clear to me: if we deliver all files as a ZIP archive, then all file types could easily be mixed together.
Perhaps we can use these file types and add a "zip" fileType as one of the possible values for when the content is mixed.
I would like to download all inputs necessary for a task within one request, so GET /tasks/{taskId}/downloadInputs and GET /tasks/{taskId}/downloadDeliverables seem natural to me.
I think each HTTP call can download only one file, so I would like to raise awareness here that the TAPICC server would have to ZIP all the Inputs or Deliverables associated with that Task and use that archive in the response to this request.
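A sketch of what that server-side bundling could look like, combining it with the earlier suggestion of one folder per input type. The record fields are illustrative, not from the spec:

```python
import io
import zipfile

def zip_task_inputs(inputs):
    """Bundle all Inputs of a Task into one ZIP, one folder per fileType."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for item in inputs:
            zf.writestr(f"{item['fileType']}/{item['name']}", item["content"])
    return buf.getvalue()

# Illustrative input records for one Task.
task_inputs = [
    {"fileType": "source", "name": "chapter1.xlf", "content": b"<xliff/>"},
    {"fileType": "glossary", "name": "terms.tbx", "content": b"<tbx/>"},
]
archive = zip_task_inputs(task_inputs)
names = zipfile.ZipFile(io.BytesIO(archive)).namelist()
print(names)  # ['source/chapter1.xlf', 'glossary/terms.tbx']
```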
Also, let's just drop the other option
I agree
Also, let's add report category and have the same list for Inputs and deliverables.
I agree
Should we have IDs of linked entities inline in responses, or always use a separate endpoint?
My opinion is that we should display the ids of linked entities by default, because it makes the relationships between the objects visible.
We can omit any data attribute using Query Parameters.
So if the implementers are worried about crazy amounts of ids inside arrays (in my opinion it is unlikely), they can easily omit them.
for example:
GET /inputs/42/?omit=deliverables
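A minimal sketch of how a server could honor that omit parameter when serializing a resource (the field names here are illustrative, not from the swagger):

```python
def apply_omit(resource, omit_param):
    """Drop the attributes listed in the ?omit= query parameter."""
    omitted = set(omit_param.split(",")) if omit_param else set()
    return {k: v for k, v in resource.items() if k not in omitted}

# Illustrative serialized Input resource.
input_42 = {"id": 42, "name": "chapter1.xlf", "deliverables": [7, 8, 9]}
print(apply_omit(input_42, "deliverables"))  # {'id': 42, 'name': 'chapter1.xlf'}
```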
I have updated the swagger, trying to match this model. Can you please review it and let me know what else needs to be changed? Here is the diff: https://github.com/GALAglobal/TAPICC-API-implementation/commit/0e61cc44da450119cbe2746a3bcc20d0ace1295b and here is the swagger online: https://app.swaggerhub.com/apis/Alino/tapicc-api/0.0.7
During our WG#4 meeting on December 5th, we discussed an alteration to the data model based on the idea of Inputs and Outputs that are connected via a Task. We're calling the outputs created by a Task "Deliverables" because that's by definition what a deliverable is. Inputs can be instructions, reference, glossary, TMs, or "files" that need to be transformed (includes JSON-serialized XLIFF, of course).
A Task can have any number of Inputs (maybe even zero inputs if the "directive" can be defined by the metadata of a Task?), and can produce any number of Deliverables. Depending on the Task Type, those deliverables may or may not be a direct transform of an input. In the case where a Deliverable has a direct, transformative relationship with an Input file (e.g. "Charmander" turns into "Charmeleon"), that Deliverable is associated with an Input ID. Otherwise, the Deliverable is always associated with what inputs were used to create it because it's a child of a Task (that's associated with Inputs).
The idea of de-coupling inputs from tasks, that is, not dictating a strict parent/child taxonomy between tasks and files (or vice versa), is designed to make it easy for any CMS, TMS, or other system that endeavors to be part of a TAPICC ecosystem to "map" its worldview to this data model. It is in direct response to feedback that previous data models were too rigid and in conflict with the data models of systems that TAPICC needs to connect to.
Please see a sketch of the data model below, and share your questions, concerns, and ideas. Thx!