Closed assembledStarDust closed 5 years ago
problem: I feel uncomfortable with a task holding potentially unlimited references to assets. proposed solution: don't have asset references in the task. Instead, allow direct discovery of assets and iteration down to the leaf assets. That way REST pagination and limits can manage a long list.
I am struggling to visualise how this should work, but it might be a very good idea. Could you please create a visualisation for this, something like this, with the data models and their relationships? https://dbeaver.jkiss.org/product/dbeaver-ss-dark.png
I hope to improve this over time with feedback.
latest model.
Questions:
- Is `language` in `task` the source or the target language?
- For `language` in `task`, does [0..1] mean that `language` can be left unset?
- Is `language` in `leafDocument` the source, the target, or either?
- What is `EEnumerator`?

updated graphic for:
[0..1] is 0 or 1 multiplicity.
EEnumerator is for Enumerate. I haven't figured out how to set Enumerate types yet using the Papyrus UML editor. In the meantime, the concept status ideas are:
job.status: { Inactive, Quote, Ready, In-progress, Complete }
task.status: { Inactive, Quote, Ready, In-progress, Complete }
Asset.type: { Placeholder, Source, Reference }
Leaf.type: { Placeholder, bilingual, target }
question:
What is the difference between leafDocument and Asset child, is it the same thing?
Perhaps we should focus here on what would be the expected response from the API, If the client makes a request to get all Assets.
Here is a scenario based on my understanding of this proposal:
Given that there is 1 Source Asset in the database
And there is 1 Reference Asset in the database
And there are 2 child Assets in the database
When the client makes a request to localhost:1337/assets
Then the response from the server is
{
assets: [
{
uid: '1',
name: 'my first Source Asset',
type: 'source',
status: 'prep',
assets: [
{
uid: '3',
name: 'my first child Asset',
type: 'Final',
language: 'sk',
status: 'ready'
},
{
uid: '4',
name: 'my second child Asset',
type: 'WG3',
language: 'cz',
status: 'prep'
}
]
},
{
uid: '2',
name: 'my first Reference Asset',
type: 'Reference',
status: 'ready'
}
]
}
Please let me know if my understanding of nesting these Assets is correct, or please correct my JSON response if it is wrong.
What is the difference between leafDocument and Asset child, is it the same thing?
yes, same thing.
An asset can have many children. A leaf can only have one parent and one (optional) child. It's anticipated that a branch is a file/language combination.
yes, the JSON looks good. I see you've left out the optional date field, which is fine.
edit: let's add further children to leaf uid 4 and demonstrate the JSON. I've needed to add two new fields to cater for the next child and leaf uids, and another convenience field to denote the leaf. These will help discovery of the leaf.
{
assets: [
{
uid: '1',
name: 'my first Source Asset',
type: 'source',
assets: [
{
uid: '3',
name: 'my first child Asset',
type: 'Final',
language: 'sk',
isLeaf: true
},
{
uid: '4',
name: 'my second child Asset',
type: 'WG3',
language: 'cz',
childUid: '5',
leafUid: '6',
isLeaf: false
}
]
},
{
uid: '2',
name: 'my first Reference Asset',
type: 'Reference',
}
]
}
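The child/leaf discovery above can be sketched in TypeScript. The field names (`uid`, `childUid`, `isLeaf`, nested `assets`) follow the example JSON and are proposal-level assumptions, not a finalized schema:

```typescript
// Sketch of leaf discovery over the nested asset shape shown above.
// The field names are assumptions taken from this proposal's example JSON.
type AssetNode = {
  uid: string;
  name?: string;
  isLeaf?: boolean;
  childUid?: string;    // uid of the next node toward the leaf, if any
  assets?: AssetNode[]; // children of a (root) asset
};

// Flatten the tree so any node can be looked up by uid.
function index(
  nodes: AssetNode[],
  into: Map<string, AssetNode> = new Map()
): Map<string, AssetNode> {
  for (const n of nodes) {
    into.set(n.uid, n);
    if (n.assets) index(n.assets, into);
  }
  return into;
}

// Follow childUid links from a starting node until a leaf is reached.
function findLeaf(
  startUid: string,
  byUid: Map<string, AssetNode>
): AssetNode | undefined {
  let node = byUid.get(startUid);
  while (node && !node.isLeaf && node.childUid) {
    node = byUid.get(node.childUid);
  }
  return node;
}
```

A client could then resolve, say, asset uid 4 down to its leaf without the server having to embed the whole chain in every response.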
leaf.status. The idea behind this was to denote when a file was being worked on by either the API or the underlying system: the status would be prep when the file was not ready for download, and ready when it was. However, I'm wondering if that's superfluous, as the same functionality could be handled by simply not showing the leaf at all until it is ready.
task.putLeaf() function. There was a comment on the last call to the effect that putting a completed WG3 package back into the right branch may be a problem. I agree that this could be problematic: any system that accepts user input must account for bad input. The key check would be the uid embedded in the file.
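The uid "key check" could look something like this sketch; the `tapicc-uid:` marker embedded in the file content is a purely hypothetical embedding scheme, used only for illustration:

```typescript
// Sketch of the uid "key check" on upload: reject an uploaded file unless the
// uid embedded in it matches the leaf being updated. The "tapicc-uid:" marker
// is an illustrative assumption, not a defined TAPICC mechanism.
function extractEmbeddedUid(content: string): string | undefined {
  const m = content.match(/tapicc-uid:\s*(\S+)/);
  return m ? m[1] : undefined;
}

// A putLeaf handler would run this before accepting the file.
function acceptUpload(expectedLeafUid: string, content: string): boolean {
  return extractEmbeddedUid(content) === expectedLeafUid;
}
```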
security. The default is open. Some implementers may prefer, for example, that users not have access to assets or branches other than their own language. In that case, implement OAuth scopes.
If we go back to the original model, which is currently in swagger, the same scenario as I have described above would have this JSON response:
{
assets: [
{
id: '1',
jobId: '1',
name: 'my first Source Asset',
sourceLanguage: 'en',
encoding: 'utf8',
fileOriginalName: 'fileToTranslate.txt',
tasks: [
{
id: '1',
type: 'translation',
progress: 'finished',
targetLanguage: 'sk',
assignedTo: 'symfonie.com/1223214',
assetId: 1
},
{
id: '2',
type: 'translation',
progress: 'pending',
targetLanguage: 'cz',
assignedTo: 'symfonie.com/1223214',
assetId: 1
}
]
},
{
id: '2',
isReference: true,
sourceLanguage: 'en',
encoding: 'utf8',
fileOriginalName: 'referenceText.pdf'
}
]
}
This old model is, for me personally, easier to reason about. Tasks here contain the same data as leafDocument. Isn't the same thing achieved already in the old model? Probably I am missing the problem here, or the meaning of "direct discovery of Assets"? Could you please explain the problem we would get into using this old model?
The root of the problem is that we are attempting to map an API to many different TMS/CMS systems that are available. The graphic illustrates a couple of different TMS configurations I've seen. The configuration on the left easily maps to the old data model, but the configuration on the right does not.
What I'm suggesting by "discovery" is that the model we choose has enough in it to cover both example TMS configurations and any user of the API can "discover" the underlying configuration, whatever it is.
Perhaps I am missing the expected result of "mapping an API to many different TMS/CMS". Is the expected result a transformation/replication of the data from the TAPICC API into the TMS/CMS system?
If yes, then I don't see a problem with our old model, because the TMS/CMS can accept an Asset and a Task which are associated together.
The problem can occur if we were doing the opposite: if we want to transform/replicate data from the TMS/CMS to the TAPICC server. In that case, the implementer would need to write a map function which splits a Task that has multiple Assets into as many Tasks as there are Assets associated with it, so that it can map to the TAPICC data model.
It's not really hard to create such a map function, only a few lines of code.
Indeed, writing a mapping function can be done. But it goes back to my original point, that I fear such a configuration that does not easily map to the underlying system will impede adoption by the industry.
But is that really an obstacle for the implementers? This mapping requirement could be documented. Here is an extremely simple example of how this mapping could be done in javascript, with just a single line of code (the last one).
// Three assets, and one task that references all of them.
const assets = [{ id: 1 }, { id: 2 }, { id: 3 }]
const task = { id: 1, assetIds: [1, 2, 3] }

// Split the single multi-asset task into one task per referenced asset.
const split_tasks = task.assetIds.map((assetId) => ({ assetId }))
https://jsfiddle.net/jy38kzr1/
I would rather have a data model which is easier to reason about, at the cost of having the implementer map one task to many tasks, than increase the complexity of the data model so that the implementer can map tasks more easily.
But this is just my opinion, let's hear from others as well
Search your email for another dissenting voice, subject heading on the thread as:
TAPICC WG #4 - question from Jost
Unfortunately I cannot find anything in that email thread related to this topic. Maybe I am missing some email in the thread. Could you please quote it here, also for others who don't have that email?
Thank you
I have obtained an ok from the author to post this snippet from email.
Hi,
I joined the TAPICC initiative hoping to develop a standardized connector for desktop CAT tools and web-based TMS. Unfortunately, in the last WG4 meeting it was clear to me that the group's goal does not contemplate connecting a CAT tool with a TMS.
It seems that the API is being designed to be used only by a CMS and a TMS that belong to the same company, not as an API that would let systems from different vendors interconnect.
In the initial meetings someone mentioned that the goal was to have something similar to or better than COTI, but with a friendlier legal framework that would be more open. The impression I got in the last WG4 meeting was that the original goal was lost.
I hope to be wrong, and that in the future there will be an open TAPICC API that lets different kinds of systems interact in a standardized way.
Regards,
Rodolfo
some ideas for endpoints
edit to add/change some endpoints
GET /job returns a list of jobs
GET /job/<jobUid>/task returns a list of tasks
GET /job/<jobUid>/task/<taskUid> returns a task
GET /job/<jobUid>/asset returns a list of assets
GET /job/<jobUid>/asset/<assetUid> returns a list of child assets
GET /job/<jobUid>/asset/<assetUid>/download download asset
GET /job/<jobUid>/asset/leaf returns a list of leaf assets
GET /job/<jobUid>/asset/<assetUid>/leaf returns a list of leaf assets for that assetUid
GET /job/<jobUid>/asset/<assetUid>/leaf/<leafUid> returns a leaf asset.
GET /job/<jobUid>/asset/leaf/<leafUid>/download downloads a leaf asset.
GET /job/<jobUid>/task/<taskGroupUid>/leaf returns a list of leaf assets for that task group
PUT /job/<jobUid>/asset/leaf/<leafUid>/upload uploads a leaf asset. This file then becomes the leaf.
PUT /job/<jobUid>/asset/leaf/<leafUid>/uploadTarget uploads a target asset. This file then becomes the leaf.
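As a sketch of how a client might consume these list endpoints with the REST pagination the proposal mentions: the `offset`/`limit` query parameters and the `{ items, total }` page shape below are assumptions for illustration, not part of the endpoint list above.

```typescript
// Sketch of paging through a list endpoint such as
// GET /job/<jobUid>/asset/leaf. The offset/limit parameters and the
// { items, total } response shape are illustrative assumptions only.
type Page<T> = { items: T[]; total: number };
type FetchPage<T> = (offset: number, limit: number) => Promise<Page<T>>;

// Collect every item by requesting fixed-size pages until the total is reached.
async function fetchAll<T>(fetchPage: FetchPage<T>, limit = 100): Promise<T[]> {
  const all: T[] = [];
  let offset = 0;
  for (;;) {
    const page = await fetchPage(offset, limit);
    all.push(...page.items);
    offset += page.items.length;
    if (offset >= page.total || page.items.length === 0) break;
  }
  return all;
}
```

A real client might wrap something like fetch(`/job/${jobUid}/asset/leaf?offset=${offset}&limit=${limit}`) as the `fetchPage` callback; the query parameter names are an assumption.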
Nice list!
I feel uncomfortable with the terms `asset` and `leaf` -- it's not clear what they mean to me. Could we use the usual language there (`packages` and `files`)?
GET /job returns a list of jobs
GET /job/<jobUid>/task returns a list of tasks
GET /job/<jobUid>/task/<taskUid> returns a task
GET /job/<jobUid>/package returns a list of packages
GET /job/<jobUid>/package/<packageUid> returns a list of child packages
GET /job/<jobUid>/package/file returns a list of all files in all packages
GET /job/<jobUid>/package/<packageUid>/file returns a list of files for that package
GET /job/<jobUid>/package/<packageUid>/file/<fileUid> returns a file.
GET /job/<jobUid>/package/file/<fileUid>/download downloads a file.
GET /job/<jobUid>/task/<taskUid>/file returns a list of files for that task.
PUT /job/<jobUid>/package/file/<fileUid>/upload uploads a file.
Is there an anticipation that a package contains a number of files?
I'm coming from the SDL Trados world, and there it's the default meaning, as well as in MemoQ, Déjà Vu and Wordfast.
From Wordfast help:
A package file contains all of the required information to work on a translation project. Using a package file, complete or incomplete projects can be shared between different Wordfast Pro users.
While `assets` are more linked to the software world, like assets in C#, Java or Unity.
FYI: We had some discussion about "packages" before in issue #22 .
I think this is all starting to make more sense to me. Incredible how changing the names of these objects can make it so different and more comfortable.
Please correct me if I am wrong:

1. A `Package` is a container of `Files`.
2. A `Package` can have child `Packages`? If yes, can the parent `Package` have `Files` as siblings?
3. If you want to upload a file, within one API request, you would need to associate that `File` with an already existing `Package` or create a new `Package`. There cannot be a `File` without a `Package`.
4. A `Task` can be associated with a `Package`. If this is true, then I can see a few problems with this. One of my concerns is updating the `Package` by adding more `Files` into it. Another problem is that enabling this would create a fog in which you cannot see the progress of particular items of the `Package`.
5. A `Task` can be associated with many `Files`. Similar to point number 4. I would rather propose that a `Task` will have `Task.groupId`, which would enable you to group separate `Task` objects together, yet still each `Task` would point to only one `File`.

here is a data model typescript definition of a `Package` based on my current understanding:
type Package = {
id: number,
jobId: number,
files: Array<number>,
packages: Array<number>,
createdAt: Date,
updatedAt: Date
}
here is the `File` data model type:
type File = {
id: number,
fileDescriptor: string, // unique name of the file. (auto-generated)
fileOriginalName: string, // original name of the file as uploaded. (auto-filled)
isReference: boolean, // if set to true, then the File is not supposed to be actionable
sourceLanguage: string,
encoding: string,
packageId: number,
tasks: Array<number>,
createdAt: Date,
updatedAt: Date
}
@ysavourel thanks for a link!
Packages here are just containers, and if we have a package from SDL Trados Studio, it would still be a file in our notation. Like you said in #22: "that can be handled at a different level" (the receiving system should deal with it in any way it wants: parse it into separate files or put it into some folder as-is).
What do you think about it?
- A `Package` can have child `Packages`? If yes, can the parent `Package` have `Files` as siblings?
In my world no. If you need several source packages per job you should add them separately. A plain structure is much simpler to consume, to filter, to map to anything and to understand.
- If you want to upload a file, within one API request, you would need to associate that `File` with an already existing `Package` or create a new `Package`. There cannot be a `File` without a `Package`.
Yes, because otherwise it's not clear where you're adding this file: into a source package or into one of the return packages.
- A `Task` can be associated with a `Package`. If this is true, then I can see a few problems with this. One of my concerns is updating the `Package` by adding more `Files` into it. Another problem is that enabling this would create a fog in which you cannot see the progress of particular items of the `Package`.
- A `Task` can be associated with many `Files`. Similar to point number 4. I would rather propose that a `Task` will have `Task.groupId`, which would enable you to group separate `Task` objects together, yet still each `Task` would point to only one `File`.
I'm not sure about this one. Do you have any other concerns except status check, if not then maybe we can have a simpler solution?
here is a data model typescript definition of a `Package` based on my current understanding:

type Package = {
id: number,
files: Array<number>,
packages: Array<number>,
createdAt: Date,
updatedAt: Date
}

here is the `File` data model type:

type File = {
id: number,
fileDescriptor: string, // unique name of the file. (auto-generated)
fileOriginalName: string, // original name of the file as uploaded. (auto-filled)
isReference: boolean, // if set to true, then the File is not supposed to be actionable
sourceLanguage: string,
encoding: string,
packageId: number,
tasks: Array<number>,
createdAt: Date,
updatedAt: Date
}
I'm not comfortable with `id` being numbers -- I don't see how I would be able to reuse UUIDs or proprietary alphanumeric ids that some systems provide for files (they may be forced to use UUIDs because of distributed file upload systems).
`Date` -- is it from RFC 3339 (ex `1990-12-31T23:59:60Z`)?
`sourceLanguage: string` -- is it from BCP 47 (ex `zh`, `en-US`, `mn-Cyrl-MN`)?
In my world no. If you need several source packages per job you should add them separately. A plain structure is much simpler to consume, to filter, to map to anything and to understand.
good, I agree.
I'm not sure about this one. Do you have any other concerns except status check, if not then maybe we can have a simpler solution?
my concern is not only the status check of particular `Files` (a common issue with points 4 and 5), but also handling updates when new Files are uploaded to a Package (an issue related to point 4 only).
I see that the reason why we would want to enable 4 and 5 is to make the data structure compatible with systems that are doing this. In my opinion this is bad design of those systems, but shall we inherit it? If yes, then I guess we have to accept the fact that we will lose the ability to check status on particular Files within a group.
Probably the fix to the problem with handling updates in point 4 would be to not implement it at all. If someone wants to group a Task to all Files inside a Package, it can be done by grouping the Task to the Files instead of the whole Package.
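The "group the Task to the Files instead of the whole Package" idea can be sketched as follows; `Task.groupId` and the other field names here are proposal-level assumptions, not a finalized schema:

```typescript
// Sketch: instead of linking one Task to a whole Package, create one Task per
// File and tie them together with a shared groupId. Field names are
// assumptions taken from this proposal, not a defined TAPICC schema.
type FileTask = {
  id: string;
  fileId: string;
  groupId: string;
  progress: 'pending' | 'finished';
};

// Expand a "process this whole package" request into per-file tasks.
function tasksForPackage(packageFileIds: string[], groupId: string): FileTask[] {
  return packageFileIds.map((fileId, i) => ({
    id: `${groupId}-${i + 1}`, // illustrative id scheme
    fileId,
    groupId,
    progress: 'pending',
  }));
}

// Per-file status stays visible, and group progress is just an aggregate.
function groupFinished(tasks: FileTask[], groupId: string): boolean {
  const group = tasks.filter((t) => t.groupId === groupId);
  return group.length > 0 && group.every((t) => t.progress === 'finished');
}
```

This keeps the one-to-one Task-to-File relationship while still letting a client ask "is the whole group done?", which is the status-visibility concern raised in point 4.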
I'm not comfortable with id being numbers -- I don't see how I would be able to reuse UUIDs or proprietary alphanumeric ids that some systems provide for files (they may be forced to use UUIDs because of distributed file upload systems).
Agree, makes sense to make them strings.
Date -- is it from RFC 3339 (ex 1990-12-31T23:59:60Z)?
I think yes, I wanted to reuse the same format that we currently use in the current implementation, e.g.:
"createdAt": "2018-11-20T19:55:44.055Z"
sourceLanguage: string -- is it from BCP 47 (ex zh, en-US, mn-Cyrl-MN)?
I don't know, can be decided and mentioned in the documentation.
Great! So we agree on all points here.
Probably the fix to a problem with handling updates in point 4. would be to not to implement it at all. IF someone would want to group a Task to all Files inside a Package, it could be done by grouping Task to Files instead and not the whole Package.
yes, and it's well defined in your data structures: `File` has backlinks to `Task` and `Package` doesn't.
So a `Task` should reference either particular files or a package? This seems something to think more about, because I was thinking to use packages as these groups for tasks.
In this case, it's clear that packages (or assets) are redundant and the only grouping mechanism we need are tasks themselves!
So the structure would be like this: a `Job` consists of `Tasks` which link to source and target `File`s (assuming that reference `File`s are available to everyone)!
So we are NOT making workflows -- tasks could be derived from files which existed during `Job` creation time.
@Alino are these assumptions right from your point of view?
So Task should reference either particular files or package? This seems something to think more about because I was thinking to use packages as these groups for tasks.
I am rather in favour of associating a Task with Files, and not making it possible to associate a Task with a Package, due to the update-handling issue...
In this case, it's clear that packages (or assets) are redundant and the only grouping mechanism we need are tasks themselves!
These were my thoughts from the beginning, but then I thought we wanted to add Packages as another container to improve mapping to existing systems somehow. I didn't understand that they were introduced just to solve grouping of Files to a Task.
So structure would be like this: Job consists of Tasks which links to source and target Files (assuming that reference Files are available to everyone)!
I would say this is what we have with the current data model in swagger. The only difference is that there is currently a one-to-one relationship between Task and Asset (File), and it seems we want to change it so that one Task can point to many Assets (Files).
So we are NOT making workflows -- tasks could be derived from files which existed during Job creation time:
I agree.
I think the output of a finished `Task` should be a new `File`. And they could be associated, for example, with Task.sourceFileId and Task.targetFileId.
Here is an example response from the API call to /files
Note that there is a finished Task which has targetFileId: 3, and that the File with id 3 has been created and is listed in files.
{
files: [
{
id: '1',
jobId: '1',
sourceLanguage: 'en',
encoding: 'utf8',
fileOriginalName: 'fileToTranslate.txt',
tasks: [
{
id: '1',
jobId: '1',
type: 'translation',
progress: 'finished',
targetLanguage: 'sk',
assignedTo: 'symfonie.com/1223214',
sourceFileId: 1,
targetFileId: 3
},
{
id: '2',
jobId: '1',
type: 'translation',
progress: 'pending',
targetLanguage: 'cz',
assignedTo: 'symfonie.com/1223214',
sourceFileId: 1,
targetFileId: null
}
]
},
{
id: '2',
isReference: true,
sourceLanguage: 'en',
encoding: 'utf8',
fileOriginalName: 'referenceText.pdf'
},
{
id: '3',
sourceLanguage: 'sk',
encoding: 'utf8',
fileOriginalName: 'fileToTranslate_sk.txt'
}
]
}
But this example still assumes a one-to-one relationship between Task and File. In theory we could change these properties (sourceFileId and targetFileId) to be arrays of ids. But maybe we could consider adding Task.groupId, which would let us maintain the one-to-one relationship while allowing us to group Tasks.
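The idea above that a finished Task produces a new File, linked via `targetFileId`, could be sketched like this; `deriveTargetName` is a hypothetical helper and the target-file naming convention is only illustrative:

```typescript
// Sketch: finishing a Task creates the target File and links it back via
// targetFileId. Field names follow the example /files response above;
// deriveTargetName and its naming convention are illustrative assumptions.
type TFile = { id: string; sourceLanguage: string; fileOriginalName: string };
type Task = {
  id: string;
  progress: 'pending' | 'finished';
  targetLanguage: string;
  sourceFileId: string;
  targetFileId: string | null;
};

// Hypothetical convention: fileToTranslate.txt -> fileToTranslate_sk.txt
function deriveTargetName(name: string, lang: string): string {
  const dot = name.lastIndexOf('.');
  return dot === -1 ? `${name}_${lang}` : `${name.slice(0, dot)}_${lang}${name.slice(dot)}`;
}

// Mark the task finished, create the target file, and link the two.
function finishTask(task: Task, source: TFile, files: TFile[], nextId: string): TFile {
  const target: TFile = {
    id: nextId,
    sourceLanguage: task.targetLanguage, // the new file is in the target language
    fileOriginalName: deriveTargetName(source.fileOriginalName, task.targetLanguage),
  };
  files.push(target);
  task.progress = 'finished';
  task.targetFileId = target.id;
  return target;
}
```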
Nice, I love how it is presented (with the addition of a `Task` linking to many `File`s)!
I still don't see why we may need this `groupId` property. If we need to provide some way to group files for an edge case, it can be done via a (string) `File.relativePath`.
I would like to introduce this property either way, so TAPICC hosts would be able to rebuild the original folder structure without needing to maintain a tree structure. Or is the relative path preserved in `fileOriginalName`?
updated model to add fields:
asset.originalPath
asset.description
asset.sourceUid - for recording where the asset came from, in the case of a transfer between systems.
job.description
job.financial - CAT model info, quote price, etc. String of whatever.
job.instructions
job.supplemental - user info of whatever. String of whatever.
job.sourceUid
task.description
task.financial
task.instructions
task.supplemental
task.taskGroupId
task.sourceUid
leaf.taskGroupId
leaf.description

removed leaf.taskUid
Concept of taskGroupUid: when a task is updated with a (bilingual) file, how does the next task know what file to download? The taskGroupId is a grouping of tasks that might need to be done in an order (dictated by the underlying system), so that when a particular task becomes active, it refers to the taskGroupId to query for the latest leaf in the task group.
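The taskGroupId lookup described here might be sketched as below; using an `updatedAt` timestamp as the ordering key for "latest leaf" is an assumption, since the proposal does not say how ordering is determined:

```typescript
// Sketch of the taskGroupId lookup: when a task becomes active, it queries
// the leaves in its task group and takes the most recent one to download.
// Ordering by an RFC 3339 updatedAt timestamp is an illustrative assumption.
type Leaf = { uid: string; taskGroupId: string; updatedAt: string };

// Pick the most recently updated leaf within a task group.
function latestLeaf(leaves: Leaf[], taskGroupId: string): Leaf | undefined {
  return leaves
    .filter((l) => l.taskGroupId === taskGroupId)
    .sort((a, b) => a.updatedAt.localeCompare(b.updatedAt)) // ISO strings sort chronologically
    .pop();
}
```

A server implementing GET /job/&lt;jobUid&gt;/task/&lt;taskGroupUid&gt;/leaf could apply the same selection on its side.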
remove
GET /job/<jobUid>/task/<taskUid>/leaf returns a list of leaf assets for that task.
add
GET /job/<jobUid>/task/<taskGroupUid>/leaf returns a list of leaf WG3 bilingual (or final target) assets for that task group.
there is a new proposal related to data models here https://github.com/GALAglobal/TAPICC-API-implementation/issues/58
@assembledStarDust can we close this proposal in favor of #58?
yeap.
Referring to issue #18 there is a comment I'd like to expand and explore.
For some TMS/CMS systems, the model of a tree structure of assets to tasks will map well, with little processing needed to map their existing internal structure onto the API. Others will need a lot of processing to map their structure.
I fear that this will impede adoption by the industry.
I would propose an alternative that might be more flexible. I acknowledge that this is only an idea at this minute and there may be inconsistencies.
The concept would be a tree structure of assets. Source assets would be at the root, with the various payload assets for each language as children. Tasks could reference any child and it would be possible to iterate through to the leaf or parent as required. I would expect that the task would want to be referencing the leaf in most cases.
A source file may have the WG3 payload for each language as children, and each payload may have the translated payload as children, and so on, until the final translated document at the leaf.
Tasks would typically reference the leaf file, and would add any uploaded documents onto the asset as a child. Tasks would be able to reference as many or as few leaf files as the implementer would want.
Take the example of, say, 500 source files to translate. One TMS/CMS implementation, with an internal structure of one-to-one mapping between task and file, would have 500 source files, 500 payload files, and 500 tasks referencing each of the 500 files. Another TMS/CMS implementation might have 500 source files and a single task referencing all 500 payload files.
Is this any better than the current model, where we have tasks as children of assets?