GALAglobal / TAPICC-API-implementation

TAPICC API implementation using node.js framework sails.js

Proposal: Tree structure of assets #57

Closed. assembledStarDust closed this issue 5 years ago.

assembledStarDust commented 6 years ago

Referring to issue #18 there is a comment I'd like to expand and explore.

I’m just wondering how well that maps to existing CAT/TMS systems.

For some TMS/CMS systems, the model of assets linked to tasks will map well, with little processing needed to map their existing internal structure onto the API. Others will need a lot of processing to map their structure.

I fear that this will impede adoption by the industry.

I would propose an alternative that might be more flexible. I acknowledge that this is only an idea at this point and there may be inconsistencies.

The concept would be a tree structure of assets. Source assets would be at the root, with the various payload assets for each language as children. Tasks could reference any child and it would be possible to iterate through to the leaf or parent as required. I would expect that the task would want to be referencing the leaf in most cases.

A source file may have the WG3 payload for each language as children, and each payload may have the translated payload as children, and so on, until the final translated document at the leaf.

Tasks would typically reference the leaf file, and would add any uploaded documents onto the asset as a child. Tasks would be able to reference as many or as few leaf files as the implementer would want.
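As a rough illustration of the tree idea above, here is a minimal sketch assuming a hypothetical `children` array on each asset. The field names and shapes are illustrative only, not part of any agreed API:

```javascript
// Hypothetical asset tree: a source asset at the root, a per-language
// payload asset as its child, and the final translated document at the leaf.
const sourceAsset = {
  uid: 'src-1',
  name: 'manual.docx',
  children: [
    {
      uid: 'pay-sk',
      language: 'sk',
      children: [{ uid: 'final-sk', language: 'sk', children: [] }]
    }
  ]
}

// Walk from any asset down to the leaf of its branch.
function findLeaf(asset) {
  let current = asset
  while (current.children && current.children.length > 0) {
    current = current.children[0] // assumes one child per branch node
  }
  return current
}

console.log(findLeaf(sourceAsset.children[0]).uid) // 'final-sk'
```

A task referencing `pay-sk` could iterate down to `final-sk` (the leaf) or up to `src-1` (the root) as required.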

Taking the example of, say, 500 source files to translate. One TMS/CMS implementation with an internal one-to-one mapping between task and file would have 500 source files, 500 payload files, and 500 tasks, each referencing one of the 500 files. Another TMS/CMS implementation might have 500 source files and a single task referencing all 500 payload files.

Is this any better than the current model, where we have tasks as children of assets?

assembledStarDust commented 6 years ago

Problem: I feel uncomfortable with a task holding potentially unlimited references to assets. Proposed solution: don't have asset references in the task. Allow direct discovery of assets, and an iteration to the leaf of assets. That way REST pagination and limits can manage a long list.

Alino commented 5 years ago

I am struggling to visualise how this should work, but it might be a very good idea. Could you please create a visualisation for this, with data models and their relationships, something like this? https://dbeaver.jkiss.org/product/dbeaver-ss-dark.png

assembledStarDust commented 5 years ago

I hope to improve this over time with feedback.

latest model.

wg4

ysavourel commented 5 years ago

Questions:

assembledStarDust commented 5 years ago

updated graphic for:

[0..1] is 0 or 1 multiplicity.

EEnumerator is for Enumerate. I haven't figured out how to set Enumerate types yet using the Papyrus UML editor. In the meantime, the concept status ideas are:

job.status: { Inactive, Quote, Ready, In-progress, Complete }
task.status: { Inactive, Quote, Ready, In-progress, Complete }

Asset.type: { Placeholder, Source, Reference }
Leaf.type: { Placeholder, bilingual, target }

assembledStarDust commented 5 years ago

question:

Alino commented 5 years ago

What is the difference between leafDocument and Asset child, is it the same thing?

Perhaps we should focus here on what would be the expected response from the API, If the client makes a request to get all Assets.

Here is a scenario based on my understanding of this proposal:

Given that there is 1 Source Asset in the database
And there is 1 Reference Asset in the database
And there are 2 child Assets in the database
When the client makes a request to localhost:1337/assets
Then the response from the server is

{
  assets: [
    {
      uid: '1',
      name: 'my first Source Asset',
      type: 'source',
      status: 'prep',
      assets: [
        {
          uid: '3',
          name: 'my first child Asset',
          type: 'Final',
          language: 'sk',
          status: 'ready'
        },
        {
          uid: '4',
          name: 'my second child Asset',
          type: 'WG3',
          language: 'cz',
          status: 'prep'
        }
      ]
    },
    {
      uid: '2',
      name: 'my first Reference Asset',
      type: 'Reference',
      status: 'ready'
    }
  ]
}

Please let me know if my understanding of nesting these Assets is correct, or correct my JSON response if it is wrong.

assembledStarDust commented 5 years ago

What is the difference between leafDocument and Asset child, is it the same thing?

Yes, same thing.

An asset can have many children. A leaf can only have one parent and one (optional) child. It's anticipated that a branch is a file/language combination.

Yes, the JSON looks good. I see you've left out the optional date field, which is fine.

Edit: let's add further children to leaf uid 4 and demonstrate the JSON. I've needed to add two new fields to cater for the next child and leaf uids, and another convenience field to denote the leaf. These will help discovery of the leaf.

{
  assets: [
    {
      uid: '1',
      name: 'my first Source Asset',
      type: 'source',
      assets: [
        {
          uid: '3',
          name: 'my first child Asset',
          type: 'Final',
          language: 'sk',
          isLeaf: true
        },
        {
          uid: '4',
          name: 'my second child Asset',
          type: 'WG3',
          language: 'cz',
          childUid: '5',
          leafUid: '6',
          isLeaf: false
        }
      ]
    },
    {
      uid: '2',
      name: 'my first Reference Asset',
      type: 'Reference'
    }
  ]
}
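As a rough illustration (not an agreed API shape), a client could use these convenience fields to discover the leaf. The lookup table below stands in for assets fetched from the server; uids 5 and 6 are the hypothetical descendants of uid 4:

```javascript
// Hypothetical client-side use of the childUid/leafUid convenience fields:
// follow childUid links from any asset until isLeaf is true.
const assetsByUid = {
  '4': { uid: '4', childUid: '5', leafUid: '6', isLeaf: false },
  '5': { uid: '5', childUid: '6', isLeaf: false },
  '6': { uid: '6', isLeaf: true }
}

function walkToLeaf(uid) {
  let asset = assetsByUid[uid]
  while (!asset.isLeaf) {
    asset = assetsByUid[asset.childUid]
  }
  return asset
}

// leafUid lets a client jump straight to the leaf instead of walking:
console.log(walkToLeaf('4').uid === assetsByUid['4'].leafUid) // true
```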

assembledStarDust commented 5 years ago

leaf.status. The idea behind this was to denote when a file was being worked on by either the API or the underlying system. The status would be prep when the file was not ready for download, and ready when it was. However, I'm wondering if that's superfluous, as the same functionality could be handled by simply not showing the leaf in the first place until it is ready.

assembledStarDust commented 5 years ago

task.putLeaf() function. There was a comment on the last call to the effect that putting a completed WG3 package back in the right branch may be a problem. I agree that this could be problematic. Any system that accepts user input must account for bad input. A key check would be the uid embedded in the file.
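A minimal sketch of that uid check, with hypothetical names (the function name and error message are illustrative, not part of any agreed API):

```javascript
// Hypothetical validation: before accepting an uploaded WG3 package,
// compare the uid embedded in the file with the leaf the client is
// putting it onto, and reject the upload on a mismatch.
function validateUpload(embeddedUid, targetLeafUid) {
  if (embeddedUid !== targetLeafUid) {
    throw new Error(
      `Upload rejected: file uid ${embeddedUid} does not match leaf ${targetLeafUid}`
    )
  }
  return true
}
```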

assembledStarDust commented 5 years ago

security. The default is open. Some implementers may prefer, for example, that users do not have access to assets or branches other than their own language. In that case, implement OAuth scopes.

Alino commented 5 years ago

If we go back to the original model which is currently in Swagger, the same scenario I described above would have this JSON response:

{
  assets: [
    {
      id: '1',
      jobId: '1',
      name: 'my first Source Asset',
      sourceLanguage: 'en',
      encoding: 'utf8',
      fileOriginalName: 'fileToTranslate.txt',
      tasks: [
        {
          id: '1',
          type: 'translation',
          progress: 'finished',
          targetLanguage: 'sk',
          assignedTo: 'symfonie.com/1223214',
          assetId: 1
        },
        {
          id: '2',
          type: 'translation',
          progress: 'pending',
          targetLanguage: 'cz',
          assignedTo: 'symfonie.com/1223214',
          assetId: 1
        }
      ]
    },
    {
      id: '2',
      isReference: true,
      sourceLanguage: 'en',
      encoding: 'utf8',
      fileOriginalName: 'refferenceText.pdf'
    }
  ]
}

For me personally, this old model is easier to reason about. Tasks here contain the same data as leafDocument. Isn't the same thing achieved already in the old model? Probably I am missing the problem here, or the meaning of "direct discovery of Assets"? Could you please explain the problem we would get into using this old model?

assembledStarDust commented 5 years ago

The root of the problem is that we are attempting to map an API to many different TMS/CMS systems that are available. The graphic illustrates a couple of different TMS configurations I've seen. The configuration on the left easily maps to the old data model, but the configuration on the right does not. class_diagram

What I'm suggesting by "discovery" is that the model we choose has enough in it to cover both example TMS configurations and any user of the API can "discover" the underlying configuration, whatever it is.

Alino commented 5 years ago

Perhaps I am missing what is the expected result of "mapping an API to many different TMS/CMS". Is the expected result a transformation/replica of the data from the TAPICC API in the TMS/CMS system?

If yes, then I don't see a problem with our old model, because the TMS/CMS can accept an Asset and a Task which are associated together.

The problem can occur if we were doing the opposite: if we want to transform/replicate data from the TMS/CMS to the TAPICC server. In that case, the implementer would need to write a map function, which would split a Task which has multiple Assets into as many Tasks as there are Assets associated with it, so that it can map to the TAPICC data model.

It's not really hard to create such a map function, only a few lines of code.

assembledStarDust commented 5 years ago

Indeed, writing a mapping function can be done. But it goes back to my original point, that I fear such a configuration that does not easily map to the underlying system will impede adoption by the industry.

Alino commented 5 years ago

But is that really an obstacle for the implementers? This mapping requirement could be documented. Here is an extremely simple example of how this mapping could be done in JavaScript, with just a single line of code (the last one).

// one task referencing three assets
const assets = [{ id: 1 }, { id: 2 }, { id: 3 }]
const task = { id: 1, assetIds: [1, 2, 3] }
// split into one task object per asset
const splitTasks = task.assetIds.map((assetId) => ({ assetId }))

https://jsfiddle.net/jy38kzr1/


I would rather have a data model which is easier to reason about at the cost of having the implementer map the task to tasks, than increasing the complexity of the data model so that the implementer can map task more easily.

But this is just my opinion, let's hear from others as well

assembledStarDust commented 5 years ago

Search your email for another dissenting voice, subject heading on the thread as:

TAPICC WG #4 - question from Jost

Alino commented 5 years ago

Unfortunately I cannot find anything in that email thread which would be related to this topic. Maybe I am missing some email in that thread. Could you please quote it here, for others who don't have that email?

Thank you

assembledStarDust commented 5 years ago

I have obtained an ok from the author to post this snippet from email.

Hi,

I joined the TAPICC initiative hoping to develop a standardized connector for desktop CAT tools and web-based TMS. Unfortunately, in last WG4 meeting it was clear to me the group goal does not contemplate connecting a CAT tool with a TMS.

It seems that the API is being designed to be used only by a CMS and a TMS that belong to the same company, not as an API that would let systems from different vendors interconnect.

On the initial meetings someone mentioned that the goal was to have something similar or better than COTI but with friendlier legal framework that would be more open. The impression I got in the last WG4 meeting was that the original goal was lost.

I hope to be wrong and in the future there would be an open TAPICC API that lets different kinds of systems interact in an standardized way.

Regards,

Rodolfo

assembledStarDust commented 5 years ago

some ideas for endpoints

edit to add/change some endpoints

GET /job returns a list of jobs

GET /job/<jobUid>/task returns a list of tasks

GET /job/<jobUid>/task/<taskUid> returns a task

GET /job/<jobUid>/asset returns a list of assets

GET /job/<jobUid>/asset/<assetUid> returns a list of child assets

GET /job/<jobUid>/asset/<assetUid>/download download asset

GET /job/<jobUid>/asset/leaf returns a list of leaf assets

GET /job/<jobUid>/asset/<assetUid>/leaf returns a list of leaf assets for that assetUid

GET /job/<jobUid>/asset/<assetUid>/leaf/<leafUid> returns a leaf asset.

GET /job/<jobUid>/asset/leaf/<leafUid>/download downloads a leaf asset.

GET /job/<jobUid>/task/<taskGroupUid>/leaf returns a list of leaf assets for that task group

PUT /job/<jobUid>/asset/leaf/<leafUid>/upload uploads a leaf asset. This file then becomes the leaf.

PUT /job/<jobUid>/asset/leaf/<leafUid>/uploadTarget uploads a target asset. This file then becomes the leaf.
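As a rough sketch of how a client might consume the list above, a small helper could build the paths. The base URL and uid values are placeholders, and only a few of the endpoints are shown:

```javascript
// Hypothetical path builder for some of the proposed endpoints.
const base = 'http://localhost:1337'

const endpoints = {
  tasks: (jobUid) => `${base}/job/${jobUid}/task`,
  assetChildren: (jobUid, assetUid) => `${base}/job/${jobUid}/asset/${assetUid}`,
  leafDownload: (jobUid, leafUid) =>
    `${base}/job/${jobUid}/asset/leaf/${leafUid}/download`
}

console.log(endpoints.leafDownload('j1', 'l6'))
// http://localhost:1337/job/j1/asset/leaf/l6/download
```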

terales commented 5 years ago

Nice list!

I feel uncomfortable with the terms asset and leaf -- it's not clear what they mean to me. Could we use the usual language there (packages and files)?

GET /job returns a list of jobs

GET /job/<jobUid>/task returns a list of tasks

GET /job/<jobUid>/task/<taskUid> returns a task

GET /job/<jobUid>/package returns a list of packages

GET /job/<jobUid>/package/<packageUid> returns a list of child packages

GET /job/<jobUid>/package/file returns a list of all files in all packages

GET /job/<jobUid>/package/<packageUid>/file returns a list of files for that package

GET /job/<jobUid>/package/<packageUid>/file/<fileUid> returns a file.

GET /job/<jobUid>/package/file/<fileUid>/download downloads a file.

GET /job/<jobUid>/task/<taskUid>/file returns a list of files for that task.

PUT /job/<jobUid>/package/file/<fileUid>/upload uploads a file.

assembledStarDust commented 5 years ago

Is there an anticipation that a package contains a number of files?

terales commented 5 years ago

I'm coming from the SDL Trados world and there it's the default meaning, as well as in MemoQ, Déjà Vu and Wordfast.

From Wordfast help:

A package file contains all of the required information to work on a translation project. Using a package file, complete or incomplete projects can be shared between different Wordfast Pro users.


While assets are more linked to the software world, like assets in C#, Java or Unity.

ysavourel commented 5 years ago

FYI: We had some discussion about "packages" before in issue #22 .

Alino commented 5 years ago

I think this is all starting to make more sense to me. Incredible how changing the names of these objects can make it so different and more comfortable.

Please correct me if I am wrong:

  1. So a Package is a container of Files
  2. A Package can have child Packages? If yes, can the parent Package have Files as siblings?
  3. If you want to upload a file, within one API request, you would need to associate that File with an already existing Package or create a new Package. There cannot be a File without a Package.
  4. A Task can be associated with a Package. If this is true, then I can see a few problems with this. One of my concerns is updating the Package by adding more Files into it. Another problem is that enabling this would create a fog in which you cannot see the progress of particular items of the Package.
  5. A Task can be associated with many Files. Similar to point number 4. I would rather propose that a Task will have Task.groupId, which would enable you to group separate Task objects together, yet still each Task would point to only one File.

here is a data model typescript definition of a Package based on my current understanding:

type Package = {
  id: number,
  jobId: number,
  files: Array<number>,
  packages: Array<number>,
  createdAt: Date,
  updatedAt: Date
}

here is File data model type

type File = {
  id: number,
  fileDescriptor: string, // unique name of the file. (auto-generated)
  fileOriginalName: string, // original name of the file as uploaded. (auto-filled)
  isReference: boolean, // if set to true, the File is not supposed to be actionable
  sourceLanguage: string,
  encoding: string,
  packageId: number,
  tasks: Array<number>,
  createdAt: Date,
  updatedAt: Date
}
terales commented 5 years ago

@ysavourel thanks for the link!

Packages here are just containers, and if we have a package from SDL Trados Studio it would still be a file in our notation. Like you said in #22: "that can be handled at a different level" (the receiving system should deal with it in any way it wants: parse it into separate files or put it into some folder as is).

What do you think about it?

terales commented 5 years ago
  2. A Package can have child Packages? If yes, can the parent Package have Files as siblings?

In my world no. If you need several source packages per job you should add them separately. A plain structure is much simpler to consume, to filter, to map to anything and to understand.


  3. If you want to upload a file, within one API request, you would need to associate that File with an already existing Package or create a new Package. There cannot be a File without a Package.

Yes, because it's not clear where you're adding this file: into a source package or into one of the return packages.


  4. A Task can be associated with a Package. If this is true, then I can see few problems with this. One of my concerns is updating the Package by adding more Files into it. Another problem is that enabling this, would create a fog in which you cannot see the progress of particular items of the Package.
  5. A Task can be associated with many Files. Similar to point number 4. I would rather propose that a Task will have Task.groupId which would enable you to group separate Task objects together, yet still each Task would point to only one File.

I'm not sure about this one. Do you have any other concerns besides the status check? If not, then maybe we can have a simpler solution?


here is a data model typescript definition of a Package based on my current understanding:

type Package = {
  id: number,
  files: Array<number>,
  packages: Array<number>,
  createdAt: Date,
  updatedAt: Date
}

here is File data model type

type File = {
  id: number,
  fileDescriptor: string, // unique name of the file. (auto-generated)
  fileOriginalName: string, // original name of the file as uploaded. (auto-filled)
  isReference: boolean, // if set to true, the File is not supposed to be actionable
  sourceLanguage: string,
  encoding: string,
  packageId: number,
  tasks: Array<number>,
  createdAt: Date,
  updatedAt: Date
}

I'm not comfortable with ids being numbers -- I don't see how I would be able to reuse UUIDs or proprietary alphanumeric ids that some systems provide for files (they may be forced to use UUIDs because of distributed file upload systems).

Date -- is it from RFC 3339 (ex 1990-12-31T23:59:60Z)?

sourceLanguage: string -- is it from BCP 47 (ex zh, en-US, mn-Cyrl-MN)?

Alino commented 5 years ago

In my world no. If you need several source packages per job you should add them separately. A plain structure is much simpler to consume, to filter, to map to anything and to understand.

good, I agree.

I'm not sure about this one. Do you have any other concerns except status check, if not then maybe we can have a simpler solution?

My concern is not only the status check of particular Files (an issue common to points 4 and 5), but also handling updates when new Files are uploaded to a Package (an issue related to point 4 only). I see that the reason we would want to enable 4 and 5 is to make the mapping of the data structure compatible with systems that are doing this. In my opinion this is bad design in those systems. But shall we inherit it? If yes, then I guess we have to accept the fact that we will lose the ability to check the status of particular Files within a group.

Probably the fix to the problem with handling updates in point 4 would be not to implement it at all. If someone wanted to group a Task to all Files inside a Package, it could be done by grouping the Task to the Files instead, and not the whole Package.

I'm not comfortable with id being numbers -- I don't see how I would be able to reuse UUIDs or proprietary alphanumeric ids that some systems provide for files (they may be forced to use UUIDs because of distributed file upload systems).

Agree, makes sense to make them strings.

Date -- is it from RFC 3339 (ex 1990-12-31T23:59:60Z)?

I think yes, I wanted to reuse the same format that we currently use in the current implementation, e.g.: "createdAt": "2018-11-20T19:55:44.055Z"

sourceLanguage: string -- is it from BCP 47 (ex zh, en-US, mn-Cyrl-MN)?

I don't know, can be decided and mentioned in the documentation.

terales commented 5 years ago

Great! So we agree on all points here.

Probably the fix to a problem with handling updates in point 4. would be to not to implement it at all. IF someone would want to group a Task to all Files inside a Package, it could be done by grouping Task to Files instead and not the whole Package.

Yes, and it's well defined in your data structures: File has backlinks to Task and Package doesn't. So a Task should reference either particular files or a package? This seems like something to think more about, because I was thinking of using packages as these groups for tasks.


Consider this example:

image

In this case, it's clear that packages (or assets) are redundant and the only grouping mechanism we need is tasks themselves!

So the structure would be like this: a Job consists of Tasks which link to source and target Files (assuming that reference Files are available to everyone)!


So we are NOT making workflows -- tasks could be derived from files which existed during Job creation time:

image

@Alino are these assumptions right from your point of view?

Alino commented 5 years ago

So Task should reference either particular files or package? This seems something to think more about because I was thinking to use packages as these groups for tasks.

I am rather in favour of associating a Task with Files, and not making it possible to associate a Task with a Package, due to the update-handling issue...

In this case, it's clear that packages (or assets) are redundant and the only grouping mechanism we need are tasks themselves!

These were my thoughts from the beginning, but then I thought we wanted to add Packages as another container to improve the mapping with existing systems somehow. I didn't understand that they were introduced just to solve grouping of Files to a Task.

So structure would be like this: Job consists of Tasks which links to source and target Files (assuming that reference Files are available to everyone)!

I would say this is what we have with the current data model in Swagger. The only difference is that there is currently a one-to-one relationship between Task and Asset (File), and it seems we want to change it so that one Task can point to many Assets (Files).

So we are NOT making workflows -- tasks could be derived from files which existed during Job creation time:

I agree.

I think an output of a finished Task should be a new File. And they could be associated for example with Task.sourceFileId and Task.targetFileId

Here is an example response from the API call to /files

Note that there is a finished Task which has targetFileId: 3, and that the File with id 3 has been created and is listed in files.

{
  files: [
    {
      id: '1',
      jobId: '1',
      sourceLanguage: 'en',
      encoding: 'utf8',
      fileOriginalName: 'fileToTranslate.txt',
      tasks: [
        {
          id: '1',
          jobId: '1',
          type: 'translation',
          progress: 'finished',
          targetLanguage: 'sk',
          assignedTo: 'symfonie.com/1223214',
          sourceFileId: 1,
          targetFileId: 3
        },
        {
          id: '2',
          jobId: '1',
          type: 'translation',
          progress: 'pending',
          targetLanguage: 'cz',
          assignedTo: 'symfonie.com/1223214',
          sourceFileId: 1,
          targetFileId: null
        }
      ]
    },
    {
      id: '2',
      isReference: true,
      sourceLanguage: 'en',
      encoding: 'utf8',
      fileOriginalName: 'refferenceText.pdf'
    },
    {
      id: '3',
      sourceLanguage: 'sk',
      encoding: 'utf8',
      fileOriginalName: 'fileToTranslate_sk.txt'
    }
  ]
}

But this example is still assuming that there is a one-to-one relationship between Task and File. In theory we could change these properties (sourceFileId and targetFileId) to be arrays of ids. But maybe we could consider adding Task.groupId, which would enable us to maintain the one-to-one relationship while allowing us to group Tasks.
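A minimal sketch of the Task.groupId idea, keeping the one-to-one Task/File link. The field values below are illustrative, and groupId itself is only a proposal at this point:

```javascript
// Hypothetical Task.groupId: each Task keeps a one-to-one link to a File,
// while groupId lets a client collect related Tasks together.
const tasks = [
  { id: '1', groupId: 'g1', sourceFileId: '1', targetLanguage: 'sk' },
  { id: '2', groupId: 'g1', sourceFileId: '1', targetLanguage: 'cz' },
  { id: '3', groupId: 'g2', sourceFileId: '2', targetLanguage: 'sk' }
]

// Collect tasks into groups without changing the one-to-one Task/File link.
function groupTasks(taskList) {
  return taskList.reduce((groups, task) => {
    (groups[task.groupId] = groups[task.groupId] || []).push(task)
    return groups
  }, {})
}

console.log(groupTasks(tasks)['g1'].length) // 2
```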

terales commented 5 years ago

Nice, I love how it's presented (with the addition of Task links to many Files)!

I still don't see why we may need this groupId property. If we need to provide some way to group files for an edge case, they can do it via (string) File.relativePath.

I would like to introduce this property either way, so TAPICC hosts would be able to rebuild the original folder structure without the need to maintain a tree structure. Or is the relative path preserved in fileOriginalName?

assembledStarDust commented 5 years ago

updated model to add fields

asset.originalPath
asset.description
asset.sourceUid for recording where the asset came from, in the case of a transfer between systems.

job.description
job.financial - CAT model info, quote price etc. String of whatever.
job.instructions
job.supplemental - user info of whatever. String of whatever
job.sourceUid 

task.description
task.financial 
task.instructions
task.supplemental
task.taskGroupId
task.sourceUid

leaf.taskGroupId 
leaf.description
removed leaf.taskUid

Concept of taskGroupUid: when a task is updated with a (bilingual) file, how does the next task know what file to download? The taskGroupId is a grouping of tasks that might need to be done in an order (dictated by the underlying system), so that when a particular task becomes active, it refers to the taskGroupId to query for the latest leaf in the task group.
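A rough sketch of that lookup, with hypothetical leaf shapes and an assumed createdAt timestamp (none of these field names are settled):

```javascript
// Hypothetical taskGroupId lookup: when a task becomes active, find the
// most recent leaf in its task group so it knows which file to download.
const leaves = [
  { uid: 'l1', taskGroupId: 'g1', createdAt: '2019-01-01T10:00:00Z' },
  { uid: 'l2', taskGroupId: 'g1', createdAt: '2019-01-03T10:00:00Z' },
  { uid: 'l3', taskGroupId: 'g2', createdAt: '2019-01-02T10:00:00Z' }
]

function latestLeafInGroup(taskGroupId) {
  return leaves
    .filter((leaf) => leaf.taskGroupId === taskGroupId)
    .sort((a, b) => new Date(b.createdAt) - new Date(a.createdAt))[0]
}

console.log(latestLeafInGroup('g1').uid) // 'l2'
```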

remove

GET /job/<jobUid>/task/<taskUid>/leaf returns a list of leaf assets for that task.

add

GET /job/<jobUid>/task/<taskGroupUid>/leaf returns a list of leaf WG3 bilingual (or final target) for that task group.

Alino commented 5 years ago

there is a new proposal related to data models here https://github.com/GALAglobal/TAPICC-API-implementation/issues/58

terales commented 5 years ago

@assembledStarDust can we close this proposal in favor of #58?

assembledStarDust commented 5 years ago

yeap.