clamsproject / wfe-pipeline-runner


Deal with the location property #1

Closed: marcverhagen closed this issue 1 month ago

marcverhagen commented 3 years ago

The current implementation deals only with text documents, and then only if the text is embedded in the text property. If the text document uses the location property to refer to an external file, then using

$ curl -i -H "Accept: application/json" -X PUT -d@examples/mmif-east-tesseract.json http://0.0.0.0:5001/

is useless unless the value of the location property in the MMIF file is a valid path inside the container.
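
For illustration, a rough sketch (written as Python dictionaries) of the two kinds of text documents at play; the type URIs and exact property layout are placeholders, not copied from the MMIF spec. Only the first form can be processed without the server having access to the referenced file.

```python
# Hypothetical, simplified entries from a MMIF "documents" list.

embedded_doc = {
    "@type": ".../TextDocument",  # placeholder type URI
    "properties": {
        "id": "d1",
        # the text travels inside the MMIF file itself
        "text": {"@value": "Hello world"},
    },
}

external_doc = {
    "@type": ".../TextDocument",  # placeholder type URI
    "properties": {
        "id": "d2",
        # the server has to be able to resolve this path on ITS file system
        "location": "/data/example-transcript.txt",
    },
}
```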

marcverhagen commented 3 years ago

Well, it actually did work for the example I used, but that was an accident: the MMIF file referred to a file example-transcript.txt, which happened to live in the working directory at the time the image was created.

The thing is, the current implementation will work as long as the MMIF file that we send to the REST server uses a path that is local to the server as the value of the location property. That is unlikely to be the same path as on the client, so we end up being able to process the MMIF file either locally or on the server, but not both. Unless, that is, the client and server happen to use the same paths.

In general, there is something weird about sending in a MMIF file with the understanding that the locations it refers to are already on the server. I wonder if we should have a slightly different set up:

  - the documents live on a shared docker volume that is mounted into the containers, and
  - a dedicated container manages that volume, so a MMIF file can be created by asking that container for a file rather than by hard-coding server-side paths.

I think this means rewriting clams-python so it has more than one resource, and coming up with a plan for how to manage the data.

@keighrim You may have already thought about this when working on the appliance code, so I am interested in hearing your thoughts.

keighrim commented 3 years ago

I don't understand the necessity of a dedicated container to handle a shared docker volume, as a volume operates independently of existing images and running containers.

The way the appliance handles file locations is this: an archive directory provided by the user at build time, in which all documents (of different types) reside, is turned into a docker volume and mounted into all orchestrated containers at the same mount point at run time (currently read-only). In the appliance, the pipeline engine (galaxy) also runs as a container, and the location of the archive in its file system is identical to the location in the other app containers' file systems.
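
For context, a minimal sketch of that mounting scheme using the docker Python SDK (image names and paths here are made up); the only point is that every orchestrated container sees the archive at the same mount point, read-only:

```python
import docker

client = docker.from_env()

# one read-only bind mount, attached at the same path in every container
volumes = {"/srv/clams-archive": {"bind": "/data", "mode": "ro"}}

# hypothetical image names; in the appliance the galaxy engine is one of these
engine = client.containers.run("clams/pipeline-engine", detach=True, volumes=volumes)
app = client.containers.run("clams/east-textdetection", detach=True, volumes=volumes)

# inside both containers, /data/example-transcript.txt refers to the same file
```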

For this pure python-based engine, simple solutions would be:

  1. we run the engine as a container with the archive mounted at the same mount point (the same strategy as the galaxy appliance), OR
  2. we generate input MMIF files that use the paths of the media files as seen inside the app containers (taking the mount point into account), not the paths on the host file system.

I like the second option better because I think, in the end, the pipeline engine doesn't need to know or check whether the files specified in the locations are valid, simply because the engine does not process the media files in any way. If we want a fancier engine that checks the validity of media files, tool chains, and I/O specifications, we can use the first option, but I don't see that happening any time soon on our development roadmap.
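
A minimal sketch of what option 2 could look like on the side that generates the input MMIF, assuming the archive lives under /srv/clams-archive on the host and is mounted at /data in the app containers (all names here are made up):

```python
import json
from pathlib import PurePosixPath

HOST_ARCHIVE = PurePosixPath("/srv/clams-archive")  # where the files live on the host
CONTAINER_MOUNT = PurePosixPath("/data")            # where the app containers see them

def rewrite_locations(mmif_json: str) -> str:
    """Rewrite document locations from host paths to container paths."""
    mmif = json.loads(mmif_json)
    for doc in mmif.get("documents", []):
        props = doc.get("properties", {})
        loc = props.get("location")
        if loc and loc.startswith(str(HOST_ARCHIVE)):
            rel = PurePosixPath(loc).relative_to(HOST_ARCHIVE)
            props["location"] = str(CONTAINER_MOUNT / rel)
    return json.dumps(mmif)
```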


I also want to make a note on a possible (and perhaps soon necessary) expansion of the values of the location property. At the moment we are using vanilla unix paths as its values, but we might need to consider accepting only fully qualified URIs (with the file: scheme for local files) so that we can also support network file systems and remote storage via schemes such as http: and s3:.
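
As a rough sketch of what the consuming side could do once only fully qualified URIs are accepted (the scheme handling here is illustrative, not the actual clams-python behavior):

```python
from urllib.parse import urlparse

def resolve_location(location: str) -> str:
    """Return a local path for file: URIs; remote schemes would need fetching."""
    parsed = urlparse(location)
    if parsed.scheme in ("", "file"):
        # "file:///data/transcript.txt" and a plain path both map to a local path
        return parsed.path
    if parsed.scheme in ("http", "https", "s3"):
        # an app (or the engine) would download or stream the document here
        raise NotImplementedError(f"remote scheme not handled yet: {parsed.scheme}")
    raise ValueError(f"unsupported location scheme: {parsed.scheme}")
```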

marcverhagen commented 3 years ago

I like the second option better too, since it does not sound right to me that the paths have to be the same on both ends.

The thing that makes me a bit uneasy is that the MMIF file and the documents it refers to are kept separate. What I had in mind for that management container is that it would give you a list of the files on the volume, and that you could then create a MMIF file by sending that container a request with the identifier of a file.
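
A schematic sketch of that management container as a small Flask service; the routes, mount point, and the MMIF-like skeleton are all hypothetical (a real implementation would build proper MMIF, e.g. with mmif-python):

```python
import os
from flask import Flask, jsonify

ARCHIVE = "/data"  # the shared volume, mounted read-only
app = Flask(__name__)

@app.route("/files")
def list_files():
    """List the documents available on the shared volume."""
    return jsonify({"files": sorted(os.listdir(ARCHIVE))})

@app.route("/mmif/<file_id>")
def make_mmif(file_id):
    """Return a bare-bones MMIF-like skeleton pointing at a file on the volume."""
    return jsonify({
        "documents": [{
            "@type": ".../TextDocument",  # placeholder type URI
            "properties": {"id": "d1", "location": os.path.join(ARCHIVE, file_id)},
        }],
        "views": [],
    })
```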

On the location property: yes, that would be useful.

marcverhagen commented 3 years ago

Clearly, what we have now works, but at some point it might be nice to explore what the smoothest and most transparent way of dealing with the data would be.

marcverhagen commented 3 years ago

This issue is solved, but I am leaving it open for a little bit because some of the discussion may be relevant to a to-be-opened issue elsewhere.

keighrim commented 1 month ago

Closing as the discussion has concluded and I couldn't find any relevant "to-be-opened" issues. With better support for non-file:// schemes and customizable docloc plugins, I consider this problem completely solved.