Closed marcverhagen closed 1 month ago
Well, it actually did work for the example I used, but that was an accident: the MMIF file referred to a file `example-transcript.txt` which, at the time the image was created, happened to live in the working directory.
The thing is, the current implementation will work as long as the MMIF file that we send to the REST server uses a path that is local to the server as the value of the `location` property. This is unlikely to be the same path as the location on the client, so we can process the MMIF file either locally or on the server, but not both. Unless, that is, the client and the server have the same paths.
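To make the mismatch concrete, here is a minimal sketch of what the server side effectively does when it receives a MMIF file; `resolve_location` is a hypothetical helper, not part of clams-python, and it shows why a `location` value only works if it is a valid path on the host that opens it:

```python
import os

def resolve_location(location: str) -> str:
    """Hypothetical sketch: return `location` only if it exists on
    *this* host's file system.

    When a client POSTs a MMIF file to the REST server, the server can
    only open `location` values that are valid paths inside its own
    container; a path that is valid on the client machine is just an
    opaque string here.
    """
    if not os.path.exists(location):
        raise FileNotFoundError(
            f"{location!r} is not a local path on this host; "
            "client and server must share the same paths (or a mount)")
    return location
```

So the same MMIF file either resolves on the client or on the server, depending purely on whose file system the path happens to match.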
In general, there is something weird about sending in a MMIF file with the understanding that the locations referred to in the file are already on the server. I wonder if we should have a slightly different setup.
I think this means rewriting `clams-python` so it has more than one resource, and coming up with a plan for how to manage the data.
@keighrim You may have already thought about this when working on the appliance code, so I am interested in hearing your thoughts.
I don't understand the necessity of a dedicated container to handle a shared docker volume, as a volume operates independently from existing images and running containers.
The way the appliance handles file locations is that an archive directory, provided by the user at build time and containing all documents (of different types), is turned into a docker volume at build time and mounted (currently as `ro`) to all orchestrated containers at the same mount point at run time. In the appliance, the pipeline engine (Galaxy) also runs as a container, and the location of the archive in its file system is identical to the locations in the other app containers' file systems.
For this pure Python-based engine, simple solutions would be:
I like the second option better because, in the end, the pipeline engine doesn't need to know or check whether the files specified in the `location` properties are valid, simply because the engine does not process the media files in any way. If we want a fancier engine that checks the validity of media files, tool chains, and I/O specifications, we can use the first option, but I don't think that's going to happen any time soon on our development roadmap.
I also want to make a note on a possible (and perhaps soon necessary) expansion of the values of the `location` prop. At the moment we are using vanilla unix paths as its values, but we might need to consider accepting only fully qualified URIs (for local files, with the `file:` scheme) so that we can support network file systems, including `http:` and `s3:`.
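As a rough sketch of what accepting fully qualified URIs could look like, the helper below (hypothetical, not part of clams-python or the MMIF spec) classifies a `location` value by its URI scheme, treating a bare unix path as an implicit `file:`:

```python
from urllib.parse import urlparse

def location_scheme(location: str) -> str:
    """Hypothetical sketch: classify a MMIF `location` value by scheme.

    Assumes a bare unix path like /archive/video.mp4 is treated as an
    implicit `file` scheme, while `http:`/`https:`/`s3:` values would
    be fetched remotely by the app that needs the media.
    """
    parsed = urlparse(location)
    # urlparse leaves .scheme empty for a bare path, so default to file
    return parsed.scheme or "file"
```

An app could then dispatch on the returned scheme: open `file` locations locally, stream `http(s)` ones, and use an S3 client for `s3`.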
I like the second option better too, since it does not sound right to me that the paths have to be the same on both ends.
The thing that makes me a bit uneasy is that the MMIF file and the documents it refers to are separated. What I had in mind for the management container is that it would give you a list of files on the volume, and that you could create the MMIF file by sending that container a request with the identifier of the file.
On the `location` property: yes, that would be useful.
Clearly, what we have now works, but at some point it might be nice to explore what the smoothest and most transparent ways are of dealing with data.
This issue is solved, but leaving it open for a little bit because some of the discussion may be relevant for a to-be-opened issue elsewhere.
Closing, as the discussion concluded and I couldn't find any "to-be-opened" relevant issues. With better support for non-`file://` schemes and customizable docloc plugins, I consider this problem completely solved.
The current implementation only deals with text documents, and then only if the text is embedded in a `text` property. If the text document uses the `location` property to refer to an external file, then it is useless unless the value of `location` in the MMIF file is a valid path on the container.
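A minimal sketch of the distinction described above, using a plain dict in place of the real mmif-python document objects (the `text`/`@value` and `location` keys follow the MMIF serialization, but this helper itself is hypothetical): an embedded `text` value works anywhere, while a `location` path only works on the machine or container that actually holds the file.

```python
def get_document_text(doc: dict) -> str:
    """Hypothetical sketch: fetch a text document's content.

    Embedded text travels with the MMIF file itself, so it can be read
    on client and server alike; a `location` value must resolve to a
    readable path on whichever host runs this code.
    """
    if "text" in doc:
        # text embedded directly in the MMIF document
        return doc["text"]["@value"]
    # otherwise fall back to the external file; this open() fails
    # unless the path is valid inside the current container
    with open(doc["location"], encoding="utf-8") as f:
        return f.read()
```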