SCALE-MS / scale-ms

SCALE-MS design and development
GNU Lesser General Public License v2.1

File object management #129

Open eirrgang opened 3 years ago

eirrgang commented 3 years ago

Establish an abstract interface for filesystem objects that can be implemented for RP. File references should allow data flow to be defined with minimal coupling to actual data location at the time of expression. File Futures allow reference to file objects that do not yet exist.

A File reference must be easy to localize to the contexts of different workflow managers, for example when data moves from the client environment to the execution environment and back again.

Data localization and path management must be handled automatically by the WorkflowManager instances.

Unnecessary data transfers must be avoidable through optimization code in WorkflowManager.
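A minimal sketch of what such an interface could look like, using only the standard library. Every name here (FileReference, LocalFile, produce_file, demo) is hypothetical and not part of scalems, RP, or SAGA; the "File Future" is modeled as a plain concurrent.futures.Future that resolves to a reference, and "avoiding unnecessary transfers" is reduced to skipping the copy when the file is already in the target directory.

```python
import abc
import hashlib
import pathlib
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor


class FileReference(abc.ABC):
    """Hypothetical handle for a file, independent of where the data lives."""

    @abc.abstractmethod
    def identifier(self) -> str:
        """Return a location-independent identifier for the file content."""

    @abc.abstractmethod
    def localize(self, directory: pathlib.Path) -> pathlib.Path:
        """Make the file available under *directory* and return its path."""


class LocalFile(FileReference):
    """Reference to a file that already exists on the local filesystem."""

    def __init__(self, path: pathlib.Path):
        self._path = pathlib.Path(path).resolve()

    def identifier(self) -> str:
        # Assumption: a content hash is an acceptable identifier.
        return hashlib.sha256(self._path.read_bytes()).hexdigest()

    def localize(self, directory: pathlib.Path) -> pathlib.Path:
        target = pathlib.Path(directory).resolve() / self._path.name
        # Skip the transfer when the file is already where it is needed.
        if target != self._path:
            shutil.copy2(self._path, target)
        return target


def produce_file(path: pathlib.Path, text: str) -> LocalFile:
    """Stand-in for a task whose output file does not exist until it runs."""
    path.write_text(text)
    return LocalFile(path)


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        workdir = pathlib.Path(tmp).resolve()
        source, staging = workdir / "source", workdir / "staging"
        source.mkdir()
        staging.mkdir()
        with ThreadPoolExecutor(max_workers=1) as executor:
            # The future can be passed around before the file exists.
            future = executor.submit(produce_file, source / "conf.txt", "data")
            ref = future.result()
        staged = ref.localize(staging)
        # Localizing a file that is already in place is a no-op.
        again = LocalFile(staged).localize(staging)
        same_content = ref.identifier() == LocalFile(staged).identifier()
        return staged.read_text(), again == staged, same_content
```

An RP-backed implementation would presumably replace the body of localize with staging directives or SAGA copy operations while keeping the same interface.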

Relates to #75

eirrgang commented 3 years ago

It looks like radical.saga.filesystem.File already provides most of the interface we would want for abstractly handling local versus remote files.

@andre-merzky and @mturilli does it seem reasonable to build on SAGA here? How would I extract a radical.saga.Session from the radical.pilot.Session?

Presumably, somewhere under the hood, RP translates its URL scheme to SAGA URLs? Is there an accessible function that client software could use to get a URL with a scheme radical.saga would understand? Or does RP register its extra schemes with the underlying saga resolver?

andre-merzky commented 3 years ago

does it seem reasonable to build on SAGA here?

Yes, I assume so

How would I extract a radical.saga.Session from the radical.pilot.Session?

The rp.Session inherits from the rs.Session, so you should be able to use that session as-is.

Presumably, somewhere under the hood, RP translates its URL scheme to SAGA URLs? Is there an accessible function that client software could use to get a URL with a scheme radical.saga would understand? Or does RP register its extra schemes with the underlying saga resolver?

It is not exposed. The URL translation code in RP is complex and acts differently depending on where the URL is used, which component requests the translation, etc., so I doubt it is immediately useful to ScaleMS.

eirrgang commented 3 years ago

so I doubt it is immediately useful to ScaleMS

Specifically, I am trying to figure out the easiest way to extract a saga File object from a path which may be based on one of the RP-specific URIs provided by RP objects. It isn't always obvious to a programmer whether an attribute is going to need extra processing. The RP documentation mentions radical.utils.Url, but it seems like all of the "sandbox" attributes are just strings.

However, it appears that urllib.parse.urlparse() can handle regular posix file paths just fine, so environment_path = pathlib.Path(urllib.parse.urlparse(rpcomponent.sandbox).path) should be a reasonable normalizer.
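A minimal sketch of that normalizer, assuming the sandbox attribute is either a plain POSIX path or a URL string. The function name sandbox_path and the example strings are made up for illustration; the sandbox strings RP actually produces may look different. PurePosixPath is used instead of pathlib.Path so the result does not depend on the client platform.

```python
import pathlib
import urllib.parse


def sandbox_path(sandbox: str) -> pathlib.PurePosixPath:
    """Return the path component of a sandbox string, with or without a scheme."""
    return pathlib.PurePosixPath(urllib.parse.urlparse(sandbox).path)


# Hypothetical inputs: a URL with a scheme and host, and a bare path.
from_url = sandbox_path("file://localhost/tmp/rp.session.0000/pilot.0000/")
from_path = sandbox_path("/tmp/rp.session.0000/pilot.0000")

# The scheme and host are still available separately if they are needed later.
parts = urllib.parse.urlparse("sftp://host.example.org/scratch/pilot.0000")
```

One caveat: urlparse() treats "?" and "#" as query and fragment delimiters, so a bare path containing either character would be truncated by this approach.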

But what would be the best way to insert the appropriate SAGA access scheme? (I think this amounts to the filesystem_endpoint from the resource description for the Pilot.)

andre-merzky commented 3 years ago

You can always get the task and pilot sandboxes via task.sandbox and pilot.sandbox, and those URLs will include the access scheme (which is indeed based on the respective config entry).

What operations do you intend to implement (beyond those provided by the staging ops)?

eirrgang commented 3 years ago

What operations do you intend to implement (beyond those provided by the staging ops)?

I don't think there is a need for anything beyond the staging ops. But I think I don't need to write a wrapper for RP-based file references that includes a bound Pilot, Task, and/or Session if I can easily get a saga.filesystem.File. I can just check whether a source or target path is a saga object to dispatch the copy. I could maybe even convert all Path and PathLike references to saga objects instead of using a scalems.File object.

task.sandbox and pilot.sandbox are just str objects, right? They aren't saga.Url or saga.Directory objects? (If __repr__ == __str__, I may have totally missed this! I should check now. I hope I didn't make a misguided assumption.)

eirrgang commented 3 years ago

task.sandbox and pilot.sandbox are just str objects, right? They aren't saga.Url or saga.Directory objects? (If __repr__ == __str__, I may have totally missed this! I should check now. I hope I didn't make a misguided assumption.)

It looks like Pilot stores the various sandbox URLs as ru.Url objects internally, but pilot_sandbox specifically is converted to str when accessed through the property. Internally, it looks like everything is there to get a saga Directory object.

Task.sandbox and Task.pilot_sandbox may produce ru.Url references, but I don't see where the private members are assigned anything other than None.

eirrgang commented 1 year ago

Within the scope of this issue, we should make sure to support a user-provided "label" that can be easily cross-referenced with the local workflow metadata to locate file identifiers in a flexible and user-friendly way.
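One possible shape for such a label lookup, as a rough sketch only. The FileIndex class, its methods, and the identifier strings are hypothetical; the real workflow metadata format is not specified in this issue.

```python
from collections import defaultdict


class FileIndex:
    """Hypothetical in-memory map from user-provided labels to file identifiers."""

    def __init__(self):
        self._by_label = defaultdict(set)

    def add(self, identifier: str, label: str) -> None:
        """Associate a label with a file identifier; labels need not be unique."""
        self._by_label[label].add(identifier)

    def find(self, label: str) -> list:
        """Return all identifiers recorded under *label*, sorted for stable output."""
        # Use .get() so that a lookup does not create an empty entry.
        return sorted(self._by_label.get(label, ()))


index = FileIndex()
index.add("b2f1", "input-structure")
index.add("a9c4", "input-structure")
index.add("77de", "trajectory")
```

Allowing several identifiers per label is a design choice in this sketch; a stricter scheme could reject duplicate labels instead.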