PyWorkflowApp / visual-programming

A Python Visual Programming Workspace for Data Science
MIT License
33 stars 12 forks source link

Generalize file up/download; use file-system prefix instead of FileSystemStorage #47

Closed reelmatt closed 4 years ago

reelmatt commented 4 years ago

This PR at least partially addresses some of the bugs @reddigari summarized in #43 dealing with file naming/access.

~Workflow still relies on Django's FileSystemStorage API, but is now passed in to the constructor rather than being defined in multiple places: pyworkflow/node.py and workflow/node views.py. Most file operations are still handled in the Workflow class, but the create_node method can now do the following:~

json_data["options"][field] = request.pyworkflow.fs.path(opt_value)

when a user uploads a file during node configuration, or when a download is triggered from WriteCsv. If a user enters test.csv for the output file, when the update endpoint is triggered, this is converted to /tmp/test.csv using the Workflow's file system.

This helps solve the "giant security problem" of downloads, if we switch the endpoint to POST with the Node's info. It's still hard-coded for the WriteCsvNode because it looks up the Node's path_or_buf option where the output filename is stored, but this does prevent arbitrary downloads.

~The store_node_data and retrieve_node_data static methods in Workflow haven't been updated but I think (?) these can be done easily if this approach seems good.~

reelmatt commented 4 years ago

One of the ideas I had behind the intermediate classes like IONode and ManipulationNode was it could be useful to store information common to any node of that type (e.g. 'files' for IONodes, and ¯_(ツ)_/¯ for ManipulationNodes). ReadCsv would then treat that 'file' as input and WriteCsv would treat it as output, but both could be read/accessed similarly. And this could avoid having to specify 'input_file' or 'output_file' when configuring the Node; you could just have 'file' and then the concrete Node implementation would figure it out as long as some value is provided.

Likewise, this could probably help when creating nodes too. I think we could avoid checking the OPTION_TYPES completely à la

for field, info in json_data.get("option_types", dict()).items():
    if info["type"] == "file" or info["name"] == "Filename":
        ...

and instead just check DEFAULT_OPTIONS if we knew any file/filepath is stored in a generic "file" attribute.

json_data["options"]["file"]
reelmatt commented 4 years ago

This latest commit—51c4ee5— does change file handling a bit more than the initial PR which moved FileSystemStorage solely to pyworkflow/workflow.py. The API is now not used, an instead settings.MEDIA_ROOT is passed in to the Workflow constructor and regular Python os calls are used.

I think this better reflects the discussion from #43, but is obviously a bigger change with potentially big implications. If this deviates too far from what others are thinking, we can always revert to before entirely, or c664947 from earlier in this PR.