PyWorkflowApp / visual-programming

A Python Visual Programming Workspace for Data Science
MIT License

Unable to execute a saved workflow #55

Open diegostruk opened 4 years ago

diegostruk commented 4 years ago

When loading and executing a saved workflow that contains ReadCsv nodes, the workflow cannot be executed if the generated CSV (the file created when a file is uploaded during workflow creation) no longer exists.

To reproduce:

When the workflow is saved, the persisted JSON contains references to the generated files. I came across this issue when working in the CLI and then realized that the same thing happens in the UI.

While this is not a blocker, I wanted to document this issue somewhere so that we can have a discussion.

reddigari commented 4 years ago

Someone asked us about this during the presentation, suggesting that the downloaded workflow could include a zip archive of any input data. I'm open to this, but it could result in enormous downloads. The alternative I proposed to whoever asked was that when you load the workflow, the server checks whether all necessary input files exist, and alerts the user to upload ones that do not. But I'd bet this is harder than it sounds to get done quickly.
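A minimal sketch of the load-time check described above, for discussion only. It assumes the workflow JSON stores each input path as a string under a key named `file`; that key name and the function name `find_missing_inputs` are hypothetical and not taken from the project's actual schema.

```python
import json
from pathlib import Path


def find_missing_inputs(workflow_path, file_keys=("file",)):
    """Return the file paths referenced by a workflow JSON that do not exist.

    ASSUMPTION: input paths are stored as strings under one of `file_keys`.
    The real workflow schema may use different key names.
    """
    with open(workflow_path) as fh:
        workflow = json.load(fh)

    missing = []

    def walk(obj):
        # Recursively visit every dict and list in the JSON document.
        if isinstance(obj, dict):
            for key, value in obj.items():
                if key in file_keys and isinstance(value, str):
                    if not Path(value).exists():
                        missing.append(value)
                else:
                    walk(value)
        elif isinstance(obj, list):
            for item in obj:
                walk(item)

    walk(workflow)
    return missing
```

The server could call something like this when a workflow is loaded and ask the user to re-upload whatever comes back in the list.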

diegostruk commented 4 years ago

Yes, that's a feasible solution. However, as you mention, it will probably be a bit tricky to implement in a short time. Could this be done at the node execution level instead? Something like: if, when executed, the ReadCsv node finds that the file is not there, it checks whether there is a file matching at least the name.csv portion and, if so, tries to copy that file under the expected name (uuid-name.csv). This might also be tricky, but it's something else to think about. Curious to hear whether the rest of the team has any other solutions in mind.
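A rough sketch of that execution-level fallback, under stated assumptions: the generated file is named `<uuid>-<name>.csv` with a standard hyphenated UUID, and the caller supplies the directories to search. The names `resolve_input` and `search_dirs` are hypothetical and not part of the existing code.

```python
import re
import shutil
from pathlib import Path

# ASSUMPTION: generated files are named "<uuid>-<original name>", where the
# UUID is in the standard 8-4-4-4-12 hexadecimal form.
UUID_PREFIX = re.compile(
    r"^[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}-(?P<name>.+)$"
)


def resolve_input(expected_path, search_dirs):
    """Return a usable path for a node's input file.

    If the generated file is gone, look for a file with the original name
    in `search_dirs` and copy it to the expected location.
    """
    expected = Path(expected_path)
    if expected.exists():
        return expected

    match = UUID_PREFIX.match(expected.name)
    if match is None:
        raise FileNotFoundError(f"{expected} is missing and has no UUID prefix")

    original_name = match.group("name")
    for directory in search_dirs:
        candidate = Path(directory) / original_name
        if candidate.is_file():
            # Recreate the target directory in case /tmp was cleared.
            expected.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy(candidate, expected)
            return expected

    raise FileNotFoundError(
        f"{expected} is missing and no '{original_name}' was found"
    )
```

If no candidate is found, the error can still be surfaced to the user as a "missing file" message, so this would sit in front of the load-time check rather than replace it.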

reddigari commented 4 years ago

If you're loading a saved workflow, then the node will have the same UUID. I think this only happens if /tmp got cleared (i.e. the server container got restarted).

diegostruk commented 4 years ago

Exactly, the problem is: I work on a workflow, save it (download it), container gets restarted. Then I try to start the container again and run my workflow. I won't be able to :(

reddigari commented 4 years ago

It might not actually be too hard:

What do you think?

diegostruk commented 4 years ago

I think that will work fine for that use case. For a CLI, though, we might be out of luck at that point. That's why I also mentioned the execution-level solution as something to try before letting the user know that a file is missing.

reddigari commented 4 years ago

Ahhh I see, that's a good idea. I can't remember if we made any decisions on Monday about passing in data files from the CLI? I know reading data from stdin is a requirement, but I could also imagine something like $pyworkflow --file1 mydata.csv --file2 otherdata.csv myworkflow.json
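One way the argument parsing for such a command could look, sketched with `argparse`. This swaps the numbered `--file1`/`--file2` flags for a single repeatable `--file` option, which is an assumption made for illustration and not a decision from the thread; `build_parser` is a hypothetical name as well.

```python
import argparse


def build_parser():
    """Build a parser for: pyworkflow [--file PATH ...] WORKFLOW_JSON."""
    parser = argparse.ArgumentParser(prog="pyworkflow")
    # ASSUMPTION: a repeatable --file flag instead of --file1, --file2, ...
    parser.add_argument(
        "--file",
        dest="files",
        action="append",
        default=[],
        help="input data file; may be given more than once",
    )
    parser.add_argument("workflow", help="path to the saved workflow JSON")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.workflow, args.files)
```

The files would then arrive in the order given on the command line, which still leaves open how each one is matched to a particular ReadCsv node.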

diegostruk commented 4 years ago

Exactly. We didn't really discuss passing data from stdin, but that could be a solution as well. We might have to think a bit more about how all that would work (haven't gotten that far yet!). For now, all I was implementing was running a workflow saved in a .json file, like $pyworkflow --workflow-location /tmp/a-workflow.json execute