d6t / d6tflow

Python library for building highly effective data science workflows
https://d6tflow.readthedocs.io/en/latest/
MIT License

Unable to reuse Task in different file #3

Closed (festeh closed 5 years ago)

festeh commented 5 years ago

Hello!

Suppose that I created a TaskData instance, executed it and saved its outputs. Then I'd like to reuse its results in some other file (say, a Jupyter Notebook to analyze the results) by creating an instance of that task there and calling output and then load. But I would be unable to do so, as the data directory won't be found there, so I'd have to re-execute the task.

My suggestion: use an absolute path in the output method. Instead of settings.dirpath (simply data), use something like pathlib.Path(__file__).parent / "data". That would make the results accessible from anywhere.
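To illustrate what I mean, here is a minimal stdlib-only sketch, not d6tflow code. The function names and the tasks.py / notebooks layout are hypothetical, made up for this example. It shows that a relative "data" path depends on the current working directory, while a path anchored to a source file does not.

```python
import os
import tempfile
from pathlib import Path


def relative_data_dir() -> Path:
    # A bare relative path is resolved against the current working directory.
    return Path("data").resolve()


def anchored_data_dir(module_file: str) -> Path:
    # Anchoring to a source file gives the same location from any working directory.
    return Path(module_file).resolve().parent / "data"


def demo() -> dict:
    """Compare both strategies from two different working directories."""
    original_cwd = Path.cwd()
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        project = Path(tmp) / "project"      # hypothetical project layout
        notebooks = project / "notebooks"
        notebooks.mkdir(parents=True)
        module_file = project / "tasks.py"   # hypothetical file defining the tasks
        module_file.touch()
        try:
            for label, cwd in (("project", project), ("notebooks", notebooks)):
                os.chdir(cwd)
                results[label] = {
                    "relative": relative_data_dir(),
                    "anchored": anchored_data_dir(str(module_file)),
                }
        finally:
            # Restore the working directory before the temp dir is removed.
            os.chdir(original_cwd)
    return results
```

Calling demo() returns a different "relative" path for each working directory and the same "anchored" path for both.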

What do you think?

d6tdev commented 5 years ago

Thanks for raising the issue! I looked into this: the cause was that TaskPreprocess was defined as TaskPreprocess(d6tflow.tasks.TaskCachedPandas), which keeps its output in memory only. If you open it somewhere else, the task is incomplete because nothing was saved to disk, so you would have to run the task again to load it into memory. It's an advanced feature, but it's confusing for the example, so I changed it to TaskPreprocess(d6tflow.tasks.TaskPqPandas). Try again, it should work now: https://github.com/d6t/d6tflow/blob/master/docs/example-ml.md
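As a rough illustration of the difference, here is a stdlib-only sketch. MemoryTarget, DiskTarget and simulate_two_sessions are hypothetical stand-ins written for this comment, not d6tflow classes, and the pickle file is just an assumption for the demo (TaskPqPandas itself writes parquet). The idea is only that a memory-backed output disappears when a new process starts, while a file-backed one is still there.

```python
import pickle
from pathlib import Path


class MemoryTarget:
    """Hypothetical stand-in for an in-memory task output."""

    def __init__(self, cache: dict, key: str):
        self.cache = cache  # lives only as long as the owning process
        self.key = key

    def exists(self) -> bool:
        return self.key in self.cache

    def save(self, value) -> None:
        self.cache[self.key] = value

    def load(self):
        return self.cache[self.key]


class DiskTarget:
    """Hypothetical stand-in for a file-backed task output."""

    def __init__(self, path: Path):
        self.path = path

    def exists(self) -> bool:
        return self.path.exists()

    def save(self, value) -> None:
        self.path.parent.mkdir(parents=True, exist_ok=True)
        with self.path.open("wb") as fh:
            pickle.dump(value, fh)

    def load(self):
        with self.path.open("rb") as fh:
            return pickle.load(fh)


def simulate_two_sessions(data_dir: Path) -> dict:
    # Session 1: the script that runs the task and saves its output.
    session1_cache = {}
    MemoryTarget(session1_cache, "TaskPreprocess").save([1, 2, 3])
    DiskTarget(data_dir / "TaskPreprocess.pkl").save([1, 2, 3])

    # Session 2: a notebook opened later. A new process starts with an empty cache.
    session2_cache = {}
    disk = DiskTarget(data_dir / "TaskPreprocess.pkl")
    return {
        "memory_complete": MemoryTarget(session2_cache, "TaskPreprocess").exists(),
        "disk_complete": disk.exists(),
        "disk_value": disk.load(),
    }
```

In the second session only the file-backed output counts as complete, which matches the behaviour described above: the cached task has to be re-run, the one saved to disk can be loaded directly.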

I also added a notebook example to the template repo, which you can clone and run. See https://github.com/d6t/d6tflow-template/blob/master/visualize.ipynb

Lastly, you can easily access task output across multiple places using d6tpipe.

But neither d6tpipe nor an absolute path would have fixed the problem of the task output being in memory only. Closing for now; reopen if you are still having issues.