cgat-developers / ruffus

CGAT-ruffus is a lightweight python module for running computational pipelines
MIT License
173 stars 34 forks source link

[architecture] Specify pipeline in task_decorator #111

Open shouldsee opened 5 years ago

shouldsee commented 5 years ago

There is no reason to hard code the Pipeline to be main_pipeline. Can we please fix this?

https://github.com/cgat-developers/ruffus/blob/67d2f12587384dc95496506d7025cdf27a467cd7/ruffus/task.py#L344

Acribbs commented 5 years ago

Sorry I'm new to looking at the code in this repository. To me calling the class and naming it main_pipeline at the beginning makes sense. Are you able to elaborate on your issue so I can understand it further? Do you suggest that the Pipeline class isn't declared as a global variable?

shouldsee commented 5 years ago

@Acribbs This concerns class vs. instances. I am perfectly happy to define an instance as a global variable, but I don't see attaching a global dictionary to class definition as appropriate. If you want a global dictionray, why don't make it ruffus.main_pipeline={}

Acribbs commented 5 years ago

Ah I see, interpreted your issue wrongly, thanks for the further explanation. What do you think @AndreasHeger ?

shouldsee commented 5 years ago

I think I am confusing Pipeline.pipelines with the original issue. Will update once I get to laptop.

Sent from my iPhone

On 23 Mar 2019, at 21:55, Adam Cribbs notifications@github.com<mailto:notifications@github.com> wrote:

Ah I see, interpreted your issue wrongly, thanks for the further explanation. What do you think @AndreasHegerhttps://github.com/AndreasHeger ?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/cgat-developers/ruffus/issues/111#issuecomment-475908006, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AHqXbGQez7nwkJa2mU4e2G0L8UL3GnZxks5vZqLVgaJpZM4b_ViU.

Acribbs commented 5 years ago

ok no worries, I'm looking at the code and a little confused. But iv only really started to understand ruffus recently so wasn't surprised, but if you say there may be a confusion then I will wait for your comments. thanks

shouldsee commented 5 years ago

Expanded snippet:

class task_decorator(object):

    """
        Forwards to functions within Task
    """

    def __init__(self, *decoratorArgs, **decoratorNamedArgs):
        """
            saves decorator arguments
        """
        self.args = decoratorArgs
        self.named_args = decoratorNamedArgs

    def __call__(self, task_func):
        """
            calls func in task with the same name as the class
        """
        # add task to main pipeline
        # check for duplicate tasks inside _create_task
        task = main_pipeline._create_task(task_func)

So we are talking about class task_decorator here. Say I instantiated a decorator by dec = task_decorate() and used it to decorate f = lambda x: 'hello', with g = dec(f). I would find myself unable to use anything other than main_pipeline. So if I created another pipeline pp1 = Pipeline(name='pp1') , there would be no way of using pp1 instead of main_pipeline.

Pipeline should be configurable at some step, either __init__ or __call__. A simplest edition would be

    def __call__(self, task_func, pipeline = None):
        """
            calls func in task with the same name as the class
        """
        if pipeline is None:
            pipeline = main_pipeline
        # add task to main pipeline
        # check for duplicate tasks inside _create_task
        task = pipeline._create_task(task_func)

Or

    def __init__(self, pipeline= None, *decoratorArgs, **decoratorNamedArgs):
        """
            saves decorator arguments
        """
        self.args = decoratorArgs
        self.named_args = decoratorNamedArgs
        if pipeline is None:
            pipeline = main_pipeline
        self.pipeline = pipeline
    def __call__(self, task_func,):
        """
            calls func in task with the same name as the class
        """
        # add task to main pipeline
        # check for duplicate tasks inside _create_task
        task = self.pipeline._create_task(task_func)        
IanSudbery commented 5 years ago

This is almost certainly an incompatibility of the two modes of ruffus - the functional paradigm and the object oriented one.

Certainly in the functional paradigm (which is the one that I think most people use), there is no concept of main and non-main pipelines - a pipeline is a file. I don't know how many people use the object oriented interface, but I think part of the reason for it was to allow self contained sub pipelines.

If we wanted to allow this we should use the second format, because the once decorated, functions are never directly called.

However, having the pipeline as the first arguement would break all existing pipelines, so I suggest:

    def __init__(self, *decoratorArgs, **decoratorNamedArgs):
        """
            saves decorator arguments
        """
        self.args = decoratorArgs
        self.named_args = decoratorNamedArgs
        if pipeline in decoratorNameArgs:
            self.pipeline = decoratorNamedArgs['pipeline']
        else:
            self.pipeline = main_pipeline

    def __call__(self, task_func,):
        """
            calls func in task with the same name as the class
        """
        # add task to main pipeline
        # check for duplicate tasks inside _create_task
        task = self.pipeline._create_task(task_func)  

But it would need to be well documented somewhere.

shouldsee commented 5 years ago

This is almost certainly an incompatibility of the two modes of ruffus - the functional paradigm and the object oriented one.

Certainly in the functional paradigm (which is the one that I think most people use), there is no concept of main and non-main pipelines - a pipeline is a file. I don't know how many people use the object oriented interface, but I think part of the reason for it was to allow self contained sub pipelines. ...

@IanSudbery

I am not very familiar with how ruffus supposed to be used in general, but I would quite like to realise the functionality in my toy example, bascially factoring out the "pipeline building" from the "job mapping" as two separate processes.

My guess is that current architecture must have been adopted to simplify certain syntax that was commonly re-used, but might quite unfriendly to maintainer at my first glance... Apology if my guess is too wild here.

In terms of code, your proposal would trigger Name 'pipeline' not found, but we can use instead the dict.get method:

   def __init__(self, *decoratorArgs, **decoratorNamedArgs):
       """
           saves decorator arguments
       """
       self.args = decoratorArgs
       self.named_args = decoratorNamedArgs
       self.pipeline= decoratorNameArgs.get('pipeline', main_pipeline)

   def __call__(self, task_func,):
       """
           calls func in task with the same name as the class
       """
       # add task to main pipeline
       # check for duplicate tasks inside _create_task
       task = self.pipeline._create_task(task_func)  
IanSudbery commented 5 years ago

I am not very familiar with how ruffus supposed to be used in general,

I'm not really sure about "supposed" to be used, only how i've seen people use it. Must people I know that use ruffus, would use the syntax in simple.py here.

If we are ever keen to seperate business logic from pipeline flow we do as in seperate_logic.py here.

Personally I think this syntax is less abstract, cleaner, more readable and more straight forwardly pythonic, without the nested function calls etc. For me this concreteness, and pure python approach is why I prefer ruffus to snakemake, but I admit this is purely a matter of personal taste

I see no reason not to enable separate pipelines to be specified, although I didn't think it was necessary to what you wanted to do. Mind you, my mind has always shruken away whenever I've tried to look too hard under the hood of ruffus.

The code I posted shouldn't throw NameError because it checks for it first, but i agree your suggestion is equivalent but cleaner.

shouldsee commented 5 years ago

Dear Ian,

I was only trying to point out that you forgot to wrap your “pipeline” in the proposal so that intepreter would be looking up the variable rather checking the “in” statement by this is kind of a minor, uninteresting point. Also note that you have a typo in the second decorator in “separate_logic.py”

Thanks for your thoughtful reply. I totally agreed that the choice of syntax would be more of a taste than anything. I am struggling a bit to see how separate_logic.py separates the logic: the data flow through functions is still defined with decorators and we are just creating an extra layer with the same signature here. It’s true that we can save run_bwa() in a separate file though and we could call multiple functions within a transformed function. But then the suffix signature would not be associated with the inner run_bwa() call but with the outer call only. Surely we want to save the suffix pattern with the reused pipeline (or not?).

I guess the ultimate reason for my reluctance is due to the decorator really, meaning I’m just not very comfortable to see something hanging above my function.... But once I started thinking about it, it actually makes some sense. Decorators here are actually not FOP and that was why I am confused. The decorator is defining a literal transformation in the data, whereas function defines the contextual change. Decorators dictate how names change whereas functions define how data change. It might be simpler to expand two changes into individual functions rather than confining name changes with a pre-designed syntax. But this means ruffus is trying to provide a grammar on how name changes should be implemented.

I am curious on how you would write a pipeline step that requires two argument rather than one though.

Sent from my iPhone

On 25 Mar 2019, at 22:37, Ian Sudbery notifications@github.com<mailto:notifications@github.com> wrote:

I am not very familiar with how ruffus supposed to be used in general,

I'm not really sure about "supposed" to be used, only how i've seen people use it. Must people I know that use ruffus, would use the syntax in simple.py herehttps://gist.github.com/IanSudbery/cc274d97d38fd88df34ca392b7a0d859#file-simple-py.

If we are ever keen to seperate business logic from pipeline flow we do as in seperate_logic.py [here[(https://gist.github.com/IanSudbery/cc274d97d38fd88df34ca392b7a0d859#file-seperate_logic-py).

Personally I think this syntax is less abstract, cleaner, more readable and more straight forwardly pythonic, without the nested function calls etc. For me this concreteness, and pure python approach is why I prefer ruffus to snakemake, but I admit this is purely a matter of personal taste

I see no reason not to enable separate pipelines to be specified, although I didn't think it was necessary to what you wanted to do. Mind you, my mind has always shruken away whenever I've tried to look too hard under the hood of ruffus.

The code I posted shouldn't throw NameError because it checks for it first, but i agree your suggestion is equivalent but cleaner.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/cgat-developers/ruffus/issues/111#issuecomment-476404332, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AHqXbCsSeEdQvGQltKNlhRVrymUrhbDMks5vaU-igaJpZM4b_ViU.