Open · Vineet-Sharma29 opened 3 years ago
Parameters would probably be the best way to solve this. You could define a parameter called `filename`, which you could then pass to your flow using `--filename` (as an example). I am not sure I understand why you think parameters would not be a good fit in this case. You could also use `IncludeFile` as an alternative (it would include the actual file and save it).
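For reference, a minimal sketch of both suggestions; the flow name, step bodies, and run commands are illustrative, not from the original thread:

```python
from metaflow import FlowSpec, IncludeFile, Parameter, step

class ProcessFileFlow(FlowSpec):
    # Passed on the command line, e.g.:
    #   python flow.py run --filename my_doc.txt
    filename = Parameter("filename", help="Path of the file to process")

    # Alternative: IncludeFile reads the file when the run starts and
    # versions its contents along with the run, e.g.:
    #   python flow.py run --document my_doc.txt
    document = IncludeFile("document", help="Contents of the file to process")

    @step
    def start(self):
        print("processing", self.filename)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    ProcessFileFlow()
```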
Thanks for replying @romain-intel . In this case, I have to run multiple instances of the pipeline over a list of files in a directory, which I am doing programmatically, passing each filename in turn. I think parameters need to be passed via the CLI, which isn't feasible when there are many files.
I see. Another possible approach would be to pass the name of the directory as a parameter and then do a foreach over the files inside it (so each filename would be passed as a regular input). I will see if I can think of another solution as well. `FlowSpec` only has the `use_cli` argument (defaults to True), so you can try implementing it that way too, but I am not sure it will fully work.
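A rough sketch of the directory-as-parameter approach, assuming the files live on a locally readable disk (all names here are placeholders):

```python
import os

from metaflow import FlowSpec, Parameter, step

class DirectoryFlow(FlowSpec):
    directory = Parameter("directory", help="Directory containing the input files")

    @step
    def start(self):
        # Fan out one foreach branch per file in the directory.
        self.files = [
            os.path.join(self.directory, name)
            for name in os.listdir(self.directory)
        ]
        self.next(self.process, foreach="files")

    @step
    def process(self):
        # Inside a foreach branch, self.input is the current filename.
        self.filename = self.input
        self.next(self.join)

    @step
    def join(self, inputs):
        # Foreach branches must be joined before the flow can end.
        self.filenames = [inp.filename for inp in inputs]
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    DirectoryFlow()
```

This keeps the CLI surface to a single argument, e.g. `python flow.py run --directory ./docs`.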
@romain-intel I think using `foreach` won't fit my case. I have a pipeline like this:
```
A --+--> B --+--> D --+--> F --+
    |        |        +--> G --+--> (join) --+
    |        +--> E -------------------------+--> H
    +--> C ----------------------------------+
```
Foreach requires joining all the branches, whereas in this case, if I spawn task A (the extract-doc task) over a list of items (filenames), it still branches into further tasks like B and C.
Even handling the `use_cli` argument error that happened during the `join` operation won't help. I tried simplifying my workflow to A -> B -> H, where H is a load-to-text-file task, so there is no branch or merge at all. The code still runs only for the first item in the list rather than over all the files, since Metaflow at present does not have DAG scheduler support.
@vineetsharma883 Can you elaborate on the DAG scheduler support? There is an integration with AWS Step Functions and another one on deck with Argo.
Regarding text processing over multiple files: you can choose whether to do a foreach over all files, use a different grouping strategy, or just process the entire directory in one shot. In your case, you can process all the files in step A, or you can do a foreach just for step A and have a corresponding join before again forking into B and C (see the sketch below).
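A sketch of that second option, reusing the step names from the diagram above; the processing logic is a placeholder:

```python
import os

from metaflow import FlowSpec, Parameter, step

class TextPipelineFlow(FlowSpec):
    directory = Parameter("directory", help="Directory of files to process")

    @step
    def start(self):
        self.files = [
            os.path.join(self.directory, name)
            for name in os.listdir(self.directory)
        ]
        self.next(self.extract, foreach="files")

    @step
    def extract(self):
        # Step A as a foreach: one branch per file; self.input is the filename.
        with open(self.input) as f:
            self.doc = f.read()
        self.next(self.join_extract)

    @step
    def join_extract(self, inputs):
        # Join the foreach before forking again into B and C.
        self.docs = [inp.doc for inp in inputs]
        self.next(self.b, self.c)

    @step
    def b(self):
        # Step B over the full document set (placeholder work).
        self.b_result = [len(doc) for doc in self.docs]
        self.next(self.join_bc)

    @step
    def c(self):
        # Step C over the full document set (placeholder work).
        self.c_result = [doc.lower() for doc in self.docs]
        self.next(self.join_bc)

    @step
    def join_bc(self, inputs):
        # A static join can pick artifacts off each named branch.
        self.b_result = inputs.b.b_result
        self.c_result = inputs.c.c_result
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    TextPipelineFlow()
```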
I am trying to implement a text processing pipeline. Since I need to run it over a list of files, I am using the `__init__` method as follows:

[code snippet not preserved]

However, on running the workflow I am getting the following error:

[error output not preserved]

I saw a similar issue here, but in my case I don't think using parameters would help.
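(The original snippet and traceback were not preserved above. The pattern being described is presumably along these lines; this is a hypothetical reconstruction, not the author's code. `FlowSpec.__init__` only accepts `use_cli`, so a constructor taking a filename has to be driven from a script loop, and, as reported above, only the first file ends up being processed.)

```python
from metaflow import FlowSpec, step

class TextProcessingFlow(FlowSpec):
    # Hypothetical reconstruction of the pattern described above.
    def __init__(self, filename):
        self.filename = filename
        # FlowSpec.__init__ parses sys.argv and runs the flow when
        # use_cli is True (the default).
        super().__init__(use_cli=True)

    @step
    def start(self):
        print("processing", self.filename)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == "__main__":
    # Driving multiple runs from one process; as reported above, this
    # ends up processing only the first file.
    for name in ["a.txt", "b.txt"]:
        TextProcessingFlow(name)
```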