Closed whohensee closed 4 weeks ago
While waiting on my re-hire, I've been thinking this topic is worth considering before we commit to implementing this PR:
In this PR, I added the capability for the pipeline to use more than one 'scorer' object, so the scoring step can run multiple different dl/ml algorithms. However, this required substantial changes to the pipeline's structure: a single pipeline run can now create objects with more than one provenance, and the scoring step has to handle more than one 'scorer' object.
While this works, it still feels redundant to me in the following way: the pipeline's structure already seems to have solved this problem, since it can use provenances to load earlier steps that have already been done. To create multiple different scores from the same data, we could simply run the pipeline once using method ML1 and a second time using method ML2, and no unnecessary work would be duplicated. The downsides are that the scoring configuration would have to move up a level, perhaps to the conductor, and the pipeline would communicate with the database more than strictly necessary. The upside is that it would preserve the linear and (mostly) simple structure of the pipeline, which I suspect may save many headaches down the line.
I'm not fully sure whether those downsides outweigh the upsides, but it's probably worth considering the implications before committing to one strategy.
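To make the "run the pipeline once per method" alternative concrete, here is a minimal sketch of how provenance-keyed caching would keep the second run from redoing earlier steps. The `Pipeline`, `_step`, and step names below are invented for illustration and are not the real pipeline API:

```python
# Hypothetical sketch: a pipeline that caches each step's output under a
# provenance-like key, so a second run with a different scoring method
# reuses everything upstream of scoring. All names here are illustrative.

class Pipeline:
    def __init__(self):
        self._cache = {}          # provenance key -> step output
        self.steps_executed = []  # record of steps that actually ran

    def _step(self, name, params, fn):
        """Run a step only if this (name, params) provenance is new."""
        key = (name, params)
        if key not in self._cache:
            self.steps_executed.append(key)
            self._cache[key] = fn()
        return self._cache[key]

    def run(self, scoring_method):
        data = self._step("preprocess", (), lambda: "preprocessed-data")
        return self._step("score", (scoring_method,),
                          lambda: f"{data}:scored-with-{scoring_method}")

pipe = Pipeline()
pipe.run("ML1")  # executes preprocessing and ML1 scoring
pipe.run("ML2")  # preprocessing is found in the cache; only ML2 scoring runs
```

In this toy version, the only duplicated cost of the two runs is the cache lookup itself, which mirrors the claim that "no unnecessary work would be duplicated" (the real cost would be the extra database round-trips).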
My strategy for implementing this was to create a pipeline object, DeepScore, which is produced by the Scorer object. To give the pipeline the ability to run multiple different dl/ml algorithms, it can accept config parameters for multiple algorithms, each of which creates a Scorer object in the datastore running the pipeline. This required some changes to the provenance tree creation at the start of the pipeline, which I believe I was able to resolve, though not especially elegantly (it breaks slightly from the original design of the pipeline).
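As a sketch of that config-driven fan-out, one Scorer could be built per algorithm entry in the config. The `Scorer` and DeepScore names follow the PR, but the config shape and the `make_scorers` helper are assumptions for illustration only:

```python
# Hypothetical sketch: the config lists several dl/ml algorithms, and the
# pipeline builds one Scorer per entry. The config layout and helper names
# are invented; only the Scorer/DeepScore concepts come from the PR.

class Scorer:
    def __init__(self, algorithm, params):
        self.algorithm = algorithm
        self.params = params

    def score(self, detection):
        # Placeholder: a real Scorer would run its dl/ml model here and
        # produce a DeepScore object for the detection.
        return {"algorithm": self.algorithm, "detection": detection}

def make_scorers(config):
    """Build one Scorer per algorithm entry in the scoring config."""
    return [Scorer(entry["algorithm"], entry.get("params", {}))
            for entry in config["scoring"]["algorithms"]]

config = {"scoring": {"algorithms": [
    {"algorithm": "ML1", "params": {"threshold": 0.6}},
    {"algorithm": "ML2"},
]}}
scorers = make_scorers(config)
results = [s.score("some-detection") for s in scorers]
```

This is where the provenance complication comes from: each Scorer's output carries a different provenance, so one run of the scoring step now yields objects with more than one provenance.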
The information on which model and parameters to use for the DeepScore is stored in the `_algorithm` column, which links to an enum table, so that a single integer points to the model/parameter implementation to use.
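A minimal sketch of that enum-table idea, assuming the stored integer resolves to a model plus its parameters (the `ScoreAlgorithm` names and specs below are invented for illustration):

```python
# Hypothetical sketch of the `_algorithm` enum-table lookup: the single
# integer stored per DeepScore row maps to a model/parameter implementation.
# The enum members and specs are illustrative, not the real definitions.

from enum import IntEnum

class ScoreAlgorithm(IntEnum):
    RANDOM_FOREST = 1
    CNN_V1 = 2

# Enum value -> (model name, parameters) used at scoring time.
ALGORITHM_SPECS = {
    ScoreAlgorithm.RANDOM_FOREST: ("random_forest", {"n_trees": 100}),
    ScoreAlgorithm.CNN_V1: ("cnn", {"checkpoint": "v1"}),
}

def resolve_algorithm(column_value: int):
    """Turn the integer from the _algorithm column into a model spec."""
    return ALGORITHM_SPECS[ScoreAlgorithm(column_value)]

model, params = resolve_algorithm(2)
```

Keeping the mapping in one table/enum means adding a new model/parameter combination is just a new enum member, and existing rows stay valid.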