fulcrumgenomics / dagr

A scala based DSL and framework for writing and executing bioinformatics pipelines as Directed Acyclic GRaphs
MIT License
69 stars 14 forks source link

New task for SortBam. #312

Closed tfenne closed 7 years ago

tfenne commented 7 years ago

I'm curious what your thoughts are @nh13 on how to handle the sort order param. Dagr doesn't depend on fgbio currently, and I don't think we really want it to either. So the enum isn't available.

The options as I see it are:

  1. Take a String for sort order, which means that it'll always take all the values in fgbio (even if they get added to), but will also accept incorrect values
  2. Make an enum here, but then if a sort order gets added in fgbio it has to get added here too
nh13 commented 7 years ago

I see two additional alternatives to give us access to the enum:

  1. Create a new project to store type and enum definitions. The project could be in a stand-alone repo (bio-commons), or we could make fgbio a multi-project build. For the latter, we'd publish two artifacts, and dagr need only depend on the "standards/api" repo. I assume that rarely will there be a case where we'd add to the fgbio "standards/api" without having a need for it in fgbio, although maybe things in Picard or htsjdk need to be exposed, though dagr already depends on those, so maybe not?
  2. Add the enums to the commons project. I don't like this idea since commons right now has no code that is bioinformatics specific.

I am not in favor of (2), since then we risk getting out-of-sync and the usual code-in-two-place headaches.

Currently for fgbio we are doing (1), but for Picard and htsjdk, we are depending on them, so I would be curious to hear more about your thoughts on why we would not want to depend on fgbio in the dagr tasks project, since "in for a penny (htsjdk/Picard), in for a pound (fgbio)". From my perspective it is (1) or (3).

tfenne commented 7 years ago

Re: the enums and stuff. I really dislike option (3) - that last thing I want is yet more projects and/or artifacts! I agree with you on (4) that commons is really common scala code, and not anything bioinformatics-specific. Any kind of separation like that is going to make working on fgbio more challenging, as e.g. adding a new sort would require extending the enum in one project, updating dependencies, then updating the other project. Not to mention we sometimes tend to use enums as factories too, which wouldn't be possible.

Looking back I'm actually thinking that including Picard as a dependency was a mistake. And given how many more dependencies htsjdk has acquired over the last year, I'm starting to feel that way there too.

Maybe the compromise position is:

  1. Use String or Option[String]
  2. Have non-enum constants in dagr-tasks so that it's easy to get the right value, but not impossible to use a value that's not enumerated?
nh13 commented 7 years ago

@tfenne lets continue with using strings and option-strings for now. We can clean up the dependencies latter, and decide on a better approach. It did get me thinking.