Allow custom Span model names

dievsky commented 5 years ago

Inspired by #1408.

Preamble: My Span-fitting Snakemake pipeline needs to declare an output file, and there's no pattern that I can specify and be sure that the model name will conform to (thanks to the ID reducer). With peaks, we can specify a peak file name directly. With the model, not so much, yet the model files are exposed to the user too (we encourage viewing them in JBR).

Thesis: I propose adding an optional command line argument, --model, specifying a path to a model file. A relative path is interpreted as relative to the standard fit directory. Span then uses the provided path to save and load the model. If no --model is provided, well, bring on the ID reducer. If the model exists, the treatment and control paths are only checked against the model info.

Benefits:

The model file is discoverable by automated pipelines.
The user can call peaks from the command line with only the model file present. Currently it's perfectly possible in JBR, but utterly impossible in the command line interface.

Disadvantages:

One more optional command line argument (but a fully backwards-compatible one).
Slightly enhanced foot-shooting abilities: what if the user accidentally names two models the same and gets confused as to which one is loaded? That's why we need the treatment and control path check. Span would refuse to load the model if the paths didn't match.
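
To make the proposed behaviour concrete, here is a minimal Python sketch of the path resolution and the treatment/control check. It is an illustration only, not Span's code: the function names, the `default_name` parameter and the way the stored paths are represented are all assumptions.

```python
from pathlib import Path
from typing import Optional, Sequence


def resolve_model_path(model_arg: Optional[str], fit_dir: Path, default_name: str) -> Path:
    """Choose where the model file lives (hypothetical helper)."""
    if model_arg is None:
        # No --model given: fall back to the default, ID-reducer-based name.
        return fit_dir / default_name
    path = Path(model_arg)
    # A relative path is interpreted relative to the standard fit directory.
    return path if path.is_absolute() else fit_dir / path


def check_inputs_match(stored_paths: Sequence[str], given_paths: Sequence[str]) -> None:
    """Refuse to reuse a model fitted on different treatment/control files."""
    if list(stored_paths) != list(given_paths):
        raise ValueError(
            f"model was fitted on {list(stored_paths)}, "
            f"but {list(given_paths)} were provided"
        )
```

With this shape, a pipeline can declare `fit_dir / "my_model.span"` as a rule output up front, and an accidental name clash between two models surfaces as an error instead of a silently reused model.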

dievsky commented 5 years ago

As customary, I'll wait some time for opinions.

olegs commented 5 years ago

I'd rather stay with the default strategy after fixing #1408. In that case the model name will be composed of the track and control names, and it will work without any additional parameters.

dievsky commented 5 years ago

As I mentioned, the proposed approach would be completely optional and backwards-compatible. If the user doesn't need a discoverable model, they skip specifying the argument and get the default-named model file. So Span would continue to work without any additional parameters, while still allowing the user to discover the model file if needed, and allowing reuse of model files to call and tune peaks via command line.

dievsky commented 5 years ago

After #1408, the model name is even less discoverable than before, since now there are more reducing options. This poses a problem for computational pipelines which treat the model file as (intermediate) output. The only current solutions are:

parse the log output and try to extract the model file from there;
craft a file name pattern and hope that other models don't interfere;
launch in a shadow directory and rename the only .span file in the cache.
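
As a rough illustration of the third workaround, a pipeline step could run Span in an isolated directory and then move the single resulting model to a predictable name. The helper below is a sketch under assumptions: the function name and the directory arguments are hypothetical, and only the .span extension comes from this thread.

```python
import shutil
from pathlib import Path


def move_single_model(cache_dir: Path, target: Path) -> Path:
    """Rename the only .span file found in cache_dir to a known target path."""
    models = sorted(cache_dir.glob("*.span"))
    if len(models) != 1:
        # The workaround only holds if exactly one model was produced.
        raise RuntimeError(
            f"expected exactly one .span file in {cache_dir}, found {len(models)}"
        )
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(models[0]), str(target))
    return target
```

The explicit count check matters: if another model ends up in the same cache, the step fails loudly rather than picking an arbitrary file.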

olegs commented 5 years ago

Can you please provide an example where naming became worse after #1408?

olegs commented 5 years ago

Also, there are no pipelines dealing with models directly (except model tuning in JBR).

dievsky commented 5 years ago

My own scATAC-seq pipeline deals with models directly. :))

dievsky commented 5 years ago

The discoverability problem is solved by pull request #3. I'd still like to make peak calling possible without treatment files if the model is provided (currently they're still required arguments), but that can wait until a later release.

olegs commented 5 years ago

Already implemented as of https://github.com/JetBrains-Research/span/issues/9

Allow custom Span model names #6