Use config file for TPS workflow inputs

agitter commented 4 years ago

We discussed using a config or properties file to track all of the input files and settings that a user needs to specify. That would greatly reduce the number of command line arguments needed.

A YAML file could be one option. We should look at what other modern software uses.

agitter commented 4 years ago

This blog post gives a thorough overview of different file types and associated parsers that are often used for config files: https://hackersandslackers.com/simplify-your-python-projects-configuration/

Using YAML as an example, we could have a config file like

tps:
  network: data/networks/input-network.tsv
  timeseries: data/timeseries/median-time-series.tsv
  firstscores: data/timeseries/p-values-first.tsv
  ...

cytoscape: /home/seluser/cytoscape/start.sh

agitter commented 4 years ago

Snakemake uses a YAML or JSON config file: https://snakemake.readthedocs.io/en/stable/snakefiles/configuration.html#standard-configuration

If we use YAML, it would make it easier to switch from a script-driven workflow to a Snakemake workflow later.
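For illustration, here is a minimal sketch of reading such a config in Python. It uses the JSON form and the standard library `json` module so that it runs without extra dependencies. The paths are copied from the YAML example above, and `load_config` is a hypothetical helper name, not something in the repository.

```python
import json

# JSON equivalent of the YAML example; the elided entries are left out.
CONFIG_TEXT = """
{
  "tps": {
    "network": "data/networks/input-network.tsv",
    "timeseries": "data/timeseries/median-time-series.tsv",
    "firstscores": "data/timeseries/p-values-first.tsv"
  },
  "cytoscape": "/home/seluser/cytoscape/start.sh"
}
"""


def load_config(text):
    """Parse the config text into nested dictionaries (hypothetical helper)."""
    # With PyYAML installed, yaml.safe_load(text) would play the same role for YAML.
    return json.loads(text)


config = load_config(CONFIG_TEXT)
print(config["tps"]["network"])
```

Either format yields the same nested dictionary, so the rest of the workflow would not need to know which one the user wrote.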

ajshedivy commented 4 years ago

Here is a list of all of the parameters and files for each stage:

TPS:

--network <file>: Input network file in TSV format, where each row defines an undirected edge.
--timeseries <file>: Input time series file in TSV format. The first line defines the time point labels, and each subsequent line corresponds to one time series profile.
--firstscores <file>: Input file that contains significance scores for each time point of a profile (except the first time point), with respect to the first time point of the profile.
--prevscores <file>: Similar to --firstscores, an input file that gives significance scores for each time point (except the first one), with respect to the previous time point.
--source <value>: Identifier for the network source node. Multiple source nodes can be provided by repeating the argument multiple times. For example, --source <node1> --source <node2> --source <node3>.
--threshold <value>: Threshold value for significance scores, above which measurements are considered non-significant.
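As a rough sketch of how a config section could be turned back into these TPS arguments, the function below flattens a dictionary into an argument list and repeats a flag when its value is a list, as `--source` requires. The function name `build_tps_args`, the node identifiers, and the threshold value are placeholders for illustration only.

```python
def build_tps_args(tps_config):
    """Turn a dict of TPS settings into a flat list of command line arguments."""
    args = []
    for key, value in tps_config.items():
        # A list value (e.g. several source nodes) repeats the flag once per item.
        values = value if isinstance(value, list) else [value]
        for item in values:
            args.extend([f"--{key}", str(item)])
    return args


example = {
    "network": "data/networks/input-network.tsv",
    "source": ["node1", "node2"],  # placeholder node identifiers
    "threshold": 0.01,             # placeholder value
}
print(build_tps_args(example))
```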

Annotation generation:

peptideMapFile - TPS input peptide to protein mapping, does not yet support mapping to multiple proteins
timeSeriesFile - TPS input file with peptide time series
peptideFirstScoreFile - TPS input file with peptide significance scores when comparing to first time point
peptidePrevScoreFile - TPS input file with peptide significance scores when comparing to previous time point
windowsFile - TPS output file with activity windows
networkFile - TPS output sif file with network edges
goldStandardFile - list of proteins in the gold standard reference pathway
pvalThresh - p-value threshold to apply to TPS input first and prev score files, peptides with p-value <= the threshold are significant, set to 1E-10 if less than 1E-10
logTransform - Boolean, if true take log2 of the time series data
styleTemplateFile - a Cytoscape style file template
outAnnotFile - filename of the Cytoscape annotation file to write
outStyleFile - filename of the Cytoscape style file
logDefault - default value to use instead of log2(0) when taking the log transform, defaults to -1.0 if a value is not provided
addZero - prepend a 0 to the peptide time series
repairMissing - fill in values for missing data; if the first time point is missing, set it to 1 if logTransform is True or 0 otherwise; if later time points are missing, set them to the previous observed time point
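To make the repairMissing and logDefault descriptions concrete, here is a small sketch of how I read them. It is not the actual annotation script: representing missing values as `None` and assuming non-negative time series values are both assumptions, and the function names are hypothetical.

```python
import math


def repair_missing(values, log_transform):
    """Fill None entries following the repairMissing description."""
    repaired = []
    for i, value in enumerate(values):
        if value is None:
            if i == 0:
                # Missing first time point: 1 if log transforming, 0 otherwise.
                value = 1.0 if log_transform else 0.0
            else:
                # Missing later time point: carry the previous value forward.
                value = repaired[-1]
        repaired.append(value)
    return repaired


def log2_transform(values, log_default=-1.0):
    """Take log2 of each value, using log_default in place of log2(0)."""
    return [log_default if v == 0 else math.log2(v) for v in values]


print(repair_missing([None, 2.0, None, 4.0], log_transform=True))
print(log2_transform([0, 1, 4]))
```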

Visualization workflow:

output.sif
style file
Cytoscape session file name
annotations data types file
Cytoscape path

agitter commented 4 years ago

Thanks, it's very helpful to see all of these listed explicitly. The sheer number makes me prefer the config file option even more. That would be a lot of required arguments to supply at the command line.

Some of these are also redundant in the sense that the same input file is used in two different stages (e.g. timeSeriesFile) or the output of one stage is consumed as input by another stage.

We can also think more about setting reasonable defaults. For instance, most users won't need to specify a custom styleTemplateFile.
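One simple way to handle defaults would be to merge the user's settings over a dictionary of default values, so that only non-default settings need to appear in the config file. This is only a sketch: the default template path and the `logTransform` default are made up for illustration, while the `logDefault` value of -1.0 comes from the parameter list above.

```python
DEFAULTS = {
    "styleTemplateFile": "default_style_template.xml",  # hypothetical path
    "logTransform": False,                              # hypothetical default
    "logDefault": -1.0,                                 # from the parameter list
}


def with_defaults(user_settings, defaults=DEFAULTS):
    """Return a new dict where user settings override the defaults."""
    merged = dict(defaults)       # copy so DEFAULTS is never modified
    merged.update(user_settings)  # user-provided values take priority
    return merged


settings = with_defaults({"logTransform": True})
print(settings)
```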

agitter commented 4 years ago

One example from the Manubot project of using subprocess and passing arguments: https://github.com/manubot/manubot/blob/217e51473f1fd1c6427803676b3c70d44314bb93/manubot/pandoc/bibliography.py
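The general pattern looks roughly like the sketch below, which passes a list of arguments to `subprocess.run`. The command here is a stand-in that echoes its arguments with the current Python interpreter, not the real TPS call, and `run_command` is a hypothetical helper name.

```python
import subprocess
import sys


def run_command(args):
    """Run a command given as an argument list and return its standard output."""
    # check=True raises CalledProcessError if the command exits with an error.
    result = subprocess.run(args, check=True, capture_output=True, text=True)
    return result.stdout


# Stand-in command: a Python one-liner that prints the arguments it receives.
args = [sys.executable, "-c", "import sys; print(sys.argv[1:])",
        "--threshold", "0.01"]
output = run_command(args)
print(output)
```

Passing a list instead of a single shell string avoids quoting problems with file paths that contain spaces.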
