Pipeline: new location for configuring GFF files and aliases

AlexTate commented 1 year ago

Config file changes: The Alias by... and Feature Source columns have been removed from the Features Sheet. This is a healthy change because, within each rule, these two columns were coupled only to each other and not to any of the other columns, which understandably led to some confusion.

GFF file inputs are now defined in the Paths File, where all other non-sample file inputs reside. The GFF entry is a YAML list of mappings, where each list item holds the path to a file and an optional list of alias attributes for that file. When the Paths File is parsed, only unique GFF files are retained; if there are duplicate entries for the same path with different aliases, the aliases are merged with duplicates removed and order preserved.
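The merge behavior described above can be sketched roughly as follows. This is only an illustration, not the code in configuration.py: the `path` and `alias` key names, the `merge_gff_entries` helper, and the example file names are all assumptions made for the example.

```python
def merge_gff_entries(entries):
    """Collapse duplicate GFF paths, merging aliases in first-seen order."""
    merged = {}  # dicts preserve insertion order (Python 3.7+)
    for entry in entries:
        # Assumed key names: "path" (required) and "alias" (optional)
        aliases = merged.setdefault(entry["path"], [])
        for alias in entry.get("alias") or []:
            if alias not in aliases:
                aliases.append(alias)
    return [{"path": p, "alias": a} for p, a in merged.items()]


# Hypothetical parsed YAML: a list of mappings, one per GFF file
gff_files = [
    {"path": "a.gff3", "alias": ["ID", "Name"]},
    {"path": "b.gff3"},
    {"path": "a.gff3", "alias": ["Name", "sequence_name"]},
]
result = merge_gff_entries(gff_files)
```

Here the two entries for `a.gff3` collapse into one whose aliases are `ID`, `Name`, `sequence_name`, and `b.gff3` is kept with an empty alias list.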

Command line argument changes: The command line arguments for tiny-count have been updated accordingly. Rather than adding the Paths File as a third input alongside the two existing inputs (Samples Sheet and Features Sheet), users now pass only the Paths File, which contains the locations of all required file inputs.

Codebase improvements: A new class, PathsFile, has been added to configuration.py to act as an API to tiny-count and the Configuration class. It validates the config file at construction and automatically resolves relative paths upon lookup. This is true in both "pipeline" mode and standalone mode.
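The general pattern of validating at construction and resolving relative paths on lookup might look something like the sketch below. `PathsFileSketch` is a hypothetical stand-in, not the real PathsFile class; the required key names and the choice to resolve against the config file's own directory are assumptions for illustration.

```python
import os


class PathsFileSketch:
    """Hypothetical stand-in for a config class; not the real PathsFile."""

    REQUIRED = ("samples_csv", "features_csv")  # assumed key names

    def __init__(self, config_path, config):
        self.dir = os.path.dirname(os.path.abspath(config_path))
        self.config = dict(config)
        self.validate()  # fail early, at construction

    def validate(self):
        missing = [k for k in self.REQUIRED if not self.config.get(k)]
        if missing:
            raise ValueError(f"Missing required keys: {', '.join(missing)}")

    def __getitem__(self, key):
        value = self.config[key]
        # Relative paths are resolved against the config file's directory
        if isinstance(value, str) and not os.path.isabs(value):
            return os.path.normpath(os.path.join(self.dir, value))
        return value
```

With this shape, a caller indexes the object like a dictionary and receives a usable path regardless of the working directory, while a config with missing required keys raises before any work begins.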

Misc. changes and bugfixes:

The joinpath() and from_here() functions (used in all configuration file classes) have been hardened to more reliably handle a wider variety of inputs.
CSV files containing more columns than expected are now handled properly in CSVReader.validate_csv_header().
If GFFValidator is unable to parse any chromosomes from reference genome files, it now continues its search with the next best option (SAM files, currently). Previously this was treated as an indication of chromosome non-overlap.
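The fallback behavior in the last item can be illustrated with a small sketch. It is not GFFValidator's actual code: `first_nonempty_chroms` and the lambda parsers are hypothetical, and the chromosome names are made up for the example.

```python
def first_nonempty_chroms(sources):
    """Return (name, chroms) from the first source that yields any chromosomes."""
    for name, parse in sources:
        chroms = set(parse())
        if chroms:
            return name, chroms
        # Nothing parsed: not evidence of non-overlap, so try the next source
    return None, set()


# Hypothetical parsers standing in for genome and SAM file readers
sources = [
    ("reference genomes", lambda: []),        # nothing could be parsed
    ("SAM files", lambda: ["I", "II", "X"]),
]
source, chroms = first_nonempty_chroms(sources)
```

The point of the design is that an empty result from one source is treated as "no information" rather than as a mismatch, so the comparison is only made against a source that actually produced chromosome names.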

Closes #234

AlexTate commented 1 year ago

Merge conflicts have been resolved

taimontgomery commented 1 year ago

Tested successfully on ram1 data.

MontgomeryLab / tinyRNA

Pipeline: new location for configuring GFF files and aliases #245