tinyRNA provides an all-in-one solution for precision analysis of sRNA-seq data. At the core of tinyRNA is a highly flexible counting utility, tiny-count, that allows for hierarchical assignment of reads to features based on positional information, extent of feature overlap, 5’ nucleotide, length, and strandedness.
GNU General Public License v3.0
Pipeline: new location for configuring GFF files and aliases #245
Config file changes:
The Alias by... and Feature Source columns have been removed from the Features Sheet. This is a healthy change because, within each rule, these two columns were coupled only to each other and to none of the other columns, which understandably led to some confusion.
GFF file inputs are now defined in the Paths File, where all other non-sample file inputs reside. Its YAML data type is a list of mappings, where each list item holds the path to the file and an optional list of alias attributes for the file. When the Paths File is parsed, only unique GFF files are retained, and if there are duplicate entries for the same path but different aliases, the aliases are merged with duplicates removed and order preserved.
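The merge behavior described above can be illustrated with a minimal sketch. This is not the code from this PR: the function name merge_gff_entries and the example file names are hypothetical, and it assumes the parsed YAML arrives as a list of dicts with a "path" key and an optional "alias" list. It also compares paths as plain strings, whereas the real parser may normalize them first.

```python
def merge_gff_entries(entries):
    """Collapse duplicate GFF paths, merging alias lists in first-seen order.

    `entries` is assumed to be the parsed YAML list of mappings, e.g.
    [{"path": "a.gff3", "alias": ["ID"]}, ...]. The "alias" key is optional.
    """
    merged = {}  # dicts preserve insertion order, so path order is kept too
    for entry in entries:
        aliases = merged.setdefault(entry["path"], [])
        for alias in entry.get("alias") or []:
            if alias not in aliases:  # drop duplicate aliases, keep first position
                aliases.append(alias)
    return merged


# Hypothetical Paths File contents after YAML parsing
entries = [
    {"path": "ref/genes.gff3", "alias": ["Name", "ID"]},
    {"path": "ref/transposons.gff3"},
    {"path": "ref/genes.gff3", "alias": ["ID", "sequence_name"]},
]
result = merge_gff_entries(entries)
# {'ref/genes.gff3': ['Name', 'ID', 'sequence_name'], 'ref/transposons.gff3': []}
```

A dict keyed by path is used here because it handles both requirements at once: each path appears only once, and insertion order is preserved.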
Command line argument changes:
The command line arguments for tiny-count have been updated accordingly. Rather than adding the Paths File to the two existing inputs (Samples Sheet and Features Sheet), users need only pass the Paths File, which contains the locations of all required file inputs.
Codebase improvements:
A new class, PathsFile, has been added to configuration.py to act as an API to tiny-count and the Configuration class. It validates the config file at construction and automatically resolves relative paths upon lookup. This is true in both "pipeline" mode and standalone mode.
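The two behaviors named above, validation at construction and path resolution upon lookup, might look roughly like the following. This is a hedged sketch rather than the actual PathsFile class: the class name PathsFileSketch, the key names in REQUIRED, and the choice to resolve relative paths against the config file's own directory are all assumptions made for illustration. The config is passed in as an already-parsed dict so that the example does not depend on a YAML library.

```python
import os


class PathsFileSketch:
    # Hypothetical key names; the real Paths File may use different ones.
    REQUIRED = ("samples_csv", "features_csv", "gff_files")

    def __init__(self, file, config):
        # Assumption: relative entries are relative to the config file's directory.
        self.basedir = os.path.dirname(os.path.abspath(file))
        self.config = dict(config)
        self.validate()  # fail early, at construction

    def validate(self):
        missing = [key for key in self.REQUIRED if not self.config.get(key)]
        if missing:
            raise ValueError("Paths File is missing: " + ", ".join(missing))

    def resolve(self, path):
        # Absolute paths are returned unchanged; relative ones are anchored to basedir.
        if os.path.isabs(path):
            return path
        return os.path.normpath(os.path.join(self.basedir, path))

    def __getitem__(self, key):
        # Paths are stored as written and resolved only when looked up.
        value = self.config[key]
        if key == "gff_files":
            return [dict(entry, path=self.resolve(entry["path"])) for entry in value]
        return self.resolve(value)
```

Resolving at lookup rather than at load time means the stored values stay exactly as the user wrote them, which is convenient if the file is later written back out.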
Misc. changes and bugfixes:
The joinpath() and from_here() functions (used in all configuration file classes) have been hardened to more reliably handle a wider variety of inputs.
CSV files containing a greater than expected number of columns are now handled properly in CSVReader.validate_csv_header().
If GFFValidator was unable to parse any chromosomes from reference genome files, it now continues its search with the next best option (SAM files, currently). Previously this was treated as an indication of chromosome non-overlap.
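The GFFValidator fallback described in the last item can be sketched as a search over chromosome sources in priority order. The function names and the (name, loader) structure below are hypothetical and do not reflect GFFValidator's actual internals. The point of the sketch is only that an empty result from one source moves the search to the next source instead of being reported as non-overlap.

```python
def first_available_chroms(sources):
    """Return (name, chroms) for the first source that yields any chromosomes.

    `sources` is an ordered list of (name, loader) pairs, best option first;
    each loader returns a set of chromosome names (possibly empty).
    """
    for name, loader in sources:
        chroms = loader()
        if chroms:
            return name, chroms
    return None, set()


def chroms_overlap(gff_chroms, sources):
    """True/False if a comparison was possible, None if no source was usable."""
    name, ref_chroms = first_available_chroms(sources)
    if name is None:
        return None  # nothing to compare against; not evidence of non-overlap
    return bool(gff_chroms & ref_chroms)


# Hypothetical example: the genome files yield nothing, so SAM files are used
sources = [
    ("genome", lambda: set()),
    ("sam", lambda: {"I", "II", "III"}),
]
```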
Closes #234