Description

This PR (1) enables automatic detection and decompression of gzipped input files as part of the main Snekmer workflow, and (2) overhauls how background files are integrated into Snekmer workflows. Background files can now be supplied to Snekmer, and the kmer profile of the background sequences is used to inform the probability of a kmer appearing in a given family vs. the background, which in turn affects downstream models. For (2), a parallel workflow processes the background files and sums the kmer profiles observed across the background for integration into the scoring and modeling steps. See the full changelog for details.
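
For reviewers unfamiliar with item (1), here is a minimal sketch of one way gzipped inputs can be detected and read transparently. It checks the two-byte gzip magic number rather than the file extension. The helper names `is_gzipped` and `open_text` are hypothetical and are not taken from the Snekmer codebase; the actual workflow may detect and unzip files differently (for example, by extension inside a Snakemake rule).

```python
import gzip

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def is_gzipped(path):
    """Return True if the file at `path` begins with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def open_text(path):
    """Open a plain or gzipped file for reading in text mode (hypothetical helper)."""
    if is_gzipped(path):
        return gzip.open(path, "rt")
    return open(path, "rt")
```

Checking file content instead of the name means a gzipped file without a `.gz` suffix is still handled, at the cost of one extra two-byte read per file.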

Issues

Fixes #37
Fixes #60

Full Changelog

fix: glob all files without exclusion of bg. fix bg file detection
- refactor: all files are streamed to input files, rather than just files without associated background files.
- refactor: background filenames (stripped of extensions) are no longer part of the input stream for rules.score, preventing odd errors
- refactor: updated associated files to pull all input files as desired.
- fix: redo file glob -- file globbing now proceeds through glob_wildcards to more cleanly grab input files
fix: enable unzip -- unzipping has been overhauled (these are forward changes adapted from snekmer 2.0.0 / the biotite-kmers branch) (potentially fixes #60)
fix: add background -- changes have been made to collate background files and use their kmer distribution to subtract a background from protein family kmer models.
fix,refactor: pipe background i/o, update filenames
- fix: snakemake now correctly builds DAG for background workflow, including file unzipping
- refactor: some files have been renamed for simplicity
- refactor: some instances of skm.io.load_npz have been replaced with np.load due to KeyError (perhaps due to numpy or pickle version?)
- refactor: rules.combine_background now uses the kmer basis set of each family to reshape each background vector, which should make files smaller and the workflow more compact
feat: update kmer probability scoring for background subtract
- refactor: kmer probability scoring using background subtraction is now the default scoring method
- feat: snekmer.score.feature_class_probabilities now performs background-subtraction-based scoring, family-label-based scoring, or a combination of the two, depending on user input
chore: update config, tick version, and clean up files
- chore: new config parameter config['score']['method'] added for compatibility with additional new(!) scoring methods
- chore: uptick version from v1.1.1 -> v1.4.0
  - upticked 3 minor versions in anticipation of two pending PRs
- chore: remove no longer needed files
feat: enable kmer scoring via background subtraction (fixes #37)
- feat: kmers can now be scored with a probability score that subtracts the kmers observed in a supplied background set, a family set, or both combined
  - (note: some column headers have changed, which may affect downstream analysis, e.g. integration with #115, #116)
- feat: to handle user-supplied background files, new rules have been created to count background kmers and combine background kmer counts into a background matrix. The appropriate files for the new workflow have been created.
- feat,refactor: extensive changes have been made to snekmer.score to accommodate the new scoring methods, including:
  - feat: snekmer.score.score now has 3 distinct formulae to compute probability scores according to the desired scoring method
  - feat: snekmer.score.feature_class_probabilities now also integrates the scoring method
  - refactor: extensive code cleanup to remove extraneous functionality
- refactor: the main scoring rule itself has been significantly altered as follows:
  - refactor: all references to the old, non-working "background subtraction" (e.g. separating sequences by "sample" or "background" labels) have been removed
  - refactor: extraneous kmer probability scores for every family are no longer calculated; only the kmer profile of the family in question is scored
  - refactor: the scoring method is now integrated
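
To make the three scoring modes above concrete, here is a rough sketch of the general idea. It is not the implementation in snekmer.score: the exact formulae are not reproduced in this description, so the function name `score_kmers`, the method labels ("background", "family", "combined"), and the presence-fraction definition of kmer probability are all assumptions made for illustration.

```python
import numpy as np


def kmer_probabilities(counts):
    """Fraction of sequences (rows) in which each kmer (column) is present."""
    presence = np.asarray(counts) > 0
    return presence.mean(axis=0)


def score_kmers(family_counts, other_counts=None, background_probs=None,
                method="background"):
    """Hypothetical scoring sketch: family kmer probability minus a penalty.

    method="background": subtract probabilities from a supplied background
    method="family":     subtract probabilities observed in other families
    method="combined":   subtract both
    """
    if method not in ("background", "family", "combined"):
        raise ValueError(f"unknown scoring method: {method}")

    p_family = kmer_probabilities(family_counts)
    penalty = np.zeros_like(p_family)

    if method in ("family", "combined"):
        if other_counts is None:
            raise ValueError("other_counts is required for this method")
        penalty += kmer_probabilities(other_counts)

    if method in ("background", "combined"):
        if background_probs is None:
            raise ValueError("background_probs is required for this method")
        penalty += np.asarray(background_probs, dtype=float)

    return p_family - penalty
```

In this sketch, a kmer that is common in the family but equally common in the background scores near zero, which is the intended effect of background subtraction on the downstream models.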

PNNL-CompBio / Snekmer

Enable background subtraction / file unzipping #118
