Description
This PR (1) enables automatic detection and unzipping of gzipped files as part of the main Snekmer workflow, and (2) overhauls how background files are integrated into Snekmer workflows. Background files can now be supplied to Snekmer, and the kmer profile of the background sequences is used to inform the probability of kmers appearing in a given family vs. the background, which in turn affects downstream models. For (2), Snekmer now runs a parallel workflow that processes background files and sums the kmer profiles observed across the background for integration into the scoring and modeling steps. See the full changelog for details.
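As a rough illustration of item (1), gzipped inputs can be recognized by their two-byte magic number rather than by file extension, and then decompressed before the rest of the workflow reads them. The sketch below uses only the Python standard library; `is_gzipped` and `ensure_unzipped` are hypothetical helper names, not functions from Snekmer, and the actual rules in this PR may detect and unzip files differently.

```python
import gzip
import shutil
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def is_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def ensure_unzipped(path, out_dir):
    """Return a path to an uncompressed version of `path`.

    Hypothetical helper: gzipped inputs are decompressed into `out_dir`
    (which should differ from the input directory); any other file is
    returned unchanged.
    """
    path = Path(path)
    if not is_gzipped(path):
        return path
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Drop a trailing ".gz" if present; otherwise keep the original name.
    name = path.stem if path.suffix == ".gz" else path.name
    target = out_dir / name
    with gzip.open(path, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

Checking the magic bytes instead of the extension means a compressed file that was renamed without `.gz` is still handled.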
Issues
Fixes #37
Fixes #60
Full Changelog
fix: glob all files without exclusion of bg. fix bg file detection
refactor: all files are streamed to input files, rather than just files without associated background files.
refactor: background filenames (stripped of extensions) are no longer part of the input stream for rules.score, preventing odd errors
refactor: updated associated files to pull all input files as desired.
fix: redo file glob -- file globbing now proceeds through glob_wildcards to more cleanly grab input files
fix: enable unzip -- unzipping has been overhauled (these are forward changes adapted from snekmer 2.0.0 / the biotite-kmers branch) (potentially fixes #60)
fix: add background -- changes have been made to collate background files and use their kmer distribution to subtract a background from protein family kmer models.
fix: snakemake now correctly builds DAG for background workflow, including file unzipping
refactor: some files have been renamed for simplicity
refactor: some instances of skm.io.load_npz have been replaced with np.load due to KeyError (perhaps due to numpy or pickle version?)
refactor: rules.combine_background now uses the kmer basis set for each family to reshape each background vector. This should make files smaller and the workflow more compact
feat: update kmer probability scoring for background subtraction
refactor: kmer probability scoring using background subtraction is now the default scoring method
feat: snekmer.score.feature_class_probabilities now performs background-subtraction-based scoring, family-label-based scoring, or a combination thereof, depending on user input
chore: update config, tick version, and clean up files
chore: new config parameter config['score']['method'] added for compatibility with additional new(!) scoring methods
chore: uptick version from v1.1.1 -> v1.4.0
upticked +3 minor versions in anticipation of two pending PRs
chore: remove no longer needed files
feat: enable kmer scoring via background subtraction (fixes #37)
feat: kmer probability scores can now subtract the kmers observed in a supplied background set, a family set, or both combined
(note: some column headers have changed, which may affect downstream analysis, e.g. integration with #115, #116)
feat: to handle user-supplied background files, new rules have been created to count background kmers and combine background kmer counts into a background matrix. The appropriate files for the new workflow have been created.
feat,refactor: extensive changes have been made to snekmer.score to accommodate the new changes, including:
feat: snekmer.score.score now has 3 distinct formulae to compute probability scores according to the desired scoring method
feat: snekmer.score.feature_class_probabilities now also integrates the scoring method
refactor: extensive code cleanup to remove extraneous functionalities
refactor: the main scoring rule itself has been significantly altered as follows:
refactor: all references to the old, non-working "background subtraction" (e.g. separating sequences by "sample" or "background" labels) have been removed
refactor: extraneous kmer probability scores for every family are no longer calculated; only the family in question's kmer profile is scored
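To make the background-subtraction idea in the changelog concrete, here is a minimal sketch under stated assumptions. It treats a kmer's probability as the fraction of sequences containing it, sums the per-file background counts, restricts them to the family's kmer basis, and subtracts the background probability from the family probability. All function names are hypothetical, and the three formulae actually implemented in snekmer.score.score are not shown in this description, so this is one plausible form rather than the PR's implementation.

```python
from collections import Counter


def kmer_presence(sequences, k):
    """Count, for each kmer, how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        # A set ensures each kmer is counted once per sequence.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return counts


def combine_background(profiles, basis):
    """Sum per-file background counts, keeping only kmers in `basis`."""
    total = Counter()
    for profile in profiles:
        total.update(profile)
    return {kmer: total.get(kmer, 0) for kmer in basis}


def background_subtracted_scores(family_counts, n_family, bg_counts, n_bg):
    """Assumed score: P(kmer | family) - P(kmer | background)."""
    return {
        kmer: family_counts[kmer] / n_family - bg_counts.get(kmer, 0) / n_bg
        for kmer in family_counts
    }


# Toy usage: one family file and two background files, with k = 2.
family = ["MKVL", "MKVA"]
background_files = [["AKVL"], ["GGGG", "MKGG"]]

family_counts = kmer_presence(family, 2)
background = combine_background(
    [kmer_presence(seqs, 2) for seqs in background_files], family_counts
)
n_background = sum(len(seqs) for seqs in background_files)
scores = background_subtracted_scores(
    family_counts, len(family), background, n_background
)
```

Restricting the summed background to the family's kmer basis mirrors the stated goal of keeping background files small: kmers that never occur in the family are not stored or scored.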