Depends on #104 and #105. Note that this branch has #105 as its upstream.
This PR implements several improvements to the preprocessing pipeline.
Added full support for ambiguous (conflicting) alleles. A new pipeline option, allow_allele_conflicts, retains allele conflicts instead of filtering them out. This option changes the behavior of pp.error_correct_umis and pp.filter_molecule_table. Functions in pp.utilities have been updated to support allele conflicts.
intBCs can now be corrected to a whitelist using Levenshtein distance. The distance is calculated with a custom implementation in the ngs_tools library, which accounts for ambiguous bases; the Levenshtein library does not.
Implemented a new step in the preprocessing pipeline, filter_bam, which filters out reads with low-quality bases in either the UMI or the barcode sequence.
Deprecated the skip_existing option of pp.collapse_umis because its behavior is inconsistent with the other preprocessing functions. We may want to add this functionality back later for all preprocessing functions.
Parallelized the following preprocessing stages: collapse_umis, align_sequences, and error_correct_umis.
Moved n_threads to the general settings of the preprocessing pipeline.
Minor change to data.utilities.compute_dissimilarity_map so that the internal function can still be compiled by Numba in object mode for a slight speedup.
Moved is_ambiguous_state function from data.utilities to mixins.utilities because this function is needed in the preprocessing pipeline as well.
An error is now raised when trying to use a GreedySolver or ILPSolver on a tree with ambiguous states.
The table index (first column containing row index) is no longer written to the output tables in the preprocessing pipeline.
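
For reviewers who want a feel for the whitelist correction of intBCs, here is a minimal, self-contained sketch of a Levenshtein distance that treats ambiguous bases as matches, plus a correction step on top of it. It is not the ngs_tools implementation and does not use its API. The function names, the IUPAC table, the max_distance parameter, and the rule of rejecting ties are all assumptions made for illustration.

```python
from typing import List, Optional

# IUPAC nucleotide codes mapped to the concrete bases each one may represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def bases_match(a: str, b: str) -> bool:
    """Two codes match if they share at least one concrete base."""
    return bool(set(IUPAC[a]) & set(IUPAC[b]))


def ambiguous_levenshtein(s: str, t: str) -> int:
    """Levenshtein distance where ambiguous codes substitute at zero cost
    whenever they are compatible with the other base (hypothetical helper)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        curr = [i]
        for j, b in enumerate(t, 1):
            cost = 0 if bases_match(a, b) else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def correct_to_whitelist(
    barcode: str, whitelist: List[str], max_distance: int = 1
) -> Optional[str]:
    """Return the unique closest whitelist entry within max_distance.

    Returns None when nothing is close enough or when the closest
    entries are tied (an assumed policy, chosen to avoid arbitrary picks).
    """
    best, best_distance, tied = None, max_distance + 1, False
    for candidate in whitelist:
        distance = ambiguous_levenshtein(barcode, candidate)
        if distance < best_distance:
            best, best_distance, tied = candidate, distance, False
        elif distance == best_distance and distance <= max_distance:
            tied = True
    return None if tied else best
```

With this sketch, an intBC containing N can be assigned to a whitelist entry only when exactly one entry is compatible with it; if several entries fit equally well, the barcode is left uncorrected.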
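
The quality filter performed by filter_bam can be illustrated in the same spirit. The sketch below works on plain Phred+33 quality strings rather than on a BAM file, so it shows only the filtering rule, not the actual implementation. The Read type, the function names, and the default threshold of 10 are hypothetical.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Read(NamedTuple):
    """Hypothetical stand-in for a read with barcode and UMI qualities."""
    name: str
    barcode_quality: str  # Phred+33 encoded quality string
    umi_quality: str      # Phred+33 encoded quality string


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string into integer Phred scores."""
    return [ord(char) - offset for char in quality]


def passes_quality(read: Read, threshold: int = 10) -> bool:
    """True if every barcode and UMI base has quality >= threshold."""
    scores = phred_scores(read.barcode_quality) + phred_scores(read.umi_quality)
    return all(score >= threshold for score in scores)


def filter_reads(reads: Iterable[Read], threshold: int = 10) -> Iterator[Read]:
    """Yield only reads whose barcode and UMI are free of low-quality bases."""
    for read in reads:
        if passes_quality(read, threshold):
            yield read
```

A single low-quality base in either the barcode or the UMI is enough to drop the read, which matches the "either the UMI or barcode" wording above.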