The `dedupe_sigmut_mat` function searches for variants that share the same signature mutations, with the goal of clustering them.
It transforms a matrix of dimension [nAllVar x nMut] into a matrix of dimension [nSharedMutVar x nMut].
The column names of the resulting matrix are the pasted-together names of the variants that share a signature.
While this works for a few variants with highly distinct mutations, it could cause trouble as more variants with less variable mutations are introduced, or when too little data is available. This would lead to a larger number of similar variants and potentially very long column names, as recently seen here.
To solve this potential issue I suggest the following solution:
In the `dedupe_sigmut_mat` function, instead of returning a modified matrix, it might be easier to directly return the `group_list` (aka lineage) defined at https://github.com/BIMSBbioinfo/pigx_sars-cov-2/blob/485f7df73caa7b4eb74b1f7ab63091f0a888249c/scripts/deconvolution_funs.R#L198 and iterate over it, subsetting the original matrix in each step of the iteration. This would potentially simplify the code.
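
A rough sketch of what I have in mind, not the actual pipeline code. The helper name `group_variants_by_signature`, the toy matrix, and the `group_<n>` labels are made up for illustration, and I am assuming a 0/1 matrix with mutations in rows and variants in columns:

```r
# Hypothetical helper: return groups of variants (columns) that share an
# identical signature-mutation profile, instead of a renamed matrix.
# Assumes mutations in rows and variants in columns.
group_variants_by_signature <- function(sigmut_mat) {
  # Build one string key per variant from its column of values
  sig_keys <- apply(sigmut_mat, 2, paste, collapse = "")
  # Group variant names by key, keeping groups in order of first appearance
  groups <- split(
    colnames(sigmut_mat),
    factor(sig_keys, levels = unique(sig_keys))
  )
  # Short, stable labels instead of pasted variant names
  names(groups) <- paste0("group_", seq_along(groups))
  groups
}

# Toy example (made-up data): varA, varB and varD share a signature
sigmut_mat <- matrix(
  c(1, 0, 1,   # varA
    1, 0, 1,   # varB
    0, 1, 1,   # varC
    1, 0, 1),  # varD
  nrow = 3,
  dimnames = list(
    c("mut1", "mut2", "mut3"),
    c("varA", "varB", "varC", "varD")
  )
)

group_list <- group_variants_by_signature(sigmut_mat)

# Iterate over the groups and subset the original matrix in each step
for (group_name in names(group_list)) {
  # drop = FALSE keeps a matrix even for single-variant groups
  group_mat <- sigmut_mat[, group_list[[group_name]], drop = FALSE]
  message(group_name, ": ", ncol(group_mat), " variant(s)")
}

# If a deduplicated matrix is still needed, take one column per group;
# all columns within a group are identical by construction
dedup_mat <- sapply(group_list, function(group) sigmut_mat[, group[1]])
```

This way the column names stay short no matter how many variants end up in a group, and the mapping from group label to variant names remains available in `group_list` for reporting.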