BIMSBbioinfo / pigx_sars-cov-2

PiGx SARS-CoV-2 wastewater sequencing pipeline
GNU General Public License v3.0

Step away from modifying column names #158

Open alexg9010 opened 1 year ago

alexg9010 commented 1 year ago

When searching for variants that share the same signature mutations, the dedupe_sigmut_mat function aims to cluster variants. It transforms a matrix of dimension [nAllVar x nMut] into a matrix of dimension [nSharedMutVar x nMut]. The column names of the resulting matrix are the pasted-together names of the variants with the same signature.

While this might work for a few variants with highly distinct mutations, it could cause trouble as more variants are introduced whose mutations vary less, or when there is too little data. Either case would lead to a larger number of similar variants and potentially very long column names, as recently seen here.
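To make the concern concrete, here is a minimal sketch of the name-pasting behaviour on a toy matrix. It is not the pipeline's actual code: `dedupe_by_pasting`, the toy mutations, and the choice of "," as separator are all assumptions made for illustration, and the matrix is laid out with variants as columns because the issue is about column names.

```r
# Toy signature matrix; rows = mutations, columns = variants (assumed orientation).
sigmut_mat <- matrix(
  c(1, 0, 1,   # B.1.1.7
    1, 0, 1,   # Q.1 (same signature as B.1.1.7)
    0, 1, 1,   # B.1.351
    1, 0, 1),  # Q.4 (same signature as B.1.1.7)
  nrow = 3,
  dimnames = list(c("mutA", "mutB", "mutC"),
                  c("B.1.1.7", "Q.1", "B.1.351", "Q.4"))
)

# Hypothetical stand-in for the current behaviour, NOT the pipeline's code:
# collapse identical columns and paste the member names into the column name.
dedupe_by_pasting <- function(mat) {
  # One string per column, e.g. "101", so identical columns compare equal.
  signatures <- apply(mat, 2, paste, collapse = "")
  # Group variant names by signature, keeping first-appearance order.
  groups <- split(colnames(mat), factor(signatures, levels = unique(signatures)))
  # Keep the first column of each group and rename it to the pasted group name.
  deduped <- mat[, !duplicated(signatures), drop = FALSE]
  colnames(deduped) <- vapply(groups, paste, character(1),
                              collapse = ",", USE.NAMES = FALSE)
  deduped
}

deduped <- dedupe_by_pasting(sigmut_mat)
colnames(deduped)
```

With only three variants sharing a signature the first column name is already a concatenation of three lineage names; each further variant with the same signature makes it longer.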

To address this potential issue, I suggest the following:

  1. For the dedupe_sigmut_mat function, instead of returning a modified matrix, it might be easier to directly return the group_list (aka lineage) defined at https://github.com/BIMSBbioinfo/pigx_sars-cov-2/blob/485f7df73caa7b4eb74b1f7ab63091f0a888249c/scripts/deconvolution_funs.R#L198 and iterate over it, subsetting the original matrix in each iteration. This would potentially simplify the code.
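A rough sketch of what this suggestion could look like, under some assumptions: `group_by_signature` and the `group_1`, `group_2`, ... labels are hypothetical names, the toy matrix is made up, and variants are assumed to be the columns. The real group_list in deconvolution_funs.R may be built differently.

```r
# Toy signature matrix; rows = mutations, columns = variants (assumed orientation).
sigmut_mat <- matrix(
  c(1, 0, 1,   # B.1.1.7
    1, 0, 1,   # Q.1 (same signature as B.1.1.7)
    0, 1, 1,   # B.1.351
    1, 0, 1),  # Q.4 (same signature as B.1.1.7)
  nrow = 3,
  dimnames = list(c("mutA", "mutB", "mutC"),
                  c("B.1.1.7", "Q.1", "B.1.351", "Q.4"))
)

# Hypothetical helper: return the groups of variants sharing a signature,
# leaving the matrix and its column names untouched.
group_by_signature <- function(mat) {
  signatures <- apply(mat, 2, paste, collapse = "")
  groups <- split(colnames(mat), factor(signatures, levels = unique(signatures)))
  # Short, fixed-length labels instead of pasted variant names.
  names(groups) <- paste0("group_", seq_along(groups))
  groups
}

group_list <- group_by_signature(sigmut_mat)

# Iterate over the groups, subsetting the original matrix at each step.
for (group_id in names(group_list)) {
  members <- group_list[[group_id]]
  sub_mat <- sigmut_mat[, members, drop = FALSE]
  # All member columns are identical, so the first one represents the group.
  representative <- sub_mat[, 1]
  message(group_id, ": ", length(members), " variant(s), ",
          sum(representative), " signature mutation(s)")
}
```

The group labels stay short no matter how many variants end up in a group, and the member names remain available in group_list wherever they are needed for reporting.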