Open leonjessen opened 5 years ago
I think all these functions fit in the scope of tidysq (mutate data from biological sequences into features understandable for ML). Only sequence logos seem to be slightly off the scope, but in the end they are relevant to, for example, the presentation of significant features (motifs). Right now, during the phase of intense development, I would rather keep everything in one bin. After your holiday we might see a need to split the functionality between different packages.
So, my original thought with `PepTools` was a small, super-lightweight, dependency-free (i.e. only `base` code) toolbox for working with peptide data (which is what we do in the group). E.g. `one-hot`, `BLOSUM`, `atchley-factors`, `BLOSUM_pca`, etc. At the same time, I wanted to use it in my teaching ("Immunological Bioinformatics" and "R for Bio Data Science").
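To make the encoding idea concrete, here is a minimal base-R sketch of a one-hot encoder for equal-length peptides. The function name `pep_one_hot_sketch`, the residue ordering, and the column naming are assumptions made for illustration only; this is not the actual PepTools implementation.

```r
# Hypothetical helper for illustration, not part of PepTools.
# The 20 standard amino acids, in an arbitrary but fixed order.
AMINO_ACIDS <- strsplit("ARNDCQEGHILKMFPSTWYV", "")[[1]]

pep_one_hot_sketch <- function(peptides) {
  # All peptides must be character strings of the same length
  stopifnot(is.character(peptides), length(unique(nchar(peptides))) == 1)
  k <- nchar(peptides[1])
  n_aa <- length(AMINO_ACIDS)

  # One row per peptide, one column per (residue, position) pair
  out <- matrix(
    0L,
    nrow = length(peptides),
    ncol = k * n_aa,
    dimnames = list(
      peptides,
      paste0(rep(AMINO_ACIDS, times = k), "_p", rep(seq_len(k), each = n_aa))
    )
  )

  for (i in seq_along(peptides)) {
    residues <- strsplit(peptides[i], "")[[1]]
    idx <- match(residues, AMINO_ACIDS)
    stopifnot(!anyNA(idx))  # reject non-standard residues
    # Column offset of each position block plus the residue index within it
    cols <- (seq_len(k) - 1L) * n_aa + idx
    out[i, cols] <- 1L
  }
  out
}

# Two 9-mers give a 2 x 180 matrix with exactly nine 1s per row
x <- pep_one_hot_sketch(c("GILGFVFTL", "NLVPMVATV"))
dim(x)
```

A `BLOSUM` encoding could follow the same layout, replacing each 20-element indicator block with the corresponding row of a substitution matrix.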
Some of the functions would be simple wrappers, primarily to match the terminology of bioinformatics, and the package would then also include standard data, like the `PepTools2::BLOSUM62` and `PepTools2::BLOSUM50` matrices, natural background frequencies `PepTools2::BGFREQS` and example peptides `PepTools2::PEPTIDES`. Furthermore, the `ggseqlogo` package is quite nice, but it only supports simple Shannon-entropy-based logos, which is sub-optimal compared to Kullback-Leibler logos. So basically, I wanted to extend it with the ability to compute PSSMs to match the functionality of Seq2Logo; these matrices could then be visualised using the `custom` functionality of `ggseqlogo`. Lastly, my intention was to name all functions using the prefix `pep_`.
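As a rough illustration of the difference between the two logo types, the base-R sketch below computes the per-position information content under both definitions. All function names are hypothetical, the pseudocount scheme is a simplification, and the skewed background is made up for the example (it is not the real `BGFREQS` data), so this should be read as a sketch of the idea rather than what Seq2Logo or PepTools actually does.

```r
# Illustrative sketch only; names are hypothetical and not the PepTools API.
AMINO_ACIDS <- strsplit("ARNDCQEGHILKMFPSTWYV", "")[[1]]

# Position-specific frequency matrix (20 residues x k positions).
# A pseudocount keeps every frequency strictly positive.
pep_freq_sketch <- function(peptides, pseudocount = 1) {
  k <- nchar(peptides[1])
  stopifnot(all(nchar(peptides) == k))
  chars <- do.call(rbind, strsplit(peptides, ""))  # n x k character matrix
  counts <- sapply(seq_len(k), function(j) {
    as.numeric(table(factor(chars[, j], levels = AMINO_ACIDS)))
  })
  rownames(counts) <- AMINO_ACIDS
  counts <- counts + pseudocount
  sweep(counts, 2, colSums(counts), "/")  # normalise each column to sum to 1
}

# Shannon information per position: log2(20) - H_i
shannon_bits <- function(p) log2(nrow(p)) + colSums(p * log2(p))

# Kullback-Leibler information per position relative to a background q
# (q is recycled down each column of p, i.e. matched by residue)
kl_bits <- function(p, q) colSums(p * log2(p / q))

peps <- c("GILGFVFTL", "NLVPMVATV", "GLCTLVAML", "LLFGYPVYV")
p <- pep_freq_sketch(peps)

# With a uniform background the two definitions coincide
q_uniform <- rep(1 / 20, 20)
all.equal(shannon_bits(p), kl_bits(p, q_uniform))

# Made-up background in which leucine is three times as common as the rest
q_skewed <- setNames(rep(1, 20), AMINO_ACIDS)
q_skewed["L"] <- 3
q_skewed <- q_skewed / sum(q_skewed)
kl_skewed <- kl_bits(p, q_skewed)
```

Under the skewed background, a position enriched for leucine carries less information than the Shannon measure would suggest, which is the kind of correction a background-aware logo provides.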
Thinking about it, perhaps we should make the PepTools package a separate package, but still a sub-part of `tidysq`? A bit like `ggplot2` is a part of `tidyverse`? I'm interested in your thoughts. 👍