Bioconductor / Biostrings

Efficient manipulation of biological strings
https://bioconductor.org/packages/Biostrings

Amino Acid sequence to letter function #93

Closed LiNk-NY closed 3 months ago

LiNk-NY commented 1 year ago

Hi Hervé, @hpages Is there a function that takes a string, e.g., MetThrGly, and converts it to "MTG"? If not, and if it is within scope, I can work on implementing one using AMINO_ACID_CODE. Best, Marcel

hpages commented 1 year ago

Hmm.. interesting! I've never seen amino acid sequences in that format. Out of curiosity, may I ask how/where people retrieve amino acid sequences that are in the "MetThrGly" format?

LiNk-NY commented 1 year ago

I don't have a good answer.

Perhaps Laurent @lgatto or Johannes @jorainer can provide some insight?

FWIW, this type of functionality is available on webpages and even in matlab: https://www.mathworks.com/help/bioinfo/ref/aminolookup.html

I don't think it would hurt to include it out of convenience given that AMINO_ACID_CODE is in the package.
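For illustration, here is a minimal sketch of the lookup this would rely on. AMINO_ACID_CODE is a named character vector whose names are one-letter codes and whose values are three-letter abbreviations. The variable names below (`abbrev3_to_letter`, `chunks`) are made up for this example and are not part of Biostrings.

```r
library(Biostrings)

# Peek at the table: names are one-letter codes, values are three-letter abbreviations
head(AMINO_ACID_CODE, 3)

# Invert the table so that three-letter abbreviations index one-letter codes
# ('abbrev3_to_letter' is an illustrative name, not a Biostrings object)
abbrev3_to_letter <- setNames(names(AMINO_ACID_CODE), unname(AMINO_ACID_CODE))

# Split the input into consecutive chunks of three characters and look each one up
x <- "MetThrGly"
starts <- seq(1L, nchar(x), by = 3L)
chunks <- substring(x, starts, starts + 2L)
result <- paste(abbrev3_to_letter[chunks], collapse = "")
result
```

This sketch does no validation, so an unknown abbreviation or an input whose length is not a multiple of 3 would pass through unnoticed.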

lgatto commented 1 year ago

I don't know of any such functionality and I have never had a need for it. AA codes and other info are available from PSMatch::getAminoAcids(). PSMatch is a package that deals with peptide spectrum matches, i.e. peptide/protein identification from mass spectrometry experiments.

hpages commented 1 year ago

Good to know about PSMatch.

I'd rather have a good use case before adding something like this to Biostrings, or at least have some users request it. I agree that in theory it doesn't hurt to have it, but still, I'm not a big fan of adding functionality that nobody is going to use.

Anyways, since I just spent some time playing with this a little, I'll put what I came up with here, for the record:

.prepare_invalid_abbrev3_fancy_msg <- function(x30, bad_idx, n=5L)
{   
    nbad <- length(bad_idx)
    idx <- head(bad_idx, n=n)
    bad_abbrev3 <- x30[idx]
    details <- paste0("\"", bad_abbrev3, "\" at position ", idx, collapse=", ")
    if (nbad > n)            
        details <- paste0(details, " etc... (", nbad - n, " more)")
    paste0("input contains invalid three-letter abbreviation(s): ", details)
}

makeAAStringFromAbbrev3Seq <- function(x, ignore.case=FALSE)
{
    if (!isSingleString(x))
        stop(wmsg("'x' must be a single string"))
    if (nchar(x) %% 3L != 0L)
        stop(wmsg("number of characters in input must be a multiple of 3"))
    if (!isTRUEorFALSE(ignore.case))
        stop(wmsg("'ignore.case' must be TRUE or FALSE"))
    x <- BString(x)  
    x3 <- x30 <- as.character(successiveViews(x, rep.int(3L, length(x) %/% 3L)))
    ALL_ABBREV3 <- c(AMINO_ACID_CODE, `*`="END", `-`="GAP")
    if (ignore.case) {
        x3 <- tolower(x3)
        ALL_ABBREV3 <- tolower(ALL_ABBREV3)
    }
    m <- match(x3, ALL_ABBREV3)
    bad_idx <- which(is.na(m))
    if (length(bad_idx) != 0L) 
        stop(wmsg(.prepare_invalid_abbrev3_fancy_msg(x30, bad_idx)))
    codes <- names(ALL_ABBREV3)[m]
    AAString(paste(codes, collapse=""))
}

makeAAStringFromAbbrev3Seq("MetTrpLysGlnAlaGluAspIleArgAspIleTyrAspPhe")
# 14-letter AAString object
# seq: MWKQAEDIRDIYDF

Thanks guys.

LiNk-NY commented 1 year ago

Thanks for your work on this. Any updates for this issue based on #97? From what I read, it should be easier to implement with the encoding framework.

ahl27 commented 1 year ago

I'm not sure, it's quite a bit more work than I had initially expected--XStringSets assume single byte character input, so we'd have to rewrite quite a bit of stuff to get it to support a multi-character input value. I'm not sure if I'll be able to get to this in the near future, there are other Biostrings issues that are higher priority at the moment on top of my research.

I'd echo Hervé's point that the functionality doesn't seem to be requested by users aside from just having it to have it. If you have a use case that it would be relevant for please let me know, or if you have an implementation feel free to open a PR.

End-users can already get this functionality with something simple like:

# Assume that CONVERSION_STRING is a named character vector
# like c("M","T","G",...) with names c("met","thr","gly",...)

convertAA <- function(aastr){
    # Insert a space after every three lowercase letters, then split on spaces
    converted <- CONVERSION_STRING[strsplit(gsub('([a-z]{3})', '\\1 ', tolower(aastr)), ' ')[[1]]]
    AAString(paste(converted, collapse=''))
}

Hervé's function is definitely a lot safer with regard to error checking.
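As a rough sketch of what the assumed CONVERSION_STRING could look like, it can be derived from AMINO_ACID_CODE by lower-casing the three-letter abbreviations and using them as names. This construction is an assumption made for this example, not something confirmed elsewhere in the thread. The function is repeated here only so the block runs on its own.

```r
library(Biostrings)

# Assumed construction: lower-cased three-letter abbreviations as names,
# one-letter codes as values
CONVERSION_STRING <- setNames(names(AMINO_ACID_CODE),
                              tolower(unname(AMINO_ACID_CODE)))

convertAA <- function(aastr) {
    # Insert a space after every three lowercase letters, then split on spaces
    chunks <- strsplit(gsub("([a-z]{3})", "\\1 ", tolower(aastr)), " ")[[1]]
    AAString(paste(CONVERSION_STRING[chunks], collapse = ""))
}

ok <- as.character(convertAA("MetThrGly"))

# Pitfall: an unknown abbreviation yields NA, which paste() turns into the
# literal text "NA", and "NA" is itself a valid amino acid string (Asn, Ala)
bad <- as.character(convertAA("MetFoo"))
```

In this sketch `ok` is "MTG", while `bad` is "MNA" rather than an error. That silent failure is the kind of case the explicit checks in Hervé's version guard against.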

Implementing it in a robust and clean way within Biostrings would be a lot harder; it would likely require a custom method, since the individual characters of a three-letter string already all map to amino acids (e.g. AAString("MetThrGly") == AAString("METTHRGLY") == AAString("metthrgly")).

ahl27 commented 1 year ago

On second thought, it could be pretty simple to just add an optional argument like useThreeLetterCodes=FALSE to the AAString method, and then if true to call a preprocessing function like above (or Hervé's better implementation) to reformat the string from three letter codes to single letter codes.

At that point though, I guess the question is if people are actually doing that, and if so, if that functionality is needed in Biostrings or if end-users can just preprocess it themselves.
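To make the idea concrete, here is a hedged sketch of such an optional argument, written as a standalone wrapper instead of a change to the actual AAString constructor. The name `AAString3` is hypothetical, and the preprocessing is a simplified stand-in for either of the functions above.

```r
library(Biostrings)

# Hypothetical wrapper: 'AAString3' is an illustrative name, and
# 'useThreeLetterCodes' is not an argument of the real AAString()
AAString3 <- function(x, useThreeLetterCodes = FALSE) {
    if (useThreeLetterCodes && nchar(x) > 0L) {
        if (nchar(x) %% 3L != 0L)
            stop("number of characters in input must be a multiple of 3")
        # Case-insensitive lookup from three-letter to one-letter codes
        lookup <- setNames(names(AMINO_ACID_CODE),
                           tolower(unname(AMINO_ACID_CODE)))
        starts <- seq(1L, nchar(x), by = 3L)
        chunks <- tolower(substring(x, starts, starts + 2L))
        codes <- lookup[chunks]
        if (anyNA(codes))
            stop("input contains invalid three-letter abbreviation(s)")
        x <- paste(codes, collapse = "")
    }
    AAString(x)
}

AAString3("MetThrGly", useThreeLetterCodes = TRUE)
AAString3("MTG")
```

With the default `useThreeLetterCodes = FALSE` the wrapper passes the input straight to AAString(), so existing one-letter input is unaffected.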

ahl27 commented 3 months ago

Sorry for the slow follow up--I think for now I'm going to leave this as unimplemented. I'm not sure it makes sense to change the constructor AAString method to have an additional argument for this case. If people are interested in this functionality I can add it to my backlog to address, but for now it's unplanned. I'll keep the issue open in case other people have further thoughts.

LiNk-NY commented 3 months ago

Thanks for following up. I'm okay with leaving it unimplemented since there are no follow ups from the community.

ahl27 commented 3 months ago

I added this to the TODO file so I don't forget about it in the future--I'll look into revisiting this when I have more bandwidth and the higher priority tasks are cleared up.