Proteomicslab57357 / UniprotR

Retrieving Information of Proteins from Uniprot
GNU General Public License v3.0

GetProteinAnnontate produces "cannot open the connection" error due to bad/illegal format #29

Closed muecker closed 2 years ago

muecker commented 2 years ago

Hi, great package to retrieve information from Uniprot, I find it very useful - thank you for this!

Unfortunately, the GetProteinAnnontate function does not work for me (GetProteinFunction, GetNamesTaxa, GetProteinGOInfo and all others I tested do work).

Here is my code:

ProteinAcc<-"P42293"
GetProteinAnnontate(ProteinAcc,c("entry name", "protein names"))

It produces the following warnings and errors:

Warning: URL 'http://www.uniprot.org/uniprot/?query=accession:P42293&format=tab&columns=entry name': status was 'URL using bad/illegal format or missing URL'
Warning: URL 'http://www.uniprot.org/uniprot/?query=accession:P42293&format=tab&columns=entry name': status was 'URL using bad/illegal format or missing URL'
Error in file(file, "rt"):
cannot open the connection to 'http://www.uniprot.org/uniprot/?query=accession:P42293&format=tab&columns=entry name'

I tried different formats of column names (both "Legacy Returned Field" and "Returned Field" from this page: https://www.uniprot.org/help/return_fields) but this did not change anything.

I am using UniprotR Version 2.2.1 with R Version 4.2.1 (2022-06-23).

Do you have any advice on how to solve this?

AliYoussef96 commented 2 years ago

Hi,

I appreciate your interest in using our package. I believe this is due to the recent API changes at UniProt. We will fix the issue in the next update of UniprotR. For now, I have written a new version of the GetProteinAnnontate function that you can use until then.

library(curl)      # has_internet()
library(httr)      # GET(), timeout()
library(UniprotR)  # HandleBadRequests()

GetProteinAnnontate <- 
function (ProteinAccList, columns) 
{
  if (!has_internet()) {
    message("Please connect to the internet as the package requires an internet connection.")
    return()
  }
  baseUrl <- "https://rest.uniprot.org/uniprotkb/"
  ProteinInfoParsed_total_col = data.frame(x = "x")
  for (filed in columns) {
    ProteinInfoParsed_total <- data.frame()
    for (ProteinAcc in ProteinAccList) {
      Request <- tryCatch({
        GET(paste0(baseUrl, ProteinAcc, ".xml"), timeout(10))
      }, error = function(cond) {
        message("Internet connection problem occurs and the function will return the original error")
        message(cond)
      })
      ProteinName_url <- paste0("/search?query=accession:", ProteinAcc, 
                                "&format=tsv&fields=", filed)
      RequestUrl <- paste0(baseUrl, ProteinName_url)
      if (length(Request) == 0) {
        message("Internet connection problem occurs")
        return()
      }
      if (Request$status_code == 200) {
        parse_true <- function() {
          ProteinInfoParsed <- as.data.frame(read.csv(RequestUrl, 
                                                      sep = "\t", header = TRUE), row.names = ProteinAcc)
          return(ProteinInfoParsed)
        }
        parse_false <- function() {
          ProteinInfoParsed <- read.csv(RequestUrl, 
                                        sep = "\t", header = TRUE)
          names <- names(ProteinInfoParsed)
          ProteinInfoParsed <- data.frame(name_col = "NA", 
                                          row.names = ProteinAcc)
          colnames(ProteinInfoParsed) <- names
          return(ProteinInfoParsed)
        }
        ProteinInfoParsed <- tryCatch(parse_true(), 
                                      error = function(e) parse_false())
        ProteinInfoParsed_total <- rbind(ProteinInfoParsed_total, 
                                         ProteinInfoParsed)
      }
      else {
        HandleBadRequests(Request$status_code)
      }
    }
    ProteinInfoParsed_total_col <- cbind(ProteinInfoParsed_total_col, 
                                         ProteinInfoParsed_total)
    remove(ProteinInfoParsed_total)
  }
  ProteinInfoParsed_total_col <- ProteinInfoParsed_total_col[, 
                                                             !(names(ProteinInfoParsed_total_col) %in% c("x"))]
  return(ProteinInfoParsed_total_col)
}

Run this function and then you are good to go:

ProteinAcc<-"P42293"
GetProteinAnnontate(ProteinAcc,c("gene_names", "protein_name"))

Note that you have to use the names from the "Returned Field" column (not the "Legacy Returned Field" column) on this page: https://www.uniprot.org/help/return_fields
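
To make the request format concrete, here is a minimal sketch of how a search URL for the new endpoint could be assembled. The helper name build_uniprot_url is hypothetical and not part of UniprotR. It assumes the same base URL, query form and tsv format as the function above, and that several "Returned Field" names can be sent as one comma-separated list. It only builds the string and does not contact the server.

```r
# Hypothetical helper (not part of UniprotR): builds a search URL for the
# rest.uniprot.org endpoint used in the replacement function above.
build_uniprot_url <- function(accession, fields) {
  base_url <- "https://rest.uniprot.org/uniprotkb/search"
  # Percent-encode the query so characters such as ':' or spaces are URL-safe.
  query <- utils::URLencode(paste0("accession:", accession), reserved = TRUE)
  # Encode each field name, then join them into one comma-separated list
  # (assumption: the API accepts several return fields in a single request).
  encoded_fields <- vapply(fields, utils::URLencode, character(1),
                           reserved = TRUE, USE.NAMES = FALSE)
  field_list <- paste(encoded_fields, collapse = ",")
  paste0(base_url, "?query=", query, "&format=tsv&fields=", field_list)
}

url <- build_uniprot_url("P42293", c("gene_names", "protein_name"))
print(url)
```

The resulting string can be read with read.csv(url, sep = "\t", header = TRUE), as the function above does. Encoding the pieces first avoids passing raw spaces in the URL, which is one plausible trigger for the "bad/illegal format" message in the original report.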

AliYoussef96 commented 2 years ago

@MohmedSoudy

Could you please consider this modified version of GetProteinAnnontate in the next update on CRAN?

MohmedSoudy commented 2 years ago

@AliYoussef96 Sure.

muecker commented 2 years ago

@AliYoussef96 Thank you for the new function - works like a charm! :)