arendsee / phylostratr

An R framework for phylostratigraphy
GNU General Public License v3.0

Error in readLines(con) : HTTP error 400. #25

Closed. kdarragh1994 closed this issue 2 years ago.

kdarragh1994 commented 2 years ago

Hi,

I am trying to run the Arabidopsis example and am running into difficulties. When I run any of the uniprot commands, e.g. uniprot_weight_by_ref or uniprot_strata, I get the same error message: Error in readLines(con) : HTTP error 400.

My error for this chunk of code in the markdown file is this:

Error in readLines(con) : HTTP error 400.

  5. readLines(con)
  4. readLines(con) %>% cast
  3. wrap_uniprot_id_retrieval(db = "taxonomy", query = query, cast = as.integer, ...)
  2. uniprot_downstream_ids(clade, reference_only = TRUE)
  1. uniprot_weight_by_ref()

Thank you, Kathy

arendsee commented 2 years ago

@kdarragh1994 Uniprot deprecated their old API a few months ago and that is what broke everything in phylostratr. I just updated the Uniprot functions and everything should work now. Let me know if you run into any further problems.

kdarragh1994 commented 2 years ago

@arendsee Thank you for fixing this. I haven't used phylostratr before, so I'm not sure whether this is normal, but the files in the uniprot-seqs folder are very small. For example, for 7227 (the Drosophila melanogaster ID) the file contains tens of proteins, when there should be thousands according to the proteome on uniprot. All of the files in the uniprot-seqs folder are only a few KB. Do you know what might be causing this?

kdarragh1994 commented 2 years ago

@arendsee I am also definitely in the correct working directory in R, and yet for every organism I get the error message: BLAST Database error: No alias or index file found for protein database.

arendsee commented 2 years ago

@kdarragh1994 Hmm, I'll need to double check what is going on. Can you send me the exact code you are running?

kdarragh1994 commented 2 years ago

@arendsee

Thank you! I attached the peptide file in case you need it too (I changed .fa to .txt for uploading purposes). The first issue, as mentioned above, is that the proteomes seem small. Then, when I go to run the blastdb section, I get this error for every proteome: BLAST Database error: No alias or index file found for protein database. However, I am in the correct directory and I can see the proteome files (although they are small). Thank you again for your help with this.

Here is my code:

weights=uniprot_weight_by_ref()
focal_taxid <- '1507135'
strata <- uniprot_strata(focal_taxid, from=2) %>%
  strata_apply(f=diverse_subtree, n=5, weights=weights) %>%
  use_recommended_prokaryotes %>%
  add_taxa(c('7227','4932', '9606', '132113','88501','44477','143995','166423','178035','516756','597456','1437190')) %>%
  uniprot_fill_strata

strata@data$faa[['1507135']] <- 'peptides_all.fa'

strata <- strata_blast(strata, blast_args=list(nthreads=8)) %>%
  strata_besthits
results <- merge_besthits(strata)

peptides_all.txt

arendsee commented 2 years ago

Thanks, I found the problem. The files are short because the new API returns data in chunks. I am currently working on the code to loop through these chunks and collect all the data. I should have everything working in a day or two. I'll stay in touch.
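
For readers wondering what "looping through the chunks" looks like in general, here is a minimal sketch in base R. It is not phylostratr's code and does not call UniProt: `make_fake_fetcher`, `fetch_all`, and the `records` / `next_cursor` fields are hypothetical names that simulate a service returning results one chunk at a time, with a pointer to the next chunk.

```r
# Hypothetical stand-in for one HTTP request: returns one chunk of records
# plus a cursor for the next chunk (NULL once nothing is left).
# Assumes all_records is non-empty.
make_fake_fetcher <- function(all_records, chunk_size) {
  function(cursor = 1L) {
    last <- min(cursor + chunk_size - 1L, length(all_records))
    next_cursor <- if (last < length(all_records)) last + 1L else NULL
    list(records = all_records[cursor:last], next_cursor = next_cursor)
  }
}

# Keep requesting chunks until the service reports that no chunk follows.
fetch_all <- function(fetch_page) {
  chunks <- list()
  cursor <- 1L
  repeat {
    page <- fetch_page(cursor)
    chunks[[length(chunks) + 1L]] <- page$records
    if (is.null(page$next_cursor)) break
    cursor <- page$next_cursor
  }
  unlist(chunks)
}

# Simulated data: 1234 made-up identifiers served 500 at a time.
ids <- sprintf("P%05d", 1:1234)
fetch_page <- make_fake_fetcher(ids, chunk_size = 500L)
collected <- fetch_all(fetch_page)

first_chunk_size <- length(fetch_page()$records)  # what a single request yields
total_size <- length(collected)                   # what the loop yields
```

A client that makes only the first request ends up with `first_chunk_size` records instead of `total_size`, which matches the symptom of proteome files that are much shorter than expected.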

arendsee commented 2 years ago

@kdarragh1994 OK, I think I fixed the problem. Pull from github and let me know if everything is working correctly. I'm still having trouble removing isoforms from proteomes (which is the default behavior). Your code will raise some warnings about this, but it shouldn't be a problem. You will just have a few extra genes in some of the target species. You may want to provide your own data for the focal species.

Let me know how it goes and good luck with the bees!

kdarragh1994 commented 2 years ago

@arendsee Thanks so much! I think it must be working now, because I am getting an out of memory error in R. I am trying to get it working on our university cluster and am running into some difficulties, but I think these are cluster-related rather than issues with the package itself. Have you successfully done this, or do you mainly run things locally?

arendsee commented 2 years ago

Phylostratr isn't very efficient with memory. I'll see if I can fix that. One thing you might try in the meantime is to subsample the 'peptides_all.fa' file down to, say, 100 proteins and see if it works.
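
One way to do the subsampling suggested above, sketched in base R. `subsample_fasta` is a hypothetical helper, not a phylostratr function, and it assumes a plain FASTA file in which every record starts with a ">" header line. The demo runs on a made-up temporary file rather than on peptides_all.fa.

```r
# Keep n randomly chosen records from a FASTA file and write them to outfile.
# Note: set.seed() fixes the global random number generator state.
subsample_fasta <- function(infile, outfile, n = 100, seed = 1) {
  lines <- readLines(infile)
  lines <- lines[nzchar(lines)]            # drop blank lines
  is_header <- startsWith(lines, ">")
  stopifnot(length(lines) > 0, is_header[1])
  record_id <- cumsum(is_header)           # record number for every line
  n_records <- sum(is_header)
  set.seed(seed)
  keep <- sort(sample.int(n_records, min(n, n_records)))
  writeLines(lines[record_id %in% keep], outfile)
  invisible(length(keep))
}

# Demo on a made-up file: 250 records, each with two sequence lines.
demo_in <- tempfile(fileext = ".fa")
demo_out <- tempfile(fileext = ".fa")
records <- unlist(lapply(1:250, function(i) {
  c(sprintf(">prot%d", i), "MKTAYIAKQR", "QISFVKSHFS")
}))
writeLines(records, demo_in)
subsample_fasta(demo_in, demo_out, n = 100)
out_lines <- readLines(demo_out)
```

With the script from earlier in the thread, the focal species entry (strata@data$faa[['1507135']]) would then point at the subsampled file instead of 'peptides_all.fa'.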

kdarragh1994 commented 2 years ago

@arendsee The error message that I am getting is at the first step when downloading the proteomes before I have read in the peptides_all.fa file. Uniprot is accessed and then a while later I get "Error: out of memory". The folder which should hold the proteomes is not even created.

On the cluster I get stuck at the same step but with error message: "Error: '~/.cache/R/taxizedb/taxdump/names.dmp' does not exist. In addition: Warning message: In utils::unzip(db_path_file, files = c("names.dmp", "nodes.dmp"), : error 1 in extracting from zip file". I wonder if it has something to do with the file structure for R in the cluster vs local R and where it is looking for the files.

Any advice on either issue really appreciated!!

arendsee commented 2 years ago

I'm working on it, the uniprot API is still not quite right. I'm rewriting it using the SPARQL interface. I've figured out the SPARQL queries, but haven't gotten it integrated yet.

As for the taxizedb error, it sounds like the database hasn't been downloaded on your system. Try deleting the .cache/R/taxizedb directory. Then the database should be downloaded automatically. But you'll still have to wait to use phylostratr until I fix the API.
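
The cache can also be removed from within R. The sketch below uses only base R; `clear_cache` is a hypothetical helper, and the real path in the final comment is taken from the error message quoted above, so it may differ on other systems. The demo runs on a throwaway directory so that nothing real is deleted.

```r
# Delete a cache directory if it exists.
# Returns (invisibly) TRUE if it was removed, FALSE otherwise.
clear_cache <- function(cache_dir) {
  cache_dir <- path.expand(cache_dir)
  if (!dir.exists(cache_dir)) return(invisible(FALSE))
  status <- unlink(cache_dir, recursive = TRUE)  # unlink() returns 0 on success
  invisible(status == 0)
}

# Demo on a throwaway directory that mimics the cache layout.
demo_cache <- file.path(tempdir(), "taxizedb-demo")
dir.create(file.path(demo_cache, "taxdump"), recursive = TRUE)
file.create(file.path(demo_cache, "taxdump", "names.dmp"))
removed <- clear_cache(demo_cache)
still_there <- dir.exists(demo_cache)

# For the real cache (path assumed from the error message):
# clear_cache("~/.cache/R/taxizedb")
```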

arendsee commented 2 years ago

@kdarragh1994 The SPARQL API is working now. I haven't done careful end-to-end tests again yet. Hopefully, I can get these done over the next couple of days. You can try running the program again. It should work on your personal computer, assuming it runs Linux or macOS.

kdarragh1994 commented 2 years ago

@arendsee Thanks so much, all ran smoothly! Just wanted to let you know there is no longer a proteome available for "1895832". I removed it following previous instructions from another thread but thought I should give you a heads up. Thanks again!

arendsee commented 2 years ago

@kdarragh1994 Yay, I'm glad it is working! That proteome is for one of the prokaryote representatives, right? I guess I should update them at some point. Though it should be fine for now to just leave it out.