ToniWestbrook / paladin

Protein Alignment and Detection Interface
MIT License

Problem retrieving data from Uniprot #34

Closed marcelosoria closed 6 years ago

marcelosoria commented 6 years ago

Hi, I'm having trouble running paladin when it tries to retrieve data from Uniprot. I get a message saying "Received unexpected job ID size". I've copied the complete log below.

I tried installing from bioconda first, and then I did a manual installation and the same problem showed up both times. The program does create the sam and tsv files, and they look fine.

When I run the script "make_test.sh" from the sample_data directory it works fine, and retrieves the information from Uniprot.

Thanks, Marcelo

[M::command_align] Loading the index for reference 'uniprot_sprot.fasta'...
[M::index_load_from_disk] Read 0 ALT contigs
[W::writeReadsProtein] Brute force ORF detection redundant to MF index, disabling...
[M::writeReadsProtein] Detecting open reading frames...
[M::writeReadsProtein] Detected and translated 101914 open reading frames in 208577 sequences
[M::process] Read 611484 protein sequences (34983238 AA)...
[M::mem_process_seqs] Processed 611484 protein sequences in 102.528 CPU sec, 17.206 real sec
[M::renderNumberAligned] Aligned 13099 out of 102112 total detected ORF sequences (12.83%)
[M::prepareUniprotLists] Aggregating 13099 entries for UniProt report
[M::retrieveUniprotOnline] Submitted 5266 of 5266 entries to UniProt...
[E::retrieveUniprotOnline] Received unexpected job ID size
[E::retrieveUniprotOnline] Received unexpected job ID size
[E::retrieveUniprotOnline] Received unexpected job ID size
[E::retrieveUniprotOnline] Received unexpected job ID size
[E::retrieveUniprotOnline] Received unexpected job ID size

ToniWestbrook commented 6 years ago

Hi @marcelosoria - this error means that when PALADIN submits that batch of 5266 protein IDs to UniProt and requests a job ID (used to monitor when the query is complete), UniProt is likely returning an error webpage instead. This can happen when the servers are busy.

You can try reducing the batch size, which lowers the chance that UniProt returns an error. Unfortunately, although PALADIN-Plugins lets you easily change the batch size, PALADIN itself has 10,000 hardcoded in a header file in the current version. I'm going to change this in the future, but for now, you can edit "uniprot.h" in the source and change the line

#define UNIPROT_MAX_SUBMIT 10000

to something smaller, like 2000, and then recompile; you may have better luck. You don't want to make this too small if you have a lot of reads, as it will take a long time to download all the data. But for 13099 ORFs, it won't be too bad. Hopefully that at least gets you going for now - let me know how it goes. Otherwise I can investigate more.
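To get a feel for the trade-off between batch size and download time, here is a minimal sketch of how the number of submissions grows as the limit shrinks. It assumes entries are simply split into consecutive batches of at most UNIPROT_MAX_SUBMIT each; `batch_count` is a hypothetical helper written for illustration, not a function from the PALADIN source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of PALADIN): number of submissions needed
 * to send total_ids entries when each submission carries at most
 * batch_size entries. This is plain ceiling division. */
size_t batch_count(size_t total_ids, size_t batch_size)
{
    assert(batch_size > 0); /* a zero-sized batch makes no sense */
    return (total_ids + batch_size - 1) / batch_size;
}
```

Under that assumption, the 5266 entries from the log above would go out as 1 submission with a limit of 10000, 3 with 2000, 11 with 500, and 264 with 20, which is why very small values get slow.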

marcelosoria commented 6 years ago

Hi, you're right. I recompiled with a lower value for UNIPROT_MAX_SUBMIT and it worked. The problem is that 2000, as you suggested, was also too high. I tried 500 and then jumped really low to 20. It worked with 20, but it is rather slow. For the next runs I'll try to find a better value. So, basically, the problem is solved. The program is great; it provides just the data I need. Thanks! Marcelo
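One way to look for a better value than trial and error by hand is to start high and halve the batch size until a submission is accepted. The sketch below only illustrates that search; it is not how PALADIN behaves. The callback type `submit_probe_fn`, the function `find_workable_batch`, and the stand-in probe `probe_accepts_up_to` are all hypothetical names introduced here.

```c
#include <stddef.h>

/* Hypothetical callback (not part of PALADIN): returns nonzero if a
 * submission of batch_size entries was accepted by the server. */
typedef int (*submit_probe_fn)(size_t batch_size, void *ctx);

/* Halve the batch size, starting from start, until the probe accepts it.
 * Returns the first accepted size, or 0 if even a single entry fails. */
size_t find_workable_batch(size_t start, submit_probe_fn probe, void *ctx)
{
    size_t size = start;
    while (size > 0) {
        if (probe(size, ctx))
            return size;
        size /= 2; /* integer halving: 2000, 1000, 500, ... */
    }
    return 0;
}

/* Stand-in probe for illustration only: accepts any batch whose size
 * does not exceed the limit stored in *ctx. A real probe would submit
 * a batch and check whether a valid job ID came back. */
int probe_accepts_up_to(size_t batch_size, void *ctx)
{
    const size_t *limit = ctx;
    return batch_size <= *limit;
}
```

Halving finds a working value quickly but can undershoot: if the server happened to accept at most 20 entries, a search starting from 2000 would settle on 15. Since the failures seem to depend on server load, a value found this way may also not hold on a later run.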