Closed by GoogleCodeExporter 9 years ago
Basically this is a protein inference issue (peptide to protein mapping) and
not directly related to the size of the files; however, the bigger the file,
the more likely you are to run into complex protein inference issues.
Errors like "'A0EVJ8_cus_A0EVJ9_cus_A0EVK0_cus_A0EVK1_cus_A0FKC4_cus_A0FK...'
is too long" mean that you have a protein group that, when listed as above,
creates a longer string of characters than can be stored in the local
PeptideShaker database. The maximum length is 32672 characters, so a pretty
long list of proteins can be stored.
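To make the limit concrete, here is an illustrative sketch (not PeptideShaker source code) of how a protein-group key built by joining accession numbers can overflow that 32672-character limit; the accession format just mimics the error message above:

```python
# Illustrative sketch: a protein-group key is the joined accession list,
# and the local database can store at most 32672 characters (the limit
# quoted in the comment above).
MAX_KEY_LENGTH = 32672

def group_key(accessions):
    """Join the accessions of one protein inference group into a single key."""
    return "_".join(accessions)

def fits_in_database(accessions):
    """True if the group key fits in the local database column."""
    return len(group_key(accessions)) <= MAX_KEY_LENGTH

small_group = ["A0EVJ8_cus", "A0EVJ9_cus", "A0EVK0_cus"]
print(fits_in_database(small_group))  # True

# ~3500 nine-character accessions plus separators exceed the limit:
big_group = [f"B{i:04d}_cus" for i in range(3500)]
print(fits_in_database(big_group))  # False
```

So the error only appears once a single inference group collects several thousand accessions, which is why it shows up with complex samples and redundant databases.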
From the accession numbers in the example it would seem you are searching
against the whole of UniProt? Unless this is specifically needed, I would
recommend searching against SwissProt, i.e., the reviewed sequences in UniProt.
Another thing that often tends to help is making sure that the mgf files are
peak picked. This results in smaller, better-quality mgf files and seems to
reduce the chance of getting these overly complex protein groups.
It would also be nice if you could try opening the same files in our current
beta version
(http://code.google.com/p/peptide-shaker/downloads/detail?name=PeptideShaker-0.23.0-beta.zip)
to see if the problem with these large protein groups has been fixed there or
not.
Original comment by harald.b...@gmail.com
on 4 Nov 2013 at 1:29
Hi,
I am not searching against all of Uniprot. In fact, for some of the data, I
am searching against a very small database of only a few hundred proteins.
I originally noted the behaviour loading large DAT results. In some cases,
I am running very complex samples over long gradients and getting around
1000 - 3000 protein IDs in a single run. I am now seeing the behaviour for
the smaller database searches (hundreds of proteins in the database), but I
need to load numerous DAT files to invoke the issue.
All of the data is peak picked prior to searching. I will try the beta and
let you know how it goes.
Original comment by snoor...@gmail.com
on 4 Nov 2013 at 4:43
I've now looked at the code and it doesn't seem like using the beta version
will help. However, I see how to solve the problem with the large protein group
identifiers and will let you know when we have a new beta version for you to
test.
Until then there is nothing you can do except search against databases with
less complex protein inference groups. If you look at the sequences in the
error message, you will see that A0EVJ8, A0EVJ9, etc. are all unreviewed and
have very similar sequences. You have then identified one (or more) of the
peptides shared by all of these protein sequences, resulting in our identifier
for the group (basically the list of accession numbers) becoming too long to
store in the database.
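The mechanism described above can be sketched in a few lines (a toy illustration, not PeptideShaker code; the sequences are hypothetical): a shared peptide maps back to every protein that contains it, and all of those proteins end up in one group whose key lists them all.

```python
# Toy illustration of peptide-to-protein mapping: one shared peptide
# pulls every matching protein into the same inference group.
def proteins_containing(peptide, database):
    """Return the sorted accessions of every protein containing the peptide."""
    return sorted(acc for acc, seq in database.items() if peptide in seq)

# Hypothetical database of highly similar, unreviewed-style entries:
db = {
    "A0EVJ8": "MKTAYIAKQRQISFVK",
    "A0EVJ9": "MKTAYIAKQRQISFVR",
    "A0EVK0": "MKTAYIAKQRQISFVL",
}

group = proteins_containing("AYIAKQRQ", db)
print("_".join(group))  # A0EVJ8_A0EVJ9_A0EVK0
```

With thousands of near-identical unreviewed entries, the joined key grows with every additional match, which is what eventually exceeds the storage limit.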
So is there any way you can simplify your database while waiting for our fix?
How big is the database btw?
Original comment by harald.b...@gmail.com
on 4 Nov 2013 at 10:14
Hello,
I implemented a fix which should allow you to load your files in the next
version of PeptideShaker. Can you make some files available for me to test?
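(The thread does not say what the fix actually is. One common technique for this kind of problem, sketched here purely as a hypothetical guess and not as the PeptideShaker implementation, is to fall back to a fixed-length hash whenever the joined accession list exceeds the column limit:)

```python
# Hypothetical sketch, NOT the actual fix: replace an oversized group key
# with a fixed-length digest so it always fits in the database column.
import hashlib

MAX_KEY_LENGTH = 32672  # limit quoted earlier in the thread

def storable_key(accessions):
    """Return the plain key when it fits, otherwise a 64-character digest."""
    key = "_".join(accessions)
    if len(key) <= MAX_KEY_LENGTH:
        return key
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

print(len(storable_key(["A%05d" % i for i in range(10000)])))  # 64
```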
Thank you!
Marc
Original comment by mvau...@gmail.com
on 4 Nov 2013 at 6:43
Hi Marc,
I'm actually happy to say the beta version worked for the files I tested.
These had crashed with 0.22.6. The database in this case was quite small,
only a few hundred proteins. I will test it with my larger database and
larger DATs if you want.
I usually restrict the database to a given species or taxon, be that human or
Rodentia. It's unusual for me to expand to Mammalia, but I do on occasion.
The samples giving problems at the moment are all from human or Rodentia
database searches.
I'm happy to supply files for testing, or test them here, whatever is
easier.
Let me know,
Peter
Original comment by snoor...@gmail.com
on 4 Nov 2013 at 8:30
Just following up on this. I have successfully loaded files that crashed
0.22.6 in the 0.23.0-beta version. The DATs were quite large and I was
using Rodentia (Uniprot). Everything proceeded as expected. I will try
loading the same data with Tandem, Mascot and OMSSA searches.
Original comment by snoor...@gmail.com
on 5 Nov 2013 at 1:04
Hi Peter!
Glad to hear that the new version fixed the problem. As you experienced, the
new version handles protein inference better; beware that this also means
you cannot compare the number of identified proteins between versions. It is
crucial that you use the same version for the entire project :)
Best regards!
Marc
Original comment by mvau...@gmail.com
on 5 Nov 2013 at 9:10
Original comment by harald.b...@gmail.com
on 17 Nov 2013 at 11:26
Original issue reported on code.google.com by
snoor...@gmail.com
on 3 Nov 2013 at 11:50