Error loading large files

GoogleCodeExporter commented 9 years ago

> What steps will reproduce the problem?
Loading large data files. These can be Mascot, Tandem, or OMSSA. The problem 
seems to be related to the size of the files (or combined size) rather than the 
source of the file.

> What is the expected output? What do you see instead?
Typically the error I get is something like; A string constant starting with 
''A0EVJ8_cus_A0EVJ9_cus_A0EVK0_cus_A0EVK1_cus_A0FKC4_cus_A0FK&' is too long.

> What version of the product are you using? On what operating system?
0.22.6, CentOS, 64-bit

> Please provide any additional information below.
I have not been able to work out what the size threshold is, or maybe, for the 
problem to occur. I have recently started searching very large data sets and 
routinely end up with DATs that are over 250 MB. Loading these files has been 
my problem. I can load around 2 at a time, but above that doesn't work.

> If the reported issue resulted in the tool crashing, please
> also upload the file called PeptideShaker.log (found in the
> PeptideShaker-X.Y.Z\resources folder).

The log follows;

Mon Nov 04 07:32:50 EST 2013: PeptideShaker version 0.22.6.

Total amount of memory in the Java virtual machine: 128647168.

Java version: 1.7.0_19.

1714 script command tokens
(C) 2009 Jmol Development
Jmol Version: 12.0.43  2011-05-03 14:21
java.vendor: Oracle Corporation
java.version: 1.7.0_19
os.name: Linux
memory: 22.5/128.6
processors available: 32
useCommandThread: false
Reindexing: ICKP_140_IS_human_concatenated_target_decoy.fasta.
File I:/MassspecCoreGroup/Search 
Engine/DB/SearchGUI/ICKP_140_seq_concatenated_target_decoy.fasta was not found. 
Please select a different FASTA file.
java.io.FileNotFoundException: 
/home/phains/bin/PeptideShaker-0.22.6/I:/MassspecCoreGroup/Search 
Engine/DB/SearchGUI/ICKP_140_seq_concatenated_target_decoy.fasta (No such file 
or directory)
    at java.io.RandomAccessFile.open(Native Method)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
    at java.io.RandomAccessFile.<init>(RandomAccessFile.java:118)
    at uk.ac.ebi.pride.tools.braf.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:45)
    at uk.ac.ebi.pride.tools.braf.BufferedRandomAccessFile.<init>(BufferedRandomAccessFile.java:53)
    at com.compomics.util.experiment.identification.SequenceFactory.loadFastaFile(SequenceFactory.java:357)
    at eu.isas.peptideshaker.fileimport.FileImporter.importSequences(FileImporter.java:171)
    at eu.isas.peptideshaker.fileimport.FileImporter$IdProcessorFromFile.importFiles(FileImporter.java:511)
    at eu.isas.peptideshaker.fileimport.FileImporter$IdProcessorFromFile.doInBackground(FileImporter.java:476)
    at javax.swing.SwingWorker$1.call(SwingWorker.java:296)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at javax.swing.SwingWorker.run(SwingWorker.java:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
No Decoy section found!!
<CompomicsError> PeptideShaker processing failed. See the PeptideShaker log for 
details. </CompomicsError>
java.sql.SQLException: A string constant starting with 
''A0EVJ8_cus_A0EVJ9_cus_A0EVK0_cus_A0EVK1_cus_A0FKC4_cus_A0FK&' is too long.
    at org.apache.derby.impl.jdbc.SQLExceptionFactory40.getSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
    at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
    at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source)
    at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source)
    at org.apache.derby.impl.jdbc.EmbedStatement.executeLargeUpdate(Unknown Source)
    at org.apache.derby.impl.jdbc.EmbedStatement.executeUpdate(Unknown Source)
    at com.compomics.util.db.ObjectsDB.deleteObject(ObjectsDB.java:664)
    at com.compomics.util.experiment.identification.IdentificationDB.removeProteinMatch(IdentificationDB.java:263)
    at com.compomics.util.experiment.identification.Identification.removeProteinMatch(Identification.java:832)
    at eu.isas.peptideshaker.PeptideShaker.cleanProteinGroups(PeptideShaker.java:2363)
    at eu.isas.peptideshaker.PeptideShaker.processIdentifications(PeptideShaker.java:339)
    at eu.isas.peptideshaker.fileimport.FileImporter$IdProcessorFromFile.importFiles(FileImporter.java:574)
    at eu.isas.peptideshaker.fileimport.FileImporter$IdProcessorFromFile.doInBackground(FileImporter.java:476)
    at javax.swing.SwingWorker$1.call(SwingWorker.java:296)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at javax.swing.SwingWorker.run(SwingWorker.java:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.sql.SQLException: A string constant starting with 
''A0EVJ8_cus_A0EVJ9_cus_A0EVK0_cus_A0EVK1_cus_A0FKC4_cus_A0FK&' is too long.
    at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
    at org.apache.derby.impl.jdbc.SQLExceptionFactory40.wrapArgsForTransportAcrossDRDA(Unknown Source)
    ... 23 more
Caused by: ERROR 54002: A string constant starting with 
''A0EVJ8_cus_A0EVJ9_cus_A0EVK0_cus_A0EVK1_cus_A0FKC4_cus_A0FK&' is too long.
    at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
    at org.apache.derby.impl.sql.compile.SQLParser.stringLiteral(Unknown Source)
    at org.apache.derby.impl.sql.compile.SQLParser.literal(Unknown Source)
    at org.apache.derby.impl.sql.compile.SQLParser.valueSpecification(Unknown Source)
    at org.apache.derby.impl.sql.compile.SQLParser.valueExpressionPrimary(Unknown Source)
    at org.apache.derby.impl.sql.compile.SQLParser.primary(Unknown Source)
    at org.apache.derby.impl.sql.compile.SQLParser.primaryExpressionXX(Unknown Source)
    at org.apache.derby.impl.sql.compile.SQLParser.primaryExpression(Unknown Source)
    at org.apache.derby.impl.sql.compile.SQLParser.unaryExpression(Unknown Source)
    at org.apache.derby.impl.sql.compile.SQLParser.multiplicativeExpression(Unknown Source)
    at org.apache.derby.impl.sql.compile.SQLParser.additiveExpression(Unknown Source)
    at org.apache.derby.impl.sql.compile.SQLParser.remainingNonNegatablePredicate(Unknown Source)
    at org.apache.derby.impl.sql.compile.SQLParser.remainingPredicate(Unknown Source)
    at org.apache.derby.impl.sql.compile.SQLParser.predicate(Unknown Source)
    at org.apache.derby.impl.sql.compile.SQLParser.booleanPrimary(Unknown Source)
    at org.apache.derby.impl.sql.compile.SQLParser.isSearchCondition(Unknown Source)
    at org.apache.derby.impl.sql.compile.SQLParser.andExpression(Unknown Source)
    at org.apache.derby.impl.sql.compile.SQLParser.orExpression(Unknown Source)
    at org.apache.derby.impl.sql.compile.SQLParser.valueExpression(Unknown Source)
    at org.apache.derby.impl.sql.compile.SQLParser.whereClause(Unknown Source)
    at org.apache.derby.impl.sql.compile.SQLParser.deleteBody(Unknown Source)
    at org.apache.derby.impl.sql.compile.SQLParser.preparableDeleteStatement(Unknown Source)
    at org.apache.derby.impl.sql.compile.SQLParser.preparableSQLDataStatement(Unknown Source)
    at org.apache.derby.impl.sql.compile.SQLParser.StatementPart(Unknown Source)
    at org.apache.derby.impl.sql.compile.SQLParser.Statement(Unknown Source)
    at org.apache.derby.impl.sql.compile.ParserImpl.parseStatement(Unknown Source)
    at org.apache.derby.impl.sql.GenericStatement.prepMinion(Unknown Source)
    at org.apache.derby.impl.sql.GenericStatement.prepare(Unknown Source)
    at org.apache.derby.impl.sql.conn.GenericLanguageConnectionContext.prepareInternalStatement(Unknown Source)
    ... 17 more
Free memory: 4280350536

Original issue reported on code.google.com by snoor...@gmail.com on 3 Nov 2013 at 11:50

GoogleCodeExporter commented 9 years ago

Basically this is a protein inference issue (peptide to protein mapping) and 
not directly related to the size of the files, however the bigger the file the 
more likely you are to have more complex protein inference issues.

Errors like "'A0EVJ8_cus_A0EVJ9_cus_A0EVK0_cus_A0EVK1_cus_A0FKC4_cus_A0FK...' 
is too long" means that you have a protein group that when listed as above 
creates a longer string of characters than what can be stored in the local 
PeptideShaker database. The maximum length is 32672 characters, so a pretty 
long list of proteins can be stored.

From the accession numbers in the example it would seem to you are searching 
against the whole of UniProt? If this is not especially needed I would rather 
recommend searching against SwissProt, i.e., the reviewed sequences in UniProt.

Another thing that often tend to help is making sure that the mgf files are 
peak picked. This results in smaller and better quality mgf files and seems to 
reduce the chance of getting these overly complex protein groups.

Would also be nice if you could try opening the same files in our current beta 
version 
(http://code.google.com/p/peptide-shaker/downloads/detail?name=PeptideShaker-0.2
3.0-beta.zip) to see if the problem of these large protein groups have been 
fixed there or not.

Original comment by harald.b...@gmail.com on 4 Nov 2013 at 1:29

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Hi,

I am not searching against all of Uniprot. In fact, for some of the data, I
am searching against a very small database of only a few hundred proteins.
I originally noted the behaviour loading large DAT results. In some cases,
I am running very complex samples over long gradients and getting around
1000 - 3000 protein IDs in a single run. I am now seeing the behavious for
the smaller database searches (hundreds of proteins in the database), but I
need to load numerous DAT files to invoke the issue.

All of the data is peak picked prior to searching. I will try the beta and
let you know how it goes.

Original comment by snoor...@gmail.com on 4 Nov 2013 at 4:43

GoogleCodeExporter commented 9 years ago

I've now looked at the code and it doesn't seem like using the beta version 
will help. However, I see how to solve the problem with the large protein group 
identifiers and will let you know when we have a new beta version for you to 
test.

Until then there is nothing you can do except search against databases with 
less complex protein inference groups. If you look at the sequences in the 
error message you will see that A0EVJ8, A0EVJ9, etc are all unreviewed and have 
very similar sequences. And you have then identified one (or more) of the 
peptides shared by all of these protein sequences, resulting in our identifier 
for the group (basically the list if accession numbers) becoming too long to 
store in the database.

So is there anyway you can simplify your database while waiting for our fix? 
How big is the database btw?

Original comment by harald.b...@gmail.com on 4 Nov 2013 at 10:14

Changed state: Started

GoogleCodeExporter commented 9 years ago

Hello,

I implemented a fix which should allow you to load your files in the next 
version of PeptideShaker. Can you make some files available for me to test?

Thank you!

Marc

Original comment by mvau...@gmail.com on 4 Nov 2013 at 6:43

GoogleCodeExporter commented 9 years ago

Hi Marc,

I'm actually happy to say the beta version worked for the files I tested.
These had crashed with 0.22.6. The database in this case was quite small,
only a few hundred proteins. I will test it with my larger database and
larger DATs if you want.

I usually restrict the database to a given species, be that human or
rodentia. It's unusual for me to expand to mammalia, but I do on occasions.
The samples giving problems at the moment are all human or rodentia
database searches.

I'm happy to supply files for testing, or test them here, whatever is
easier.

Let me know,

Peter

Original comment by snoor...@gmail.com on 4 Nov 2013 at 8:30

GoogleCodeExporter commented 9 years ago

Just following up from this. I have successfully loaded files that crashed
0.22.6 in the 0.23.0-beta version. The DATs were quite large and I was
using Rodentia (Uniprot). Everything proceeded as expected. I will try
loading the same data with Tandem, Mascot and OMSSA searches.

Original comment by snoor...@gmail.com on 5 Nov 2013 at 1:04

GoogleCodeExporter commented 9 years ago

Hi Peter!

Glad to know that the new version fixed the problem. As you experienced the
new version has a better handling of the protein inference, beware, that
also means you cannot compare the number of identified proteins between
versions. It is crucial that you use the same version for the entire
project :)

Best regards!

Marc

Original comment by mvau...@gmail.com on 5 Nov 2013 at 9:10

GoogleCodeExporter commented 9 years ago

Original comment by harald.b...@gmail.com on 17 Nov 2013 at 11:26

Changed state: Fixed

Immortalin / peptide-shaker

Error loading large files #33