compomics / compomics-utilities

Open source Java library for computational proteomics
http://compomics.github.io/projects/compomics-utilities.html

Regex Failure in com.compomics.util.protein.Header #32

Closed tmcgowan closed 5 years ago

tmcgowan commented 5 years ago

In SearchGUI, we have a user who prepends a source tag (sihumi_) to the accession in their FASTA header line:

>sp|sihumi_Q8AAB1| GLMS_BACTN Glutamine--fructose-6-phosphate aminotransferase [isomerizing] OS=Bacteroides thetaiotaomicron (strain ATCC 29148 / DSM 2079 / NCTC 10582 / E50 / VPI-5482) GN=glmS PE=3 SV=2

The header is picked up by the following branch in com.compomics.util.protein.Header (line 711):

else if (aFASTAHeader.matches("^[^\\s]*\\|[^\\s]+_[^\\s]+ .*")) {
                    // New (9.0 release (31 Oct 2006) and beyond) standard SwissProt header as
                    // present in the Expasy FTP FASTA file.
                    // Is formatted something like this:
                    //  >accession|ID descr rest (including taxonomy, if available)
                    result.iAccession = aFASTAHeader.substring(0, aFASTAHeader.indexOf("|")).trim();
                    // See if there is location information.
                    if (aFASTAHeader.matches("[^\\(]+\\([\\d]+ [\\d]\\)$")) {
                        int openBracket = aFASTAHeader.indexOf("(");
                        result.iAccession = aFASTAHeader.substring(0, openBracket).trim();
                        result.iStart = Integer.parseInt(aFASTAHeader.substring(openBracket, aFASTAHeader.indexOf(" ", openBracket)).trim());
                        result.iEnd = Integer.parseInt(aFASTAHeader.substring(aFASTAHeader.indexOf(" ", openBracket), aFASTAHeader.indexOf(")")).trim());
                    }
                    result.databaseType = DatabaseType.UniProt;
                    result.iID = "sw"; // @TODO: remove hardcoding?
                    result.iDescription = aFASTAHeader.substring(aFASTAHeader.indexOf("|") + 1);

                    // try to get the gene name and taxonomy
                    parseUniProtDescription(result);
} 

At result.iAccession = aFASTAHeader.substring(0, aFASTAHeader.indexOf("|")).trim(); the accession is parsed as 'sp'. This triggers an accession duplication error, because every header of this form ends up with the same 'sp' accession.
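To make the behaviour reproducible outside SearchGUI, here is a minimal sketch that runs the two regular expressions quoted in this report against the header. It assumes the leading '>' has already been stripped before matching and that the space after the second '|' is really present in the user's file. The class name and the helper accessionBeforeFirstPipe are hypothetical and only mirror the quoted substring call. The header is shortened for brevity.

```java
import java.util.regex.Pattern;

class HeaderRegexDemo {

    // Pattern of the branch at line 711 (">accession|ID descr ..."), copied from the quoted source.
    static final Pattern ACCESSION_ID = Pattern.compile("^[^\\s]*\\|[^\\s]+_[^\\s]+ .*");

    // Pattern of the "sp" branch (">sp|accession|ID descr ..."), copied from the quoted source.
    static final Pattern SP_ACCESSION_ID = Pattern.compile("^sp\\|[^|]*\\|[^\\s]+_[^\\s]+ .*");

    // Hypothetical helper: same expression the line-711 branch uses for result.iAccession.
    static String accessionBeforeFirstPipe(String header) {
        return header.substring(0, header.indexOf("|")).trim();
    }

    public static void main(String[] args) {
        // Assumption: '>' already removed; note the space after the second '|'.
        String header = "sp|sihumi_Q8AAB1| GLMS_BACTN Glutamine--fructose-6-phosphate aminotransferase";

        // The underscore in "sihumi_Q8AAB1|" satisfies the [^\s]+_[^\s]+ part of this pattern.
        System.out.println(ACCESSION_ID.matcher(header).matches());    // prints true

        // After "sp|...|" this pattern needs a non-whitespace character, but finds a space.
        System.out.println(SP_ACCESSION_ID.matcher(header).matches()); // prints false

        // Everything before the first '|' becomes the accession.
        System.out.println(accessionBeforeFirstPipe(header));          // prints sp
    }
}
```

Under these assumptions the 'sp' pattern fails on the space directly after the second '|', while the underscore introduced by the tag lets the older pattern match, which would explain the 'sp' accession.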

The header line does not get caught where, I think, it should:

} else if (aFASTAHeader.matches("^sp\\|[^|]*\\|[^\\s]+_[^\\s]+ .*")) {
                    // New (September 2008 and beyond) standard SwissProt header as
                    // present in the Expasy FTP FASTA file.
                    // Is formatted something like this:
                    //  >sp|accession|ID descr rest (including taxonomy, if available)
                    String tempHeader = aFASTAHeader.substring(3);
                    result.iAccession = tempHeader.substring(0, tempHeader.indexOf("|")).trim();
                    // See if there is location information.
                    if (result.iAccession.matches("[^\\(]+\\([\\d]+ [\\d]\\)$")) {
                        int openBracket = result.iAccession.indexOf("(");
                        result.iStart = Integer.parseInt(result.iAccession.substring(openBracket, result.iAccession.indexOf(" ", openBracket)).trim());
                        result.iEnd = Integer.parseInt(result.iAccession.substring(result.iAccession.indexOf(" ", openBracket), result.iAccession.indexOf(")")).trim());
                        result.iAccession = result.iAccession.substring(0, openBracket).trim();
                    } else if (result.iAccession.matches("[^\\(]+\\([\\d]+-[\\d]+\\)$")) {
                        int openBracket = result.iAccession.indexOf("(");
                        result.iStart = Integer.parseInt(result.iAccession.substring(openBracket + 1, result.iAccession.indexOf("-", openBracket)).trim());
                        result.iEnd = Integer.parseInt(result.iAccession.substring(result.iAccession.indexOf("-", openBracket) + 1, result.iAccession.indexOf(")")).trim());
                        result.iAccession = result.iAccession.substring(0, openBracket).trim();
                    }
                    result.databaseType = DatabaseType.UniProt;
                    result.iID = "sp";
                    result.iDescription = tempHeader.substring(tempHeader.indexOf("|") + 1);

                    // try to get the gene name and taxonomy
                    parseUniProtDescription(result);

}

At the moment, I am having the user either remove the sihumi_ tag or replace the '_' with '-'. In both cases the header is then processed by the final else branch, so it still does not reach the 'sp' section.
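The effect of the two workarounds can be checked with the same two regular expressions quoted earlier in this report. This is only a sketch: it assumes the leading '>' is stripped before matching and that the headers contain the space after the second '|' exactly as pasted. The class name and the shortened headers are illustrative. The last variant, without that space, is included purely for comparison.

```java
import java.util.regex.Pattern;

class HeaderWorkaroundDemo {

    // Both patterns are copied from the branches quoted in this report.
    static final Pattern ACCESSION_ID = Pattern.compile("^[^\\s]*\\|[^\\s]+_[^\\s]+ .*");
    static final Pattern SP_ACCESSION_ID = Pattern.compile("^sp\\|[^|]*\\|[^\\s]+_[^\\s]+ .*");

    // Returns which of the two quoted branches would accept the header, if any.
    static String describe(String header) {
        boolean sp = SP_ACCESSION_ID.matcher(header).matches();
        boolean generic = ACCESSION_ID.matcher(header).matches();
        return "sp branch: " + sp + ", accession|ID branch: " + generic;
    }

    public static void main(String[] args) {
        // Workaround 1: tag removed. No underscore before the space, and a space after the second '|'.
        System.out.println(describe("sp|Q8AAB1| GLMS_BACTN Glutamine--fructose-6-phosphate"));

        // Workaround 2: '_' replaced by '-'. Same situation as workaround 1.
        System.out.println(describe("sp|sihumi-Q8AAB1| GLMS_BACTN Glutamine--fructose-6-phosphate"));

        // Comparison only: tagged header without the space after the second '|'.
        System.out.println(describe("sp|sihumi_Q8AAB1|GLMS_BACTN Glutamine--fructose-6-phosphate"));
    }
}
```

With the headers as written, neither pattern matches the two workaround variants, which is consistent with them falling through to a later branch. The variant without the space does match the 'sp' pattern, so the space after the second '|' may matter as much as the tag itself.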

hbarsnes commented 5 years ago

The parser tries to parse each header as the type indicated by its content, which for the header in question is assumed to be SwissProt. But because the altered header no longer follows the rules for SwissProt headers, the parsing breaks down.

Therefore, when using custom headers, or when making changes to standard headers, we recommend using our non-standard header format instead: https://github.com/compomics/searchgui/wiki/DatabaseHelp#non-standard-fasta. This will ensure that custom headers are parsed correctly.

tmcgowan commented 5 years ago

Thanks.