ipezoa closed 1 year ago
One or more headers are triggering the parser for TAIR sequences, which means they are in a format that is not compatible with the program. I suggest going over all sequences that are not from UniProt, NCBI or ENSEMBL and making their headers as simple as possible, as I describe in the wiki.
@prvst It seems this is a recurring theme. Maybe it could be worth it to implement some header validation so the program fails in a more productive way and the user can see immediately what's incompatible with the analysis.
I think I can report the header with the error message, but chances are more than one will be wrong. Since we found the problem I'll close the issue, but feel free to open it again if you need help with the formatting described in the Wiki.
The point I was trying to make is that the error currently states "panic: runtime error: index out of range [1] with length 1", but it could be something like "The headers of one or more protein entries do not comply with the required format. For further details see [insert link to wiki]." That would already be much better than the index out of range error, which is non-trivial to debug for normal users.
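To make the suggestion above concrete, here is a minimal Go sketch of a pre-check that collects every offending header and reports them together with a pointer to the wiki. `checkTairHeaders` is a hypothetical helper, not a function in philosopher, and the rule it applies (a header starting with "AT" needs at least three `|`-separated fields) is an assumption based on the discussion in this thread, not on the actual parser.

```go
package main

import (
	"fmt"
	"strings"
)

// wikiURL is the header-formatting page of the philosopher wiki.
const wikiURL = "https://github.com/Nesvilab/philosopher/wiki/How-to-Prepare-a-Protein-Database#header-formatting"

// checkTairHeaders is a hypothetical helper (not part of philosopher).
// Assumption: a header that starts with "AT" is treated as TAIR and must
// contain at least three '|'-separated fields to be parsed safely.
func checkTairHeaders(headers []string) error {
	var bad []string
	for _, h := range headers {
		if strings.HasPrefix(h, "AT") && len(strings.Split(h, "|")) < 3 {
			bad = append(bad, h)
		}
	}
	if len(bad) == 0 {
		return nil
	}
	// Report all offending headers at once instead of stopping at the first.
	return fmt.Errorf("%d protein header(s) do not comply with the required format:\n  %s\nFor further details see %s",
		len(bad), strings.Join(bad, "\n  "), wikiURL)
}

func main() {
	headers := []string{
		"AT10A_MOUSE Probable phospholipid-transporting ATPase VA",
		"P12345 illustrative generic entry", // made-up example header
	}
	if err := checkTairHeaders(headers); err != nil {
		fmt.Println(err)
	}
}
```

Listing every offending header in one message addresses the concern that more than one header is likely to be wrong.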
I encountered this error recently, with the following FASTA file entry:
>AT10A_MOUSE Probable phospholipid-transporting ATPase VA; EC=3.6.3.1; AltName: Full=ATPase class V type 10A; AltName: Full=P-locus fat-associated ATPase; Accession=O54827; Q8R3B8; [Mus musculus (Mouse)]
MERELPAAEESASSGWRRPRRRRWEGRTRTVRSNLLPPLGTEDSTIGAPKGERLLMRGCI
I have worked with FASTA files for 15+ years and I can attest that protein header lines (>name description) vary widely, and often contain only a name and no description. When writing software that reads FASTA files, it is good practice to never make assumptions and to always check whether a split or RegEx match fits the expected pattern, falling back to simpler processing of the protein header line if necessary.
I agree with @fabianegli that it's unfortunate that your code does not check whether the protein name actually fits the expected pattern, and instead reports an index out of range error message. I would suggest you add an array length check after splitting on the vertical bar here: https://github.com/Nesvilab/philosopher/blob/master/lib/dat/db.go#L483
If the array length is less than 3, return a generic protein record of just the protein name (using text up to the first space, or the end of the line). Or, show the error but point people to the documentation at https://github.com/Nesvilab/philosopher/wiki/How-to-Prepare-a-Protein-Database#header-formatting
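A rough Go sketch of the length check with a generic fallback might look like the following. `Record` and `parseTairLike` are hypothetical names, not the types or functions in `db.go`, and the mapping of the third field to the description is an assumption about the TAIR layout.

```go
package main

import (
	"fmt"
	"strings"
)

// Record is a hypothetical, simplified stand-in for a parsed protein entry.
type Record struct {
	ID          string
	Description string
	Generic     bool // true when the fallback path was taken
}

// parseTairLike checks the number of fields before indexing into them.
// Assumption: TAIR-style headers have at least three '|'-separated fields,
// with the ID in the first and the description in the third.
func parseTairLike(header string) Record {
	header = strings.TrimPrefix(header, ">")
	parts := strings.Split(header, "|")
	if len(parts) < 3 {
		// Fallback: use the text up to the first space (or the whole
		// line when there is no space) as the protein name.
		id, _, _ := strings.Cut(header, " ")
		return Record{ID: id, Generic: true}
	}
	return Record{
		ID:          strings.TrimSpace(parts[0]),
		Description: strings.TrimSpace(parts[2]),
	}
}

func main() {
	fmt.Printf("%+v\n", parseTairLike(">AT10A_MOUSE Probable phospholipid-transporting ATPase VA"))
	fmt.Printf("%+v\n", parseTairLike(">ZP1_MOUSE"))
}
```

With this shape, the header that triggered the panic would come back as a generic record instead of crashing the run.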
Another thing you could do is to make the following check more sophisticated, by using a RegEx to ensure the line really is a TAIR protein instead of just asking "does it start with AT?": https://github.com/Nesvilab/philosopher/blob/master/lib/dat/db.go#L599
Two possible regular expression tests:
>AT[^|]+\|[^|]+\|
>AT[^|]+\| *Symbols:[^|]+\|
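As a quick sketch of how these two tests could be used from Go: the patterns below are the ones above with a leading `^` anchor added (my addition), and `looksLikeTair` is a hypothetical helper name. The TAIR-style header in `main` is an illustrative example, not taken from a real database.

```go
package main

import (
	"fmt"
	"regexp"
)

// The two patterns proposed in this thread, anchored at the line start.
var (
	tairLoose  = regexp.MustCompile(`^>AT[^|]+\|[^|]+\|`)
	tairStrict = regexp.MustCompile(`^>AT[^|]+\| *Symbols:[^|]+\|`)
)

// looksLikeTair is a hypothetical replacement for a plain
// "starts with AT" prefix check.
func looksLikeTair(header string) bool {
	return tairLoose.MatchString(header)
}

func main() {
	headers := []string{
		// Header from this thread that was wrongly routed to the TAIR parser.
		">AT10A_MOUSE Probable phospholipid-transporting ATPase VA; [Mus musculus (Mouse)]",
		// Illustrative TAIR-style header (assumed layout).
		">AT1G01010.1 | Symbols: NAC001 | example description",
	}
	for _, h := range headers {
		fmt.Println(looksLikeTair(h), tairStrict.MatchString(h))
	}
}
```

Neither pattern matches the mouse entry because it contains no vertical bars, so it would no longer be sent down the TAIR path.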
For my part, I will add some logic to our automation code to auto-rename proteins that start with "AT".
Thank you for suggesting that we read over the FASTA file header formatting documentation.
Here's a similar error, which I'll append here in case other people see this error message and search for a solution:
panic: runtime error: index out of range [1] with length 0
goroutine 1 [running]:
philosopher/lib/dat.ProcessNCBI({0xc0015af3b1, 0xe8}, {0xc001316000, 0x26f}, {0xc000020120, 0x4})
/workspace/philosopher/lib/dat/db.go:129 +0x5cc
philosopher/lib/dat.(*Base).ProcessDB(0xc000032180, {0xc00001c390, 0x30}, {0xc000020120, 0x4})
/workspace/philosopher/lib/dat/dat.go:139 +0x2f0
That came from this protein entry in my FASTA file (which is actually a combination of several individual FASTA files with a variety of formats):
>ZP1_MOUSE Zona pellucida sperm-binding protein 1; AltName: Full=Zona pellucida glycoprotein 1; Short=Zp-1; Contains: Processed zona pellucida sperm-binding protein 1; Flags: Precursor; Accession=Q62005; Q62016; [Mus musculus (Mouse)]
MAWGCFVVLLLLAAAPLRLGQRLHLEPGFEYSYDCGVRGMQLLVFPRPNQTVQFKVLDEF
It got flagged as being an NCBI entry due to starting with ZP, leading to a pattern matching error in the ProcessNCBI method: https://github.com/Nesvilab/philosopher/blob/master/lib/dat/db.go#L116
Hi Matthew. Thanks for the input. I have similar experience with FASTA files, and I agree with you on the format range. This is exactly why we cannot have a strict regex rule for parsing headers; even public databases like UniProt have variations, and people sometimes add custom tags to the headers (we had collaborators, for example, using UniProt entries, but with custom prefixes, instead of sp or tr). We had to find a middle ground between some formal structures and custom ones, hence the loose methods to identify the formats. This is also why my recommendation is that if you are not following one of the common formats strictly, do not change it, and just go with the "generic" format.
@prvst wouldn't it be an option to allow the users to supply their own regex to parse headers instead of guessing? It is generally considered good practice to give the users a choice when there is a choice to be made - especially if the alternative is guessing. There is even precedent for this in the field of proteomics in the MaxQuant FASTA database parsing, where users can choose and edit the "identifier rule". Here's a picture of this feature:
I am not sure if Go has named capture groups in its regexes, but they would be one way to allow the regex itself to define what's the id, name and other properties.
Perhaps, but it would be necessary to request multiple regexes since we have to capture different elements from the headers to provide a full report: things like the gene names, organism names, description, etc. We also have to keep in mind how third-party tools like the prophets parse the headers. Regardless, I'm changing the logic a little bit. I'll have something for the next release.
it would be necessary to request multiple regexes since we have to capture different elements from the headers to provide a full report. Things like the gene names, organism names, description, etc.
Multiple regexes could be a solution (if a multistep procedure is chosen for parsing a header - like splitting on | and then working on the resulting substrings), but one could suffice if named capture groups were to be used. This way it would be clear what has been found and what not. The found matches could also be checked against expectations in addition to being used for the report. Admittedly such "one covers all" regexes can be daunting and are prone to errors.
It seems that named capture groups are supported by golang. See here.
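For illustration, a single rule with named capture groups could look roughly like this in Go, using the standard `regexp` package with `(?P<name>...)` groups and `SubexpNames`. The rule itself and the `parseHeader` helper are hypothetical; they only show the mechanism and do not cover every field philosopher needs for its reports.

```go
package main

import (
	"fmt"
	"regexp"
)

// headerRule is a hypothetical user-supplied rule: an ID, an optional
// description, and an optional trailing "[organism]" block.
var headerRule = regexp.MustCompile(
	`^>(?P<id>\S+)(?:\s+(?P<description>.*?))?(?:\s+\[(?P<organism>[^\]]+)\])?$`)

// parseHeader returns the named groups of re as a map, or false when the
// header does not match the rule at all.
func parseHeader(re *regexp.Regexp, header string) (map[string]string, bool) {
	m := re.FindStringSubmatch(header)
	if m == nil {
		return nil, false
	}
	fields := make(map[string]string)
	for i, name := range re.SubexpNames() {
		if name != "" { // skip the whole match and unnamed groups
			fields[name] = m[i]
		}
	}
	return fields, true
}

func main() {
	h := ">ZP1_MOUSE Zona pellucida sperm-binding protein 1 [Mus musculus (Mouse)]"
	if fields, ok := parseHeader(headerRule, h); ok {
		fmt.Println(fields["id"])
		fmt.Println(fields["description"])
		fmt.Println(fields["organism"])
	}
}
```

An empty value for a named group tells the caller which element was not found, so matches can be checked against expectations before they are used in the report.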
I'm running this script to format a custom DB, but there's always the same error.
./philosopher workspace --init
./philosopher database --custom DB.fasta --contam
The error:
time="10:36:07" level=info msg="Executing Workspace v4.8.1"
time="10:36:08" level=info msg="Creating workspace"
time="10:36:08" level=info msg=Done
time="10:36:08" level=info msg="Executing Database v4.8.1"
time="10:36:08" level=info msg="Generating the target-decoy database"
time="11:52:14" level=info msg="Creating file"
panic: runtime error: index out of range [1] with length 1
goroutine 1 [running]:
philosopher/lib/dat.ProcessTair({0xecb8b1a1d1, 0xe}, {0xecb8b1dd40, 0xb2}, {, })
/workspace/philosopher/lib/dat/db.go:493 +0x3bf
philosopher/lib/dat.(*Base).ProcessDB(0xc0000d2360, {0xd6a04b46e0, 0x50}, {0xb373ca, 0x4})
/workspace/philosopher/lib/dat/dat.go:174 +0x92f
philosopher/lib/dat.Run({{0xc00002e6c0, 0x24}, {0xc00002e6f0, 0x2a}, {0xc00002e720, 0x29}, {0xc00002c3c0, 0x39}, {0xc00002e750, 0x30}, ...})
/workspace/philosopher/lib/dat/dat.go:117 +0x6fd
philosopher/cmd.glob..func4(0x26108e0?, {0xb3726a?, 0x3?, 0x3?})
/workspace/philosopher/cmd/database.go:23 +0x7e
github.com/spf13/cobra.(*Command).execute(0x26108e0, {0xc00013eed0, 0x3, 0x3})
/home/prvst/go/pkg/mod/github.com/spf13/cobra@v1.6.1/command.go:920 +0x847
github.com/spf13/cobra.(*Command).ExecuteC(0x260f4c0)
/home/prvst/go/pkg/mod/github.com/spf13/cobra@v1.6.1/command.go:1044 +0x3bd
github.com/spf13/cobra.(*Command).Execute(...)
/home/prvst/go/pkg/mod/github.com/spf13/cobra@v1.6.1/command.go:968
philosopher/cmd.Execute()
/workspace/philosopher/cmd/root.go:35 +0x25
main.main()
/workspace/philosopher/main.go:25 +0x90
Despite the error, the database with decoys and contaminants is created, but it does not work when used in FragPipe. It only generates the intermediate files such as ".pep.xml", ".pepXML" and ".pin", but not the final outputs in tsv, like "psm.tsv".
I read that it could be a problem with the headers, so I already tried changing the headers of my FASTA file by leaving just the IDs. I removed everything else from the headers to avoid conflicts with special characters, but I keep getting the same error. I would attach my DB here, but it is very big (about 62GB). It is made up of sequences from Swiss-Prot, TrEMBL, and NCBI Non Redundant Proteins, so I don't know why it is having an issue.
Any clues of what is happening?