Thyra closed this issue 2 years ago
The domain step fails with the following message:
java.lang.IllegalArgumentException: You have submitted a protein sequence which contains an asterix (*). This may be from an ORF prediction program. '*' is not a valid IUPAC amino acid character and amino acid sequences which go through our pipeline should not contain it. Please strip out all asterix characters from your sequence and resubmit your search.
The seqsim, fanngo, and mixmeth-blast steps run through fine.
The aggregate step failed, probably because there is no domain GAF file. We will try aggregating with the domain GAF from the modified run (* characters removed) and see if that resolves it. If it does, we should open an issue with GOMAP asking for an explicit error message in that case (e.g. if somebody forgets to run the domain step or doesn't notice that it failed).
The mixmeth step failed for both inputs (with and without *). The errors reported were:
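The kind of check suggested above could look roughly like the following. This is only a sketch: the step names are taken from this thread, and the `<basename>.<step>.gaf` naming scheme and the function names are assumptions for illustration, not GOMAP's actual layout or API.

```python
from pathlib import Path

# Step names as mentioned in this thread; the real GOMAP layout may differ.
EXPECTED_STEPS = ["seqsim", "domain", "fanngo", "mixmeth"]


def missing_gaf_steps(gaf_dir, basename, steps=EXPECTED_STEPS):
    """Return the steps whose GAF file is absent or empty."""
    missing = []
    for step in steps:
        # Hypothetical naming scheme: <basename>.<step>.gaf
        gaf = Path(gaf_dir) / f"{basename}.{step}.gaf"
        if not gaf.is_file() or gaf.stat().st_size == 0:
            missing.append(step)
    return missing


def check_before_aggregate(gaf_dir, basename):
    """Fail early with a readable message instead of a confusing downstream error."""
    missing = missing_gaf_steps(gaf_dir, basename)
    if missing:
        raise FileNotFoundError(
            "Cannot aggregate: no GAF output for step(s) "
            + ", ".join(missing)
            + ". Did each step run and finish without errors?"
        )
```

Calling `check_before_aggregate` at the start of the aggregate step would name the missing step directly, which covers both a step that was never run and one that failed unnoticed.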
ImportError: cannot import name NCBIStandalone
  File "/usr/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['python', 'run.py', '/workdir/barley_MNA/GOMAP-160517_barley_NEW_MNA/tmp/mixed-meth/pannzer/conf/160517_barley_NEW_MNA.8.conf']' returned non-zero exit status 1
The get_longest_transcript.py script seems to remove non-IUPAC characters (https://github.com/Dill-PICL/GOMAP/blob/master/code/utils/get_longest_transcript.py#L44). We used it for every input except barley.
The get_longest_transcript.py script does remove non-IUPAC characters. One change that may be useful would be for it to emit a warning when a sequence has an asterisk in the middle. Such asterisks could result from a sequencing error or from improper handling of the data. Currently, based on test cases, an asterisk in the middle of a sequence is removed and the residues on either side are joined. This may not be the best response: a warning would let researchers decide whether to keep the sequence with the internal asterisk removed, discard it, or take some other action.
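A minimal sketch of the proposed behaviour, for discussion only. It is not the code in get_longest_transcript.py; the function name `clean_protein` and the warning text are made up here. A trailing asterisk (the usual stop-codon marker) is dropped silently, while an internal one triggers a warning before it is removed.

```python
import warnings


def clean_protein(seq_id, seq):
    """Drop trailing '*' silently; warn about internal '*' before removing it."""
    trimmed = seq.strip().rstrip("*")  # a trailing '*' is just the stop codon
    internal = trimmed.count("*")      # anything left is inside the sequence
    if internal:
        warnings.warn(
            f"{seq_id}: removed {internal} internal '*' character(s); "
            "the sequence may contain a premature stop or a data error"
        )
    return trimmed.replace("*", "")
```

For example, `clean_protein("g1", "MKT*")` returns `"MKT"` without a warning, whereas `clean_protein("g2", "MK*T*")` also returns `"MKT"` but warns about one internal asterisk. Whether to remove, keep, or discard such sequences is a policy choice; the sketch only makes the removal visible.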
It seems that at least some tools produced different annotations when the * was present than when it was not. I don't just mean additional annotations; the GO terms themselves were different. Colleen is checking it out.
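One way to check this is to compare the GO terms per gene between the two GAF outputs. The sketch below relies only on the standard GAF column layout (column 2 is the DB Object ID, column 5 is the GO ID, lines starting with `!` are comments); the function names and any file names passed in are placeholders.

```python
from collections import defaultdict


def read_gaf(path):
    """Map each DB Object ID (GAF column 2) to its set of GO IDs (column 5)."""
    terms = defaultdict(set)
    with open(path) as handle:
        for line in handle:
            if line.startswith("!") or not line.strip():
                continue  # skip header/comment and blank lines
            cols = line.rstrip("\n").split("\t")
            if len(cols) >= 5:
                terms[cols[1]].add(cols[4])
    return terms


def diff_annotations(with_star, without_star):
    """Return {gene: (only_with_star, only_without_star)} for genes that differ."""
    diffs = {}
    for gene in sorted(set(with_star) | set(without_star)):
        a = with_star.get(gene, set())
        b = without_star.get(gene, set())
        if a != b:
            diffs[gene] = (a - b, b - a)
    return diffs
```

A gene whose entry has terms on both sides of the tuple received genuinely different GO terms, not just extra ones, which is the case described above.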
done