GOMAP behavior with * in the FASTA sequence

Dill-PICL / GOMAP-Paper-2019.1

Data, code, and results for our paper Standardized genome-wide function prediction enables comparative functional genomics: a new application area for Gene Ontologies in plants (https://doi.org/10.1093/gigascience/giac023). If you use any of our code or results in a scientific publication, we would be grateful if you cite the paper.

Creative Commons Zero v1.0 Universal


GOMAP behavior with * in the FASTA sequence #21

Closed Thyra closed 2 years ago

Thyra commented 5 years ago

[ ] Find out if any of the tools yield different results when there are * present in the sequence
[ ] Come up with a regex to find * at the end of sequences and another one to find * within them, while ignoring any *s in the headers
[ ] Develop ideas how the pipeline should react to each of these cases (ignore, complain, cleanup, fail ...)
[ ] Submit an issue in the GOMAP repo to discuss and implement desired pipeline behavior.
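As a starting point for the regex item above, here is a minimal sketch in Python. It assumes that a `*` at the very end of a record's sequence is a stop marker and that any other `*` in the sequence counts as internal. The names `TERMINAL_STAR`, `INTERNAL_STAR`, and `classify_asterisks` are made up for illustration and are not part of GOMAP. Header lines are skipped, and wrapped sequence lines are joined first so that a `*` at the end of a line is not mistaken for the end of the record.

```python
import re

# Assumption: one or more "*" at the very end of the joined sequence = terminal.
TERMINAL_STAR = re.compile(r"\*+$")
# Any "*" that is NOT followed only by further "*" up to the end = internal.
INTERNAL_STAR = re.compile(r"\*(?!\**$)")


def classify_asterisks(fasta_text):
    """Return {header: {"terminal": bool, "internal": int}} per FASTA record."""
    report = {}
    header, chunks = None, []

    def flush():
        # Store the result for the record collected so far, if any.
        if header is not None:
            seq = "".join(chunks)
            report[header] = {
                "terminal": bool(TERMINAL_STAR.search(seq)),
                "internal": len(INTERNAL_STAR.findall(seq)),
            }

    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # Header line: close the previous record; "*" here is ignored.
            flush()
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    flush()
    return report


if __name__ == "__main__":
    example = ">g1 note*\nMKL\nV*\n>g2\nMK*LV\n"
    for name, result in classify_asterisks(example).items():
        print(name, result)
```

Working on the joined sequence rather than line by line keeps the two patterns simple, at the cost of holding one record in memory at a time.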

Thyra commented 5 years ago

The domain step fails with the following message:

java.lang.IllegalArgumentException: You have submitted a protein sequence which contains an asterix (*). This may be from an ORF prediction program. '*' is not a valid IUPAC amino acid character and amino acid sequences which go through our pipeline should not contain it. Please strip out all asterix characters from your sequence and resubmit your search.

The seqsim, fanngo, and mixmeth-blast steps run through fine.

Thyra commented 5 years ago

The aggregate step failed, probably because there is no domain GAF file. We will try aggregating with the domain GAF from the modified output (*s removed) and see if that resolves it. If it does, we should submit an issue to GOMAP asking for an error message in that case (e.g. if somebody forgets to run the domain step or doesn't notice that it failed).

CFYanarella commented 5 years ago

The mixmeth step failed for both inputs (with and without *). The errors reported were:

ImportError: cannot import name NCBIStandalone

File "/usr/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['python', 'run.py', '/workdir/barley_MNA/GOMAP-160517_barley_NEW_MNA/tmp/mixed-meth/pannzer/conf/160517_barley_NEW_MNA.8.conf']' returned non-zero exit status 1

Thyra commented 5 years ago

get_longest_transcript.py seems to remove non-IUPAC characters (see https://github.com/Dill-PICL/GOMAP/blob/master/code/utils/get_longest_transcript.py#L44). We used it for every input except barley.

CFYanarella commented 5 years ago

get_longest_transcript.py does remove non-IUPAC characters. One change that may be useful would be for get_longest_transcript.py to emit a warning if a sequence has an asterisk in the middle. Such asterisks could be due to a sequencing error, or the data could have been manipulated improperly. Currently, based on test cases, an asterisk in the middle of a sequence is removed and the remaining residues are joined into one continuous sequence. This may not be the best response. A warning would let researchers decide whether to keep the sequence with the internal asterisk removed, discard the sequence, or take some other action.
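To make the suggestion concrete, a rough sketch of the warning behavior could look like the following. This is not the code in get_longest_transcript.py. The function name `clean_protein` and the wording of the message are hypothetical, and it assumes that trailing asterisks can be dropped silently while internal ones deserve a warning.

```python
import warnings


def clean_protein(seq_id, seq):
    """Drop trailing '*' silently; remove internal '*' but warn about it.

    Hypothetical helper for illustration only, not part of GOMAP.
    """
    trimmed = seq.rstrip("*")          # trailing stop marker(s)
    n_internal = trimmed.count("*")    # whatever is left is internal
    if n_internal:
        warnings.warn(
            "%s: removed %d internal '*' character(s); check the source sequence"
            % (seq_id, n_internal)
        )
    return trimmed.replace("*", "")


if __name__ == "__main__":
    print(clean_protein("g1", "MKLV*"))    # no warning expected
    print(clean_protein("g2", "MK*LV*"))   # warning expected
```

Using the standard `warnings` module means a caller could turn these warnings into errors with a filter, which would cover the "fail" option as well without changing the function.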

Thyra commented 5 years ago

It seems that at least some tools created different annotations when the * was present vs when it was not. I don't just mean additional ones, but actually the GO terms were different. Colleen is checking it out.
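One way to check this systematically would be to compare the GO terms per gene between the two runs. The sketch below assumes standard GAF 2.x files (tab-separated, comment lines starting with `!`, object ID in column 2, GO ID in column 5). The function names are made up for illustration, and the gene IDs and GO terms in the example are placeholders, not results from our runs.

```python
from collections import defaultdict


def go_terms_by_gene(gaf_lines):
    """Map each gene ID (GAF column 2) to its set of GO IDs (GAF column 5)."""
    terms = defaultdict(set)
    for line in gaf_lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and GAF header/comment lines
        cols = line.rstrip("\n").split("\t")
        terms[cols[1]].add(cols[4])
    return terms


def diff_annotations(gaf_with_star, gaf_without_star):
    """Return {gene: (only_with_star, only_without_star)} for genes that differ."""
    a = go_terms_by_gene(gaf_with_star)
    b = go_terms_by_gene(gaf_without_star)
    diff = {}
    for gene in sorted(set(a) | set(b)):
        only_a = a.get(gene, set()) - b.get(gene, set())
        only_b = b.get(gene, set()) - a.get(gene, set())
        if only_a or only_b:
            diff[gene] = (only_a, only_b)
    return diff


if __name__ == "__main__":
    # Placeholder records, not real GOMAP output.
    with_star = ["!gaf-version: 2.0\n", "DB\tgeneA\tgeneA\t\tGO:0000001\n"]
    without_star = ["!gaf-version: 2.0\n", "DB\tgeneA\tgeneA\t\tGO:0000002\n"]
    print(diff_annotations(with_star, without_star))
```

A gene that shows terms on both sides of the tuple has genuinely different annotations, while a gene with terms on only one side just gained or lost some.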

Thyra commented 2 years ago

done