Closed hysong0921 closed 3 months ago
Hi @hysong0921 ,
I believe that this error is due to the wheat genome having more than 100k protein sequences, which is the limit of hmmsearch
used by resistify
.
To resolve this, I would advise splitting the input fasta file and running resistify
individually on each of these, then merging the results files downstream of this.
I think even just splitting the fasta file into two files of ~50k sequences should be sufficient?
I have not encountered this limit before - next time I update resistify
I'll add a function to handle this internally!
Thanks for letting me know.
Thankyou for your advise actually i am fresh in Bioinformatics,your work is truly outstanding! You helped me solve a big problem. In fact, most wheat protein sequences aren't that large, with only a few exceptions. Previously, I was using Pfam entries for NB-ARC to screen for NLR, which was both cumbersome and inefficient. Thanks to your results, I've completed more than two weeks' worth of work in just a few days! I have two more questions for you:
When I was using another wheat genome, Chinese Spring, I encountered the following error:
ERROR Error running jackhmmer: nlrexpress.py:101 Stderr: Fatal exception (source file p7_alidisplay.c, line 1315): backconverted trace didn't end at expected place on model (A0A0E0PWH5/TraesCS6B03G1059900LC.1)
Stdout:# jackhmmer :: iteratively search a protein sequence against a protein database
Query: TraesCS6B03G0739800LC.1 [L=615]
Scores for complete sequences (score includes all domains): --- full sequence --- --- best 1 domain --- -#dom- E-value score bias E-value score bias exp N Sequence Description
A0A6L2MNW4
score bias c-Evalue i-Evalue hmmfrom hmm to alifrom ali to envfrom env to
acc
1 ! 20.0 0.3 5.7e-06 1.8e-05 278 382 .. 807 912 .. 782 923 .. 0.77
A0A2N9GIK0 [No individual domains that satisfy reporting thresholds (although complete target did)]
A0A6P8BUS7 [No individual domains that satisfy reporting thresholds (although complete target did)]
A0A2I0HQ87 [No individual domains that satisfy reporting thresholds (although complete target did)]
A0A446Z0E9 [No individual domains that satisfy reporting thresholds (although complete target did)]
A0A2Z7BLM6
score bias c-Evalue i-Evalue hmmfrom hmm to alifrom ali to envfrom env to
acc
1 ! 21.5 0.1 2e-06 6.3e-06 96 201 .. 327 433 .. 243 471 .. 0.78
A0A072UA60
score
this is my code : resistify --threads 20 /dt1/hysong/02.genome/wheat/protein/iwgsc_refseqv2.1_annotation_200916_ALL_pep.fasta /dt1/hysong/04.Resistify_NLR_result/Triticum_aestivum_cs I don't know why this problem occurred. The protein sequence for this wheat genome is about 90Mb, but when I tried a protein sequence of 87Mb, the program ran correctly. 2.To confirm, is the threshold used to screen for NLRs genes 1e-5?
I'm glad you're patient with a newbie, I really appreciate it and believe me, there are already a lot of famous PI's in China who are looking forward to your work because a lot of people are reading your biorxiv and pushing it to me to get me started!wish you success!
Hi @hysong0921 ,
No worries! I am glad to hear it is helpful.
That error is slightly more cryptic... I believe it is linked to this error: https://github.com/EddyRivasLab/hmmer/issues/153
Do any of your sequences contain internal stop codons ( "*" ) or other non-standard symbols? resistify
will remove stop codons at the end of sequences, but it currently does not handle internal stops. If you download seqkit you could run:
seqkit grep --by-seq --invert-match --pattern '*' ${your_sequence}
And it will print any sequences that might be causing this.
Alternatively if you are happy to send me the file which is causing problems I can take a look at it myself.
Thanks!
Thank you for your response. I found the cause of the issue. The sequence in question comes from the Chinese Spring wheat v2.1 protein sequence, available at the following URL: https://urgi.versailles.inra.fr/download/iwgsc/IWGSC_RefSeq_Annotations/v2.1/iwgsc_refseqv2.1_gene_annotation_200916.zip. When I used the following command on the merged protein sequence: grep -v ">" /dt1/hysong/02.genome/wheat/protein/iwgsc_refseqv2.1_annotation_200916_ALL_pep.fasta | grep "." | less I found many "." characters. The problematic gene sequence is as follows:
TraesCS6B03G1059900LC.1 MLIFICPCYRYFVVVDDIWNVNTWDIINCAFPATSSTSRIITTTRINSVAHSCCSSFSVRIYNIRALDMV HSR.LFHRRLFKSDEDCPSYLQEISELILKCHGLPLAIIAISGLLANIEKIEHL.NHVKDSIGLALERNS SVEGMMKILSLSYFDLPRYLKTCLLYLSMYLEDSTIAKYGLIRRWIAEGFIQTKGRYMAYELGERCFNEL LNKGLIQPGRTDNYGIVMSCRVHDTKLDFIIFKSIEQNFVTLLGVPILTIGNKAKVRRLCLQGVKEGIST VLIADLVFSHVRSLTMVRGLLEIPSLEKFKHLRVLNLMYCSELEDHHLDNIVRLFQLRYLNLKHTRIRKL PEQIGHLRCLEMLDLQHTCVEELPASIVNLRLMDLLVGDDVKFPEGIAKMQALETLEHVNAVIQPLDFLC GLGQLMNLRNLGLTLDFDFDSEIEDTDVECKKCILSSLCKLGIQSLRSLTIWHEDCSLLHGESLCLPTLE VLII.YVPAFRQFPTWVVSLRNLQRLHLAVEGLK.DDLCILGLPSLLVLHLGGERESKGKLRINGEVGFR PLKIFIYHARSKPVDLMFGAVSMPKLEKLELHLLRLVEADSPGFGIGNIPCLSIITFITVRGDHDIVEAV KIAMKNRPSVL Everything worked fine when I removed the "." characters from the sequence. Thank you for your suggestion.
Hi @hysong0921
Great I'm glad that worked! I will add some warnings to resistify
if it detects non-standard symbols in the input. Do you have any further questions? Otherwise I will close this issue :)
Thank you again for your work! When I used the command line to remove the ".", it also removed the "." in the gene IDs, which wasn't what I wanted, but it didn't affect anything. Your work is excellent!
P.S.: I currently live in Beijing, and if you don't mind, I'd like to send you some interesting little gifts by mail as a token of my appreciation! My email is haoyuansong2000@163.com. If you don't mind, please send me your mailing address via email! Thank you, and I don't have any more questions.
hi i have a problem in $ resistify --threads 20 /dt1/hysong/02.genome/wheat/protein/Triticum_aestivum_Kariega_v1-prot.fasta /dt1/hysong/04.Resistify_NLR_result/Triticum_aestivum_Kariega_v1
[10:29:53] INFO Welcome to Resistify version 0.2.0! main.py:34 [10:30:42] INFO Running hmmsearch... hmmsearch.py:22 [10:30:43] ERROR Error running hmmsearch: hmmsearch.py:33 Stderr: Fatal exception (source file p7_pipeline.c, line 697):
Target sequence length > 100K, over comparison pipeline limit.
(Did you mean to use nhmmer/nhmmscan?)
when i use $ resistify --threads 20 --nhmmer /dt1/hysong/02.genome/wheat/protein/Triticum_aestivum_Kariega_v1-prot.fasta /dt1/hysong/04.Resistify_NLR_result/Triticum_aestivum_Kariega_v1
[10:45:09] INFO Welcome to Resistify version 0.2.0! main.py:34 Usage: resistify [-h] [-t THREADS] [--ultra] [--chunksize CHUNKSIZE] [--evalue EVALUE] [--lrr_gap LRR_GAP] [--lrr_length LRR_LENGTH] [--duplicate_gap DUPLICATE_GAP] input outdir resistify: error: unrecognized arguments: --nhmmer can i split 100k each then combine ? or other ways? thanyou