SwiftSeal / resistify

Resistify is a program which rapidly identifies and classifies plant resistance genes from protein sequences. It is designed to be lightweight and easy to use.
GNU General Public License v3.0
26 stars 2 forks source link

hmmer #11

Closed hysong0921 closed 3 months ago

hysong0921 commented 3 months ago

hi i have a problem in $ resistify --threads 20 /dt1/hysong/02.genome/wheat/protein/Triticum_aestivum_Kariega_v1-prot.fasta /dt1/hysong/04.Resistify_NLR_result/Triticum_aestivum_Kariega_v1
[10:29:53] INFO Welcome to Resistify version 0.2.0! main.py:34 [10:30:42] INFO Running hmmsearch... hmmsearch.py:22 [10:30:43] ERROR Error running hmmsearch: hmmsearch.py:33 Stderr: Fatal exception (source file p7_pipeline.c, line 697):
Target sequence length > 100K, over comparison pipeline limit.
(Did you mean to use nhmmer/nhmmscan?)

                Stdout:     

when i use $ resistify --threads 20 --nhmmer /dt1/hysong/02.genome/wheat/protein/Triticum_aestivum_Kariega_v1-prot.fasta /dt1/hysong/04.Resistify_NLR_result/Triticum_aestivum_Kariega_v1

[10:45:09] INFO Welcome to Resistify version 0.2.0! main.py:34 Usage: resistify [-h] [-t THREADS] [--ultra] [--chunksize CHUNKSIZE] [--evalue EVALUE] [--lrr_gap LRR_GAP] [--lrr_length LRR_LENGTH] [--duplicate_gap DUPLICATE_GAP] input outdir resistify: error: unrecognized arguments: --nhmmer can i split 100k each then combine ? or other ways? thanyou

SwiftSeal commented 3 months ago

Hi @hysong0921 ,

I believe that this error is due to the wheat genome having more than 100k protein sequences, which is the limit of hmmsearch used by resistify. To resolve this, I would advise splitting the input fasta file and running resistify individually on each of these, then merging the results files downstream of this. I think even just splitting the fasta file into two files of ~50k sequences should be sufficient?

I have not encountered this limit before - next time I update resistify I'll add a function to handle this internally! Thanks for letting me know.

hysong0921 commented 3 months ago

Thankyou for your advise actually i am fresh in Bioinformatics,your work is truly outstanding! You helped me solve a big problem. In fact, most wheat protein sequences aren't that large, with only a few exceptions. Previously, I was using Pfam entries for NB-ARC to screen for NLR, which was both cumbersome and inefficient. Thanks to your results, I've completed more than two weeks' worth of work in just a few days! I have two more questions for you:

When I was using another wheat genome, Chinese Spring, I encountered the following error:

ERROR Error running jackhmmer: nlrexpress.py:101 Stderr: Fatal exception (source file p7_alidisplay.c, line 1315): backconverted trace didn't end at expected place on model (A0A0E0PWH5/TraesCS6B03G1059900LC.1)

Stdout:# jackhmmer :: iteratively search a protein sequence against a protein database

HMMER 3.3.2 (Nov 2020); http://hmmer.org/

Copyright (C) 2020 Howard Hughes Medical Institute.

Freely distributed under the BSD open source license.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

query sequence file: /tmp/tmpz96ipg27/chunk_563.fasta

target sequence database: /tmp/tmpz96ipg27/nlrexpress.fasta

maximum iterations set to: 2

HMM checkpoint files output: /tmp/tmpz96ipg27/chunk_563.fasta.out-.hmm

show alignments in output: no

sequence reporting threshold: E-value <= 1e-05

domain reporting threshold: E-value <= 1e-05

number of worker threads: 4

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Query: TraesCS6B03G0739800LC.1 [L=615]

Scores for complete sequences (score includes all domains): --- full sequence --- --- best 1 domain --- -#dom- E-value score bias E-value score bias exp N Sequence Description


  • 3.4e-218 723.1 33.2 1.6e-36 123.2 0.2 8.1 6 A0A4D6NGK9
  • 2.4e-154 512.2 26.9 6.3e-40 134.4 0.5 7.7 4 A0A0E0PWH5
  • 4.7e-131 435.3 6.7 1e-60 203.1 0.4 3.2 3 A0A0D3H065
  • 4.3e-86 286.9 4.2 1.2e-53 179.8 0.2 4.5 3 A0A3B6PU10
  • 2.8e-81 271.0 0.3 9.8e-31 104.1 0.0 4.4 3 A0A3Q7GEF2
  • 5.7e-72 240.2 5.1 2.8e-32 109.2 0.0 4.7 4 A0A0E0F7D9 ....................................................................... .......................................................................
  • 9.3e-09 31.5 0.0 1.5e-08 30.8 0.0 1.2 1 A0A2N9J2Y0
  • 9.4e-09 31.5 0.0 1.5e-08 30.8 0.0 1.2 1 A0A4U5Q2J7
  • 1e-08 31.4 1.0 5.9e-08 28.8 0.1 2.0 1 A0A679BCK6
  • 1.1e-08 31.2 0.0 1.9e-08 30.5 0.0 1.2 1 A0A0P0WJQ7
  • 1.3e-08 31.0 0.0 7.5e-05 18.6 0.0 2.1 0 A0A4S4EDJ3
  • 1.4e-08 30.9 0.0 2.2e-08 30.3 0.0 1.2 1 A0A396JAF7
  • 1.7e-08 30.6 0.3 1.5e-05 20.9 0.1 2.1 1 A0A2H5QKL3

A0A6L2MNW4

score bias c-Evalue i-Evalue hmmfrom hmm to alifrom ali to envfrom env to

acc



1 ! 20.0 0.3 5.7e-06 1.8e-05 278 382 .. 807 912 .. 782 923 .. 0.77

A0A2N9GIK0 [No individual domains that satisfy reporting thresholds (although complete target did)]

A0A6P8BUS7 [No individual domains that satisfy reporting thresholds (although complete target did)]

A0A2I0HQ87 [No individual domains that satisfy reporting thresholds (although complete target did)]

A0A446Z0E9 [No individual domains that satisfy reporting thresholds (although complete target did)]

A0A2Z7BLM6

score bias c-Evalue i-Evalue hmmfrom hmm to alifrom ali to envfrom env to

acc



1 ! 21.5 0.1 2e-06 6.3e-06 96 201 .. 327 433 .. 243 471 .. 0.78

A0A072UA60

score

this is my code : resistify --threads 20 /dt1/hysong/02.genome/wheat/protein/iwgsc_refseqv2.1_annotation_200916_ALL_pep.fasta /dt1/hysong/04.Resistify_NLR_result/Triticum_aestivum_cs I don't know why this problem occurred. The protein sequence for this wheat genome is about 90Mb, but when I tried a protein sequence of 87Mb, the program ran correctly. 2.To confirm, is the threshold used to screen for NLRs genes 1e-5?

I'm glad you're patient with a newbie, I really appreciate it and believe me, there are already a lot of famous PI's in China who are looking forward to your work because a lot of people are reading your biorxiv and pushing it to me to get me started!wish you success!

SwiftSeal commented 3 months ago

Hi @hysong0921 ,

No worries! I am glad to hear it is helpful.

That error is slightly more cryptic... I believe it is linked to this error: https://github.com/EddyRivasLab/hmmer/issues/153 Do any of your sequences contain internal stop codons ( "*" ) or other non-standard symbols? resistify will remove stop codons at the end of sequences, but it currently does not handle internal stops. If you download seqkit you could run:

seqkit grep --by-seq --invert-match --pattern '*' ${your_sequence}

And it will print any sequences that might be causing this.

Alternatively if you are happy to send me the file which is causing problems I can take a look at it myself.

Thanks!

hysong0921 commented 3 months ago

Thank you for your response. I found the cause of the issue. The sequence in question comes from the Chinese Spring wheat v2.1 protein sequence, available at the following URL: https://urgi.versailles.inra.fr/download/iwgsc/IWGSC_RefSeq_Annotations/v2.1/iwgsc_refseqv2.1_gene_annotation_200916.zip. When I used the following command on the merged protein sequence: grep -v ">" /dt1/hysong/02.genome/wheat/protein/iwgsc_refseqv2.1_annotation_200916_ALL_pep.fasta | grep "." | less I found many "." characters. The problematic gene sequence is as follows:

TraesCS6B03G1059900LC.1 MLIFICPCYRYFVVVDDIWNVNTWDIINCAFPATSSTSRIITTTRINSVAHSCCSSFSVRIYNIRALDMV HSR.LFHRRLFKSDEDCPSYLQEISELILKCHGLPLAIIAISGLLANIEKIEHL.NHVKDSIGLALERNS SVEGMMKILSLSYFDLPRYLKTCLLYLSMYLEDSTIAKYGLIRRWIAEGFIQTKGRYMAYELGERCFNEL LNKGLIQPGRTDNYGIVMSCRVHDTKLDFIIFKSIEQNFVTLLGVPILTIGNKAKVRRLCLQGVKEGIST VLIADLVFSHVRSLTMVRGLLEIPSLEKFKHLRVLNLMYCSELEDHHLDNIVRLFQLRYLNLKHTRIRKL PEQIGHLRCLEMLDLQHTCVEELPASIVNLRLMDLLVGDDVKFPEGIAKMQALETLEHVNAVIQPLDFLC GLGQLMNLRNLGLTLDFDFDSEIEDTDVECKKCILSSLCKLGIQSLRSLTIWHEDCSLLHGESLCLPTLE VLII.YVPAFRQFPTWVVSLRNLQRLHLAVEGLK.DDLCILGLPSLLVLHLGGERESKGKLRINGEVGFR PLKIFIYHARSKPVDLMFGAVSMPKLEKLELHLLRLVEADSPGFGIGNIPCLSIITFITVRGDHDIVEAV KIAMKNRPSVL Everything worked fine when I removed the "." characters from the sequence. Thank you for your suggestion.

SwiftSeal commented 3 months ago

Hi @hysong0921

Great I'm glad that worked! I will add some warnings to resistify if it detects non-standard symbols in the input. Do you have any further questions? Otherwise I will close this issue :)

hysong0921 commented 3 months ago

Thank you again for your work! When I used the command line to remove the ".", it also removed the "." in the gene IDs, which wasn't what I wanted, but it didn't affect anything. Your work is excellent!

P.S.: I currently live in Beijing, and if you don't mind, I'd like to send you some interesting little gifts by mail as a token of my appreciation! My email is haoyuansong2000@163.com. If you don't mind, please send me your mailing address via email! Thank you, and I don't have any more questions.