CCMS-UCSD / GNPS_Workflows

Public Workflows at GNPS
https://gnps.ucsd.edu/

MetaMiner #471

Open HGuo-HKI opened 4 years ago

HGuo-HKI commented 4 years ago

https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=65694aaf25964885930fd3e4145b4fbb

thank you very much!

lfnothias commented 4 years ago

Hey @alexeigurevich, can you please look into this issue?

alexeigurevich commented 4 years ago

Hi all! Thanks for reporting the issue. The problem is that the current version of MetaMiner can't accept genome sequences in regular .gbk format (see below what I mean by "regular").

The currently accepted sequence formats are:

  1. raw nucleotide sequences in .fasta format (a high-quality reference or a draft assembly)
  2. antiSMASH's .final.gbk or .gbk file (it contains specific tags like sec_met and biosynthetic which are essential for MetaMiner)
  3. BOA's .annotated.txt file
  4. predicted and translated RiPP amino acid sequences in .fasta format (e.g. extracted from BOA, or antiSMASH, or other prediction tool output)

See the online documentation for more details (the link is on the workflow configuration page).

We recommend option (1). In this case, MetaMiner searches the entire genome for specific motifs related to various RiPP classes (cyanobactins, linaridins, etc.). In this search, MetaMiner tries all 6 possible translation frames to convert nucleotides into amino acids.
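To illustrate what "all 6 possible translation frames" means, here is a minimal Python sketch. It is not MetaMiner's code: the function names are made up for this example, and it assumes the standard genetic code (NCBI translation table 1), which may differ from what MetaMiner uses internally.

```python
from itertools import product

# Standard genetic code (NCBI table 1) in the canonical TCAG codon ordering.
# Assumption: MetaMiner may use a different table or handle ambiguity differently.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate codon by codon; codons not in the table (e.g. with N) become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq: str) -> dict:
    """Translate three forward frames (+1..+3) and three reverse frames (-1..-3)."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translations("ATGGCCTAA").items():
        print(name, protein)
```

Each of the six translated strings can then be scanned for RiPP-related motifs, which is why a plain nucleotide FASTA is enough for option (1).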

In your particular case, MetaMiner tried to interpret the .gbk file as option (2), and since your .gbk does not contain the antiSMASH-specific output tags, the workflow crashed.
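If you want to check in advance which kind of .gbk file you have, a rough heuristic is to look for the tags mentioned above. The Python sketch below is an assumption about how such a check could look, not MetaMiner's actual logic. The marker list only contains the two tags named in this thread, and the function name is hypothetical.

```python
# Marker strings taken from the list of accepted formats in this thread.
# Assumption: a plain substring search is good enough for a quick check.
ANTISMASH_MARKERS = ("sec_met", "biosynthetic")


def looks_like_antismash_gbk(gbk_text: str) -> bool:
    """Return True if any antiSMASH-style marker occurs in the GenBank text."""
    return any(marker in gbk_text for marker in ANTISMASH_MARKERS)


if __name__ == "__main__":
    import sys

    # Usage: python check_gbk.py genome.gbk
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        text = handle.read()
    if looks_like_antismash_gbk(text):
        print("antiSMASH-style tags found: usable as option (2)")
    else:
        print("no antiSMASH tags found: convert to nucleotide FASTA, option (1)")
```

Note that this check is crude: the word "biosynthetic" can also appear in ordinary product descriptions of a regular GenBank file, so a positive result does not guarantee that the file is antiSMASH output.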

The current workaround is to convert the .gbk into a FASTA file (or download the genome from NCBI in FASTA format directly). You can do the conversion online, e.g. here. There are two options: Extract Individual Features (as Amino Acid sequences or Nucleotide Sequences) or Extract Whole Sequence (Nucleotides). If you choose the former and get an amino acid FASTA, MetaMiner will interpret it as option (4) and the processing will be very slow, because all your amino acid sequences will be considered potential RiPPs and thoroughly scored, while usually only a small fraction of all CDS encode real RiPPs. If you choose the latter (Extract Whole Sequence), you will end up with MetaMiner option (1), which is the recommended way to run the tool.
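If you prefer to do the "Extract Whole Sequence" conversion locally, the following Python sketch shows one way. It is an illustration only: it assumes a standard GenBank flat file in which each record has a LOCUS line, an ORIGIN section with the sequence, and a closing `//` line. The function name and the 70-character line width are arbitrary choices, not anything MetaMiner requires.

```python
def genbank_to_fasta(gbk_text: str, line_width: int = 70) -> str:
    """Extract the whole nucleotide sequence of every record as FASTA text."""
    records = []
    name, in_origin, chunks = None, False, []
    for line in gbk_text.splitlines():
        if line.startswith("LOCUS"):
            fields = line.split()
            name = fields[1] if len(fields) > 1 else "unnamed"
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            # End of record: store the sequence and reset the state.
            seq = "".join(chunks).upper()
            if seq:
                records.append((name or "unnamed", seq))
            name, in_origin, chunks = None, False, []
        elif in_origin:
            # Sequence lines look like "        1 atgcatgcat gcatgcatgc";
            # keep only the letters, dropping positions and spaces.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    out = []
    for rec_name, seq in records:
        out.append(f">{rec_name}")
        out.extend(seq[i:i + line_width] for i in range(0, len(seq), line_width))
    return "\n".join(out) + ("\n" if out else "")


if __name__ == "__main__":
    import sys

    # Usage: python gbk2fasta.py genome.gbk > genome.fasta
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        sys.stdout.write(genbank_to_fasta(handle.read()))
```

The resulting nucleotide FASTA corresponds to option (1). Feature annotations are deliberately ignored here, since extracting individual CDS as amino acids would lead to the slow option (4) path described above.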

I downloaded your GBK file and converted it both ways. After that, I restarted the GNPS jobs as option (1) (nucleotide, full genome) and option (4) (protein, only features). The first job completed very quickly and found one relatively good match (albeit still not very trustworthy, since it is just above the minimum quality threshold). The second job has been running for more than 7 hours and has not finished yet.

My thoughts on this issue and future MetaMiner releases:

  1. We will accept "regular" .gbk files, parse them, and, if they are not antiSMASH output files, treat them as in option (1), i.e. we will convert GBK to a nucleotide sequence automatically.
  2. We will add a nice error message if the genome analysis step has failed (i.e. no potential RiPPs were found as in your example).

HGuo-HKI commented 4 years ago

Thank you for your kind help!

I am sorry to report a failed result again, since I could not make it work following your suggestions. The sequence file was uploaded as a .fasta file derived from the antiSMASH .gbk result after online conversion.

https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=af523f0c813e42ba801a242fc6c4af55
https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=38483845bcca4d76a05e8efcea14f609
https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=d59ca9169cad4c9182087f3520feb421

Thank you for your kind care in advance!