CCMS-UCSD / GNPS_Workflows

Public Workflows at GNPS
https://gnps.ucsd.edu/

[RiPPquest] workflow needs documentation - Test dataset would be useful #164

Open amcaraballor opened 5 years ago

amcaraballor commented 5 years ago

Is there a dataset that new users can use to learn and test the workflow? As far as I can tell, the following files are needed:

Spectrum Files (Required): mzML, mzXML, mgf?

Sequence Files (Required): fasta?

Spectra-Sequence Correspondence File (Optional): format? .csv, .tsv, .txt? template available?

mwang87 commented 5 years ago

@alexeigurevich should be able to answer that

alexeigurevich commented 5 years ago

Yes, the missing documentation is my fault; I plan to create it soon. There is more or less complete documentation for the command-line version of the tool (here), but not for the GNPS workflow. Note that in the command-line version we renamed RiPPquest to MetaMiner and improved it in some respects. The new publication will come out in Cell Systems soon; by that time I plan to update the GNPS workflow with the new functionality as well (and also rename it to MetaMiner).

You can find a small sample dataset for RiPPquest/MetaMiner in our GitHub repo here. For convenience, I attached here an archive with all these files (spectra and sequence) and a correspondence file (not available in the repo): RiPPquest_test_data.zip. For this data, you can use "Running mode: high-high" (the default is "high-low"). A sample job with this data is here.

You are right regarding all the file extensions:

Spectrum Files (Required): mzML, mzXML, mgf?

We natively support mzXML and MGF, and automatically convert all other formats (e.g. mzML) to MGF using the third-party msconvert utility.
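To illustrate the rule above, here is a minimal sketch of how one could decide locally which spectrum files would need conversion. How the GNPS workflow actually invokes msconvert is not stated in this thread, so the helper names and the exact command line are assumptions; the sketch relies only on msconvert's `--mgf` and `-o` options, and it builds the command without running it.

```python
from pathlib import PurePath

# Formats the reply above describes as natively supported (compared case-insensitively).
NATIVE_EXTENSIONS = {".mzxml", ".mgf"}


def needs_conversion(spectrum_file):
    """Return True if the file is not mzXML/MGF and would have to be converted to MGF."""
    return PurePath(spectrum_file).suffix.lower() not in NATIVE_EXTENSIONS


def msconvert_command(spectrum_file, out_dir):
    """Build (but do not run) a msconvert call that writes MGF output into out_dir.

    Assumption: this is only one plausible invocation, not necessarily
    the one used inside the GNPS workflow.
    """
    return ["msconvert", str(spectrum_file), "--mgf", "-o", str(out_dir)]


# Hypothetical file names, for illustration only.
for name in ["sample1.mzXML", "sample2.mgf", "sample3.mzML"]:
    if needs_conversion(name):
        print(name, "->", " ".join(msconvert_command(name, "converted")))
    else:
        print(name, "-> used as is")
```

The command list can be passed to `subprocess.run` if msconvert is installed locally.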

Sequence Files (Required): fasta?

Yes, we expect a fasta file with nucleotide sequence(s).
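Since the input has to be nucleotide rather than protein FASTA, a quick local sanity check before uploading may help. The sketch below is only a heuristic under my own assumptions: it treats a record as nucleotide if it consists solely of A, C, G, T and N, which is stricter than what the tool itself may accept (ambiguity codes other than N are rejected here). The function names are hypothetical and not part of RiPPquest/MetaMiner.

```python
# Assumption: only these letters count as nucleotide; other IUPAC
# ambiguity codes are deliberately excluded because many of them are
# also valid amino-acid letters.
NUCLEOTIDE_LETTERS = set("ACGTN")


def parse_fasta(text):
    """Parse FASTA text into {header: sequence}; a repeated header overwrites the earlier record."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def looks_like_nucleotide(sequence):
    """Return True if the sequence is non-empty and uses only A, C, G, T, N."""
    return bool(sequence) and set(sequence.upper()) <= NUCLEOTIDE_LETTERS


# Made-up example records, for illustration only.
example = ">contig1\nACGTTGCA\nNNACGT\n>protein1\nMKTLLVAG\n"
for name, seq in parse_fasta(example).items():
    print(name, "nucleotide" if looks_like_nucleotide(seq) else "not nucleotide")
```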

Spectra-Sequence Correspondence File (Optional): format? .csv, .tsv, .txt? template available?

The file should be tab-separated and have two columns listing the basenames of the spectra and sequence files. If it is not provided, an all-vs-all analysis is performed. The extension doesn't matter here; it could be any of .csv, .tsv, .txt, for instance. We expect the first column to contain the spectra files and the second one the sequence files. To change the order of the columns, you can use an optional header line: Sequence Spectra (use a tab in between; don't copy-paste the space character from here).
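Because a stray space instead of a tab is easy to introduce by hand, it may be safer to generate the correspondence file with a script. The following is a minimal sketch, not an official template: the function name and the file names are hypothetical, and I am assuming "basename" means the file name without its directory but with its extension, which the description above does not spell out.

```python
from pathlib import PurePath


def format_correspondence(pairs, sequence_first=False):
    """Return the tab-separated text for (spectra_file, sequence_file) pairs.

    By default the columns are spectra, then sequence, with no header.
    With sequence_first=True, the optional header line "Sequence<TAB>Spectra"
    is written and the columns are swapped accordingly.
    """
    lines = []
    if sequence_first:
        lines.append("Sequence\tSpectra")
    for spectra, sequence in pairs:
        # Assumption: keep the file name with extension, drop the directory.
        spectra_name = PurePath(spectra).name
        sequence_name = PurePath(sequence).name
        if sequence_first:
            row = (sequence_name, spectra_name)
        else:
            row = (spectra_name, sequence_name)
        lines.append("\t".join(row))
    return "\n".join(lines) + "\n"


# Hypothetical file names, for illustration only.
pairs = [
    ("data/sample1.mzXML", "genomes/genome1.fasta"),
    ("data/sample2.mgf", "genomes/genome2.fasta"),
]
print(format_correspondence(pairs), end="")
```

The returned text can be saved under any extension, for example with `pathlib.Path("correspondence.tsv").write_text(...)`.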

I will keep this issue open until I publish proper documentation for the RiPPquest GNPS workflow. Thanks.

amcaraballor commented 5 years ago

Thanks a lot, dear @alexeigurevich. This reply is super useful until MetaMiner gets incorporated into GNPS. I will keep you posted on any issues.

mwang87 commented 5 years ago

@alexeigurevich Thanks for the detailed response. Feel free to add a page to the https://github.com/CCMS-UCSD/GNPSDocumentation repository and we can make it live once the tool goes live.