Simplify output file names towards use of ISA-Tab files in workflows

pcm32 commented 8 years ago

Hi there!

Thanks for the great work on this. I would like to discuss an improvement that I would like to contribute. Currently, the mzml2isatab.xml Galaxy wrapper exposes its output files to the user only through the HTML file produced for Galaxy. This means that the generated i_Investigation.txt, s.txt and a.txt files cannot be used downstream of the tool in a Galaxy workflow. Additionally, the naming and location of these files make it impossible (at least for me) to discover them in Galaxy and bless them as direct outputs.

My suggestion would be:

Have a flag to wrapper.py which if present simplifies output to:
- Leave results in the working directory of Galaxy
- Use standard names for the i, s, and a files, so that they can be easily mapped to outputs.

The second sub-point assumes that there is only one i, one s, and one a file per set of mzML files provided. Is this always true? Is there a reason to have two separate a files, for instance? Maybe @djcomlab or @proccaserra can comment on this.

Fixed names and no study-name-based subdirectory would make it easy to use:

<outputs>
 <data name="ISA_I_File" format="txt" from_work_dir="i_Investigation.txt" label="ISA I File"/>
 <data name="ISA_S_File" format="txt" from_work_dir="s_samples.txt" label="ISA S File"/>
 <data name="ISA_A_File" format="txt" from_work_dir="a_file.txt" label="ISA A File"/>
</outputs>

Other discovery mechanisms, such as <discover_datasets designation="a_"/>, apparently don't allow workflow usage.

We are working on something to upload ISA files from Galaxy to MetaboLights, and hence we would like to be able to use the ISA files down the line in Galaxy.

An alternative would be to make wrapper.py able to produce a zip file with everything ready for upload.

Thanks!

pcm32 commented 8 years ago

A third option would be to instead modify the Cheetah code in the [CDATA... ] part to copy the files produced to a discoverable directory (the one associated with the HTML file is, to the best of my knowledge, not discoverable) under a defined name.

Let me know what you think so we can start doing something!

althonos commented 8 years ago

Hi Pablo, there can be cases of multiple assay files: for instance, with mzml2isa, if a study contains both positive and negative scans of samples, it will get split between a _POS file and a _NEG file...

However, it should be possible within wrapper.py to call full_parse using "." as the input dir and "output" as the output dir, and then use shutil to move the files from output/_STUDYNAME/ back into the Galaxy folder, renaming them in the process.
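The copy-and-rename step suggested above could look roughly like the sketch below. It does not call mzml2isa at all; it only assumes that the parser has already written i_Investigation.txt, one s_*.txt and one or more a_*.txt files into a study directory. The function name flatten_isa_outputs and the fixed target names (s_samples.txt, a_file.txt, a_file_POS.txt, a_file_NEG.txt) are hypothetical and chosen purely for illustration.

```python
import shutil
from pathlib import Path


def flatten_isa_outputs(study_dir, dest_dir="."):
    """Copy ISA-Tab files from study_dir into dest_dir under fixed names.

    Hypothetical helper: the fixed names below are illustrative only.
    Returns the list of file names written to dest_dir.
    """
    study_dir = Path(study_dir)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []

    # The investigation file already has a fixed name.
    shutil.copy(study_dir / "i_Investigation.txt", dest / "i_Investigation.txt")
    copied.append("i_Investigation.txt")

    # Assumption from the thread: exactly one study file is generated.
    study_files = sorted(study_dir.glob("s_*.txt"))
    if len(study_files) != 1:
        raise ValueError("expected exactly one s_ file, found %d" % len(study_files))
    shutil.copy(study_files[0], dest / "s_samples.txt")
    copied.append("s_samples.txt")

    # Assay files may be split by polarity (_POS / _NEG) or be a single file.
    for src in sorted(study_dir.glob("a_*.txt")):
        if src.stem.endswith("_POS"):
            target = "a_file_POS.txt"
        elif src.stem.endswith("_NEG"):
            target = "a_file_NEG.txt"
        else:
            target = "a_file.txt"
        if target in copied:
            # Two assay files would map to the same fixed name.
            raise ValueError("more than one assay file maps to " + target)
        shutil.copy(src, dest / target)
        copied.append(target)

    return copied
```

With fixed names like these, each file could be mapped to an output through from_work_dir, provided the polarity split is known in advance.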

pcm32 commented 8 years ago

Thanks Martin! OK, so we can have a collection of a files. I think we can deal with that using datasets, provided that they are sitting in the correct directory. Can we have collections of s files as well, or only of a_ files?

rsalek commented 8 years ago

Also, the names of the ISA files come from the title used. We can have standard fixed names, but only for the Galaxy version.

althonos commented 8 years ago

Well, the ISA design allows for multiple s_ but mzml2isa never generates more than one, so we're clear :)

And concerning the a_ files, it's simple enough I guess, it will be either two files: a_xxx_POS and a_xxx_NEG, or simply a_xxx for single polarity studies.

pcm32 commented 8 years ago

Yes, we would change the file names only for Galaxy; at upload or for other operations, we would revert to the study name.

rsalek commented 8 years ago

Well, the ISA design allows for multiple s_ but mzml2isa never generates more than one, so we're clear :)

True, this is a limitation within MetaboLights, as we cannot accept multiple s_....txt

Tomnl commented 8 years ago

Great to hear that you are working on a Galaxy to MetaboLights uploader.

If the names are changed at upload that should be OK but I think it might be better if we didn't have to change the names of the files.

Ralf (@RJMW) and I have discussed potentially using the dataset-collections tool to help us out.

It looks like you can discover files based on some sort of regex e.g.

<outputs>
        <collection type="list" label="$job_name" name="output1">
            <discover_datasets pattern="(?P&lt;name&gt;.*)" directory="SampleDataset" />
        </collection>
</outputs>

I am not sure how well it is implemented though.
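To make the regex idea concrete, the sketch below shows how a named-group pattern picks out assay files and derives an element name from each file name. It only illustrates the regular expression itself in plain Python; how Galaxy applies the pattern internally is not shown here. The pattern a_(?P<name>.*)\.txt and the helper match_assay_names are illustrative assumptions, not taken from the wrapper. In the tool XML, the angle brackets of the group would be escaped as &lt; and &gt;, as in the snippet above.

```python
import re

# Illustrative pattern: select assay files and capture the part after "a_".
ASSAY_PATTERN = re.compile(r"a_(?P<name>.*)\.txt")


def match_assay_names(filenames):
    """Return the captured 'name' group for every file name matching the pattern."""
    names = []
    for filename in filenames:
        match = ASSAY_PATTERN.fullmatch(filename)
        if match:
            names.append(match.group("name"))
    return names
```

For a split study this would yield one name per polarity file, while the i_ and s_ files are ignored.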

Perhaps the easiest way is the option to just zip the output? That way we can have a static standard name.
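A minimal sketch of the zip option, assuming the ISA files have already been written to a single study directory. The function name zip_isa_files and the default archive name isa_output.zip are hypothetical; only the standard zipfile module is used.

```python
import zipfile
from pathlib import Path


def zip_isa_files(study_dir, archive_path="isa_output.zip"):
    """Bundle the i_, s_ and a_ files found in study_dir into one zip archive.

    Hypothetical helper; returns the sorted list of archived file names.
    """
    study_dir = Path(study_dir)
    members = sorted(
        path for path in study_dir.iterdir()
        if path.is_file() and path.name[:2] in ("i_", "s_", "a_")
    )
    with zipfile.ZipFile(str(archive_path), "w", zipfile.ZIP_DEFLATED) as archive:
        for path in members:
            # Store files flat, under their original study-based names.
            archive.write(str(path), arcname=path.name)
    return [path.name for path in members]
```

Because the archive itself has a static name, it could be declared as a single output, while the files inside keep the names produced by mzml2isa.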

proccaserra commented 8 years ago

@pcm32 @althonos , the specifications allow for:

one i_xxx.txt file
one or many s_xxx.txt files (in case multiple studies are declared in the investigation)
one or many a_xxx files (each study may declare one or more assays). Note that multiple assay tables can cover different acquisition modes for mass spectrometry (neg/pos, as pointed out by Martin), but also cases where a study uses MS, NMR and HPLC, as seen in some submissions to MetaboLights

Furthermore, and of certain value for PhenoMenal, ISA archives could also contain assays for genomics and transcriptomics (RNA-Seq, genechip), which could also be piped into specific Galaxy workflows
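The file counts allowed by the specification, as listed above, can be expressed as a small check. This is only a sketch that looks at file-name prefixes; it does not parse the ISA-Tab content, and the function name check_isa_file_set is hypothetical.

```python
def check_isa_file_set(filenames):
    """Check a list of file names against the counts described above:
    exactly one i_ file, at least one s_ file, at least one a_ file.

    Returns a list of problem descriptions (empty if the set looks fine).
    """
    counts = {"i_": 0, "s_": 0, "a_": 0}
    for name in filenames:
        prefix = name[:2]
        if prefix in counts:
            counts[prefix] += 1

    problems = []
    if counts["i_"] != 1:
        problems.append("expected exactly one i_ file, found %d" % counts["i_"])
    if counts["s_"] < 1:
        problems.append("expected at least one s_ file")
    if counts["a_"] < 1:
        problems.append("expected at least one a_ file")
    return problems
```

A stricter variant for MetaboLights submissions could additionally reject more than one s_ file, given the limitation mentioned earlier in the thread.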

proccaserra commented 8 years ago

@rsalek @pcm32 @althonos A curation task could be to regularize all MetaboLights file names based on accession number, ISA file type and assay type/acquisition mode. This could be part of a post-processing step following data submission/deposition and assignment of an official EMBL-EBI MetaboLights accession number.

RJMW commented 8 years ago

Hi Pablo and Team,

the naming and location of these files makes them impossible (at least to me) to discover them in Galaxy to bless them as direct outputs.

You could use "$html_file.extra_files_path". It will basically give you the location / path of the different ISA files. The downstream tool should have the html file as an input, so you can call "extra_files_path" to retrieve the path to the different ISA files.

See here: https://wiki.galaxyproject.org/Admin/Tools/MultipleOutputFiles

A common usage of this strategy is to have the primary dataset be an HTML file and then store additional content (reports, pdfs, images, etc) in the dataset extra files directory. The content of this directory can be referenced using relative links with in the primary (HTML) file, clicking on the eye icon to view the dataset will display the HTML page.
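As a sketch of what a downstream tool could do with that path: assuming the tool's command passes the value of $html_file.extra_files_path to a script as a plain directory path, the script can walk the directory and group the ISA files by type. The helper name find_isa_files is hypothetical, and the directory layout (files possibly nested in a study subdirectory) is an assumption.

```python
import os


def find_isa_files(extra_files_path):
    """Walk extra_files_path and group ISA-Tab files by their prefix.

    Returns a dict with keys "i", "s" and "a", each holding a sorted
    list of full paths to the matching .txt files.
    """
    found = {"i": [], "s": [], "a": []}
    for root, _dirs, files in os.walk(extra_files_path):
        for filename in files:
            kind = filename[:1]
            # ISA-Tab files are named i_*.txt, s_*.txt or a_*.txt.
            if kind in found and filename[1:2] == "_" and filename.endswith(".txt"):
                found[kind].append(os.path.join(root, filename))
    for paths in found.values():
        paths.sort()
    return found
```

This keeps the study-based file names untouched, at the cost of every downstream tool needing the HTML dataset as its input.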

althonos commented 8 years ago

Don't you have access to the study name you pass as the argument of mzml2isa? In that case, files would be "i_Investigation.txt", "a_${study_name}_mass_spectrometry" or something, "s_${study_name}" ...

rsalek commented 8 years ago

... A curation task could be regularize all Metabolights file names based on accession number , ISA file types and assay type/acquisition mode. This could be part of a post processing step following data submission.

This is a great idea @proccaserra , I like it

althonos commented 8 years ago

Also, the mzml2isa.parsing.full_parse has an optional argument split which defaults to True, but which if set to False should prevent the splitting of the assay files. With such a configuration, the xml config would only need to contain the following:

<outputs>
 <data name="ISA_I_File" format="txt" from_work_dir="output/${name_of_study}/i_Investigation.txt" label="ISA I File"/>
 <data name="ISA_S_File" format="txt" from_work_dir="output/${name_of_study}/s_${name_of_study}.txt" label="ISA S File"/>
 <data name="ISA_A_File" format="txt" from_work_dir="output/${name_of_study}/a_${name_of_study}_metabolite_profiling_mass_spectrometry.txt" label="ISA A File"/>
</outputs>

pcm32 commented 8 years ago

Thanks all for the comments. I'll try them and report back. My understanding/experience is that you cannot always use those variable names in the different fields of the XML; only some of them appear to accept them. I have tried some of the suggested alternatives and I think they failed, but I will try all of them in order and report. If none of them works, I would lean towards a zip file containing all the ISA files as produced by mzml2isa. I will get back soon on this.

ISA-tools / mzml2isa-galaxy

Simplify output file names towards use of ISA-Tab files in workflows #3