Composite data type for ISA-Tab + plus data in the context of metabolomics (mass spec and NMR)

pcm32 commented 7 years ago

At the PhenoMeNal project we are working towards making large scale computation for metabolomics possible on cloud providers. We are working to make use of Galaxy at the user interface layer, on top of our stack. In this context, we are in need of a proper way of accessing ISA-Tab metadata files and accompanying data from Galaxy. We are hoping to contribute an ISA-Tab composite data type in the context of metabolomics (so, the composite would manage files like mzML, nmrML, etc.). Our current approach is just to hide all of these from Galaxy by putting files inside a tar/zip, this if course far from being the proper solution.

ISA lends itself very well for composite data types, as there is a single initial file (i_investigation file) from which the whole directory structure (s, a, maf and data files) can be explained (including assignment of defined experimental factor values to data files).

Initially our concerns for the use case are as follows:

[ ] Scalability: Metabolomics studies, specially mass spec ones, can be very large in terms of data size (many mzML files, each a few hundred MBs easily). As such we want a solution that scales well, paritcularly that avoids file duplication/copy. One should be able to download a MetaboLights study into Galaxy (and hopefully analyse it as well) without any file duplication. This is due to some studies reaching the hundreds of GBs in size.
[ ] Jobs distribution: Ability to distribute jobs that process the main data files (ie. mzMLs) contained in a study. We realise that job distribution is something done over collections rather than inside composite data types, but we wonder if with some guidance we could enable such a feature with some changes in the core galaxy code. Maybe you can have collections within composites?
[ ] Recognized by core metabolomics tools: we would aim to provide integration with some of the metabolomics tools already ported to Galaxy that need to process the main data files within this new composite.

Formats that we should support initially:

[ ] mzML
[ ] nmrML
[ ] ISA files
[ ] MAF file (maybe this could be done in an inherited version of the composite)

We have found so far some interesting examples, like this in proteomics, besides the commonly cited one for genetics. Any other good example that we should be aware of?

At this point we are after feedback from the Galaxy community and core developers on how to build this in the best possible way (most compliant with what Galaxy expects). I'm tagging people as requested: @dannon. More collaborators will add more comments as I share this link (I don't seem to be able to tag them here, I guess you can only tag project members).

bgruening commented 7 years ago

Very interesting. We have discussed recently to add ISA-Tab for genomics as well, concrete for ResearchObjects, BAGs and uploading to ENA.

The proteomics datatypes are now living here: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/datatypes/proteomics.py

@pcm32 do you have time drop by at http://fair-dom.org/communities/systems-biology-developers-foundry/3rd-sbdf

korseby commented 7 years ago

We also discussed this topic briefly @ SG2016.

Maybe @guerler can help us too, creating a dataset collection inside a composite data type and how to implement that.

pcm32 commented 7 years ago

@bgruening you explained to @korseby on some emails that is not possible to have collections inside composite data types. Could you explain what is the current impediment and maybe give us some guidance on what it would be necessary to implement and where? We would be happy to put the man power to make this a reality. Of course, if you think that collections inside composite data types is not the answer to be able to distribute jobs naturally within Galaxy, then it would be great if you explain what would be the best solution for this within Galaxy. Or alternatively, please direct us to the core developer that would be more familiar with the core code that is related to the functionality that we want to implement.

Thanks!

bgruening commented 7 years ago

As far as I know a collection can contain multiple datasets (preferably with the same format). A tool will then iterate over every element (dataset/format) in this collection. If you have now one format (composite in this case) that has multiple files attached - the logic is reversed and I would expect this breaks stuff. But maybe I'm totally wrong here. @jmchilton can you give some advice here.

As far as I understood the ISA is in your case just the metadata description for your files underneath, correct? What I could imagine could be quite powerful and would benefit the framework in general, is a more flexible metadata handling of datatypes and collections. Maybe we could annotate a collection with the ISA metadata? Would this address your use-cases? Do you need to have different filetypes in a collections?

pcm32 commented 7 years ago

Many thanks @bgruening, yes, that is exactly what we are after, annotate a set of files based on the ISA descriptors, or in other words, make Galaxy ISA-aware so that it can iterate through the data files based on what is recorded in the ISA files. Yes, we could have different filetypes within a collection, and yes, ISA is the metadata description of the data files that the ISA accompanies.

The other side to this is that ISA files relies on defined relative paths to the data files that it accompanies, which we understand won't be broken by Galaxy if they are considered part of a composite. We don't want galaxy to move files from that structure, but just use them where they are.

bgruening commented 7 years ago

The other side to this is that ISA files relies on defined relative paths to the data files that it accompanies, which we understand won't be broken by Galaxy if they are considered part of a composite. We don't want galaxy to move files from that structure, but just use them where they are.

You can link data into Galaxy without copying them, so this should not be a problem I guess. I would still allow to copy data into Galaxy if the user wants to do this. I'm more wondering if we can attach all data from the ISA metadata to a collection and put all files that are defined in this metadata into a collection (linked if needed). More of less replicating the ISA data in the Galaxy-DB. Exporting a collection with ISA-like data would create a valid ISA-tab file with the correct links - a BAG or RO kind of thing.

yvanlebras commented 7 years ago

Hi @bgruening @pcm32,

Interesting discussion! Maybe the work we made on the GenOuest core facility with @cmonjeau can be of interest... Some first comments/questions: -I see that using galaxy data collection, we can have non explained failed runs (ex: with bwa or Bowtie2 on data collection containing 16 fastq files). -There is interesting Galaxy tools to manage data collections (create a proper data collection from a data collection containing failed jobs / combine data collections into one....) who can be of interest -concerning the files move, I'm not sure to understand. You don't want to copy datasets. So I assume you have an hundreds Gb ISArchive containing the data files? On your laptop or on a server? If you don't want to move data, this means the Galaxy server has to be on the same place? This make me think about the use of a dedicated Galaxy flavour who can be set up on demand on the data files server.... this flavour / needed galaxy tools / recepy can be an ISA information... what do you think about that ? If the Galaxy flavour can be a good way, this means we can easily use "galaxy hack" like the one we made to manage zip ISArchive (probably the same kind as yours) and install automaticaly wf4metabo tools, proteomics datatypes and ISA oriented Galaxy tools....

IMO, if ISArchive are distributed, it's better to don't put the data on it, Just URL / URI / DOI pointing a data repo (your own or/and a public one like ENA....) and download the raw data files when needed on the analysis server.... If all your data is on the same place, having a Galaxy instance (classical or flavour) on the same appears to be a good idea and there is Galaxy tools linking local data (for example all data from a folder) to a Galaxy server db and thus on a history (we had this kind of tools at GenOuest core facility)...

proccaserra commented 7 years ago

@bgruening @pcm32 @yvanlebras , great thread! but I am still unclear at what is the blocker preventing ISA-Tab being a Composite Data Type.

ISA: (-> = points to)

i_investigation.txt file (tsv)
-> 1..n s_study_sample.txt (tsv) -> 1.. n a_assay.txt (tsv) each a_assay.txt (tsv ) -> 1..n raw data files (ie. mzML or nmrML or Fastq or CEL or wiff files or Tiff files) depending on the assay type declared in the i_investigation file. All raw data types may be supported as described in the ../galaxy/datatypes/proteomics.py
decoupling metadata fetching from raw data fetching. as pointed by Yvan
provide ability to subselect datafiles (~create collections) based on ISA assay type AND/OR based on ISA assay_type + filtering criteria (e.g. sample characteristics or predictor variables)

Can Galaxy support this use cases with Composite Data Types ? If not, what component needs to be amended and how can we organize handshake between Galaxy and PhenoMenal/ISA developers?

Since SRA was mentioned in the thread, I was wondering if there had been request for Galaxy to deal with SRA XML documents, all xml but 5 distinct schemas?

pkrog commented 6 years ago

Hi @bgruening , We are currently implementing an ISA datatype for Galaxy at PhenoMeNal, and are wondering if that would be acceptable to take advantage of the already existing ISA-API library (Python module isatools) in our code. I've seen for instance that in https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/datatypes/assembly.py the FASTA files are parsed "by hand" and the code never uses the SeqIO library. Is there a reason for that? Should we also not use our isatools module and recode some parsing?

bgruening commented 6 years ago

@pkrog we try to keep dependencies as low as possible, but if you need to have a third party module we can add this as a requirement. We did this for BAM files and various hdf5 libs. I think ISA is of great importance and I would argue that one dependency here is ok. It's always a balance, if you can rewrite the sniffer without a library in a few lines of code, go for it. If it is more reliable to use the library lets depend on this.

Please make sure there is a python wheel available for your library.

pkrog commented 6 years ago

Thank you for your answer @bgruening . Since, as of today, ISA format exists in two flavours (JSON and tabular), it seems to us more reasonable to rely on the module than to code from scratch (or copy code from the module). I've added isatools latest version (0.9.2) inside pinned-requirements.txt, to see if that works. Everything went fine, and here the list of the additional packages that were installed:

Successfully built isatools mzml2isa functools32 openpyxl jdcal et-xmlfile
Installing collected packages: functools32, jsonschema, python-dateutil, pandas, networkx, lxml, chardet, iso8601, jinja2, beautifulsoup4, bs4, pronto, jdcal, et-xmlfile, openpyxl, mzml2isa, biopython, python-utils, progressbar2, isatools
Successfully installed beautifulsoup4-4.6.0 biopython-1.66 bs4-0.0.1 chardet-3.0.4 et-xmlfile-1.0.1 functools32-3.2.3.post2 isatools-0.9.2 iso8601-0.1.12 jdcal-1.3 jinja2-2.9.6 jsonschema-2.6.0 lxml-4.1.0 mzml2isa-0.5.1 networkx-2.0 openpyxl-2.4.9 pandas-0.20.3 progressbar2-3.34.3 pronto-0.8.0 python-dateutil-2.6.1 python-utils-2.2.0

Tell me if you have concerns about this.

CC: @djcomlab @kikkomep @pcm32 @ilveroluca

djcomlab commented 6 years ago

If necessary we can probably create a slimmed-down isastools module. A large amount of the functionality is to support imports/exports for formats to/from ISA, hence the slightly lengthy list of requirements.

bgruening commented 6 years ago

A limed down version would be nice I think. Not sure we need to progressbar and so on. If pypi has no wheel we need add one to https://github.com/galaxyproject/starforge

ping @natefoo to be sure I don't talk too much bs here.

natefoo commented 6 years ago

Sure, not a problem.

pkrog commented 6 years ago

When importing isatab in the datatype I got the following error:

galaxy.datatypes.registry ERROR 2017-10-21 07:42:38,475 Error importing datatype module galaxy.datatypes.isa: cannot import name zip_longest
Traceback (most recent call last):
  File "/home/pierrick/dev/galaxy-isatab/lib/galaxy/datatypes/registry.py", line 209, in load_datatypes
    module = __import__( datatype_module )
  File "/home/pierrick/dev/galaxy-isatab/lib/galaxy/datatypes/isa.py", line 20, in <module>
    from isatools import isatab
  File "/home/pierrick/dev/galaxy-isatab/.venv/local/lib/python2.7/site-packages/isatools/isatab.py", line 21, in <module>
    from itertools import zip_longest
ImportError: cannot import name zip_longest
galaxy.datatypes.registry ERROR 2017-10-21 07:42:38,478 Error importing datatype module galaxy.datatypes.isa: cannot import name zip_longest
Traceback (most recent call last):
  File "/home/pierrick/dev/galaxy-isatab/lib/galaxy/datatypes/registry.py", line 209, in load_datatypes
    module = __import__( datatype_module )
  File "/home/pierrick/dev/galaxy-isatab/lib/galaxy/datatypes/isa.py", line 20, in <module>
    from isatools import isatab
  File "/home/pierrick/dev/galaxy-isatab/.venv/local/lib/python2.7/site-packages/isatools/isatab.py", line 21, in <module>
    from itertools import zip_longest
ImportError: cannot import name zip_longest
galaxy.datatypes.registry ERROR 2017-10-21 07:42:38,480 Error importing datatype module galaxy.datatypes.isa: cannot import name zip_longest
Traceback (most recent call last):
  File "/home/pierrick/dev/galaxy-isatab/lib/galaxy/datatypes/registry.py", line 209, in load_datatypes
    module = __import__( datatype_module )
  File "/home/pierrick/dev/galaxy-isatab/lib/galaxy/datatypes/isa.py", line 20, in <module>
    from isatools import isatab
  File "/home/pierrick/dev/galaxy-isatab/.venv/local/lib/python2.7/site-packages/isatools/isatab.py", line 21, in <module>
    from itertools import zip_longest
ImportError: cannot import name zip_longest

@djcomlab , is isatools compatible with Python 2 or does it work only in Python 3? Is this zip_longest element essential?

ilveroluca commented 6 years ago

@djcomlab and @pkrog, this guide on writing code compatible with both Python 2 and 3 might be useful:

http://python-future.org/compatible_idioms.html#itertools-filterfalse-zip-longest

It even specifically mentions zip_longest :-)

pkrog commented 6 years ago

Hi @ilveroluca , great link ! Thanks. @djcomlab , do you mind if I start working on this issue and propose you a pull request for isa-api?

djcomlab commented 6 years ago

@pkrog Actually we decided quite a long time ago to only support Python 3 for ISA API.

There was some effort to try convert everything already done to Python 2.7 support, but I abandoned this there were actually many things that needed fixing and it was taking too much time away from feature development. You can find that in the py2 branch: https://github.com/ISA-tools/isa-api/tree/py2

For supporting ISA files what exactly is needed for Galaxy? I think it's pretty clear the full set of features in ISA API probably won't be needed and should be kept encapsulated in Galaxy tools. @bgruening mentions a sniffer - this is not something that's in the ISA API. We have some ISA-Tab and ISA JSON I/O and validators though, not sure how useful those might be.

pkrog commented 6 years ago

You're right, we don't need to access full isa-api inside Galaxy, only from inside the tools. In Galaxy, only loading the investigation information (investigation file in ISA-Tab and .json in ISA-Json) seems sufficient to me. What do you think?

The sniffer is here to detect the type (is it really an ISA archive?), it has already been implemented by @kikkomep and relies only on archive content. So we don't need isatools for that part. The validators would be useful inside a tools I guess, but not in Galaxy itself (too heavy).

For now, we only plan to display the information on the ISA archive, so the user can check and/or find information about the archive he has uploaded/downloaded.

djcomlab commented 6 years ago

OK sure, sounds good. Let's get a sniffer to work without ISA API as it sounds straightforward enough, perhaps with some simple syntactic checks to see if the ISA files in an archive are ISA-y enough. Let me know if you need any help or to review it.

djcomlab commented 6 years ago

Oh another thought - we have two ISA formats now, ISA-Tab but also ISA JSON. Should we aim to support both?

pkrog commented 6 years ago

The sniffer is already written.

Yes we want to support both ISA-Tab and ISA-Json formats on Galaxy. That's why I would like to use isatools module, since I would not like to maintain a code that can read both formats inside Galaxy.

djcomlab commented 6 years ago

OK cool.

But what does Galaxy need to read out of the ISA files necessarily, and into what kind of data structure?

djcomlab commented 6 years ago

We're just looking at the sniffer. Do you need the sniffer to do more complex checks such as validating the ISA input? If so then we need isatools.

pkrog commented 6 years ago

No I don't think we need to use isatools in the sniffer, or maybe just for checking that the archive is an ISA archive. Currently we just check that there is a file i_investigation.txt for ISA-Tab.

What I'm talking is the method display_data(), not the sniffer. For display_data() we need to read at least the information contained inside the file i_investigation.txt for ISA-Tab and *.json for ISA-Json. Use of isatools would be a lot better here.

yvanlebras commented 6 years ago

Just see your post on "...It would be better to have the display of an HTML page that presents nicely the content of the study..." @pkrog and this just me think about an (old) idea I have using Bii or preferentially the ""new"" version: ISAexplorer (http://scientificdata.isa-explorer.org/ used by scientific data ?) to display ISA-tab contents on Galaxy through an HTML page or things like that... I was thinking also to a Galaxy Interactive Environment.... but it will be better through a "classical" Galaxy tool...

yvanlebras commented 6 years ago

... At least, If ISA-Explorer code is not open anymore, one idea can be to use the ISA-tab viewer (demo )..... or not ;)

djcomlab commented 6 years ago

@pkrog For display_data() we need to read at least the information contained inside the file if we can define exactly what we wish to display, I can port just the ISA reader parts to Python 2.7 and produce whatever summaries we need.

We just need an example of what data to display...

kikkomep commented 6 years ago

I was looking over our discussion here and took a closer look at the pure Javascript viewer mentioned by @yvanlebras. I tried it out and it seems to work pretty well. It doesn't support ISA-JSON at the moment, but I think we can add that pretty easily. I'm thinking that the idea of using a pure Javascript viewer like this one is a great option. It also allows us to avoid both the dependency issues that we're seeing and the Python 2-3 compatibility problem.

djcomlab commented 6 years ago

If you're going to implement stuff on ISA-tab viewer please do contribute directly back to the project on https://github.com/ISA-tools/ISATab-Viewer as it is not supported by the developers here in the ISA team anymore.

pkrog commented 6 years ago

I'd prefer to stick with simple plain HTML display, using a well maintained isatools library. But certainly the Javascript would be nicer, if the code is maintained. Maybe @kikkomep you can open a new issue for improvement in phnmnl/galaxy, and we will see that in the future.

djcomlab commented 6 years ago

Pure HTML should be straightforward to render the CSV tables. For other summary info (like counts of varying properties of samples etc, like in the SciData Explorer) could be presented with some nice graphs generated by matplotlib quite easily. I'm already working on this to put in isatools.

pkrog commented 6 years ago

Hi @bgruening , we are still working on the ISA type in Galaxy, and almost ready to propose a pull request. However we are unable to implement a feature. We would like to have the name of the dataset set to the ISA identifier contained in the ZIP file. But the name is fixed and it seems there is no opportunity to update it from the datatype class once it has been set from init_meta(). Would it be possible to have a callback to the datatype class, once the upload job is done, so we can load part of the content of the extracted ZIP archive, read the ISA identifier and reset the dataset name so that the new name is updated into the client browser? Please advise use on how to do it, since we are willingly to make the modifications ourselves and include it into the pull request.

bgruening commented 6 years ago

@pkrog great news! Just to make sure, if you are talking about dataset name you are referring to the history entry name? We call this display_name, or are you referring to the actual name of the file located on the hard-drive?

pkrog commented 6 years ago

Yes, the name displayed in the history.

pkrog commented 6 years ago

I can set it from inside init_meta() or set_meta(), using dataset.name = .... But when I'm inside those methods, I don't have the extracted archive yet.

bgruening commented 6 years ago

Do you have some code I can look at? Inside set_meta you should be able to set it and read the archive right?

pkrog commented 6 years ago

You're right, inside set_meta() is called I can read the archive and set the name. But the display in the client browser isn't updated.

pkrog commented 6 years ago

You can see my code here: https://github.com/phnmnl/galaxy/tree/feature/isa-dataset-renaming. I've put many debug messages to understand the flow of execution.

pkrog commented 6 years ago

There is also a method display_name(), inside the class Data, that I have overwritten, but it seems never to be called.

bgruening commented 6 years ago

@pkrog I can look at this in more detail the coming days if this is ok for you. I don't know whats happening from the top of my head, but your code is a great start.

pkrog commented 6 years ago

Ok, that will be fine. Thanks @bgruening .

bgruening commented 6 years ago

@pkrog try the following.

Add this to the upload.py file:

    # Write the job info
    stdout = stdout or 'uploaded %s file' % data_type

+    foo_datatype = registry.get_datatype_by_extension( ext )
+    if getattr( foo_datatype, 'display_name', False ):
+        dataset.name = foo_datatype.display_name( dataset )

    info = dict( type='dataset',
                 dataset_id=dataset.dataset_id,
                 ext=ext,
                 stdout=stdout,
                 name=dataset.name,
                 line_count=line_count )

Assuming you have your own display_name() it should work. Hope that helps!

bgruening commented 6 years ago

That said, I know that @jmchilton is working on a new iteration of Galaxy uploads so not sure if this conflicts with his work.

pkrog commented 6 years ago

Hi @bgruening , thanks for your answer. However I cannot find the lines you are referring to. I've been looking into the dev branch. In which branch should I look?

bgruening commented 6 years ago

I was referencing this code here: https://github.com/galaxyproject/galaxy/blob/dev/tools/data_source/upload.py#L321

But maybe the code snipped needs to be moved up in the file. It was just a quick hack, let me know if you need more help.

pkrog commented 6 years ago

We are using version 17.09 here at PhenoMeNal, and it seems to code of the uploader is quite different. I just have a file lib/galaxy/tools/actions/upload_common.py here (there is no galaxy/tools/data_source/upload.py file) and I cannot find an add_file function. Am I looking in the wrong file?

bgruening commented 6 years ago

Here it is: https://github.com/galaxyproject/galaxy/blob/release_17.09/tools/data_source/upload.py

I was hacking on your branch, so it should be also there.

pkrog commented 6 years ago

Ok, I got it, thanks. Though it does not work, I'll try to see why.

pkrog commented 6 years ago

Hi @bgruening , I implemented your modification inside add_composite_file() but to no avail. The name of the dataset was never set. However, while debugging I accidentally provoke the setting the dataset name, by looking into attributes of the dataset. You can look at my code in PR phnmnl/galaxy#22. So it works now, but I don't understand how. Would you have an idea about why and how this ugly hack works?

galaxyproject / galaxy

Composite data type for ISA-Tab + plus data in the context of metabolomics (mass spec and NMR) #3134