galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

Import / Export of folders #2566

Open korseby opened 8 years ago

korseby commented 8 years ago

Within the PhenoMeNal project we need to process RAW metabolomics data. These data come from different vendors that do not save the data of a single measurement in a single file but in several files in a folder. Thus, for further processing in Galaxy, we need to import the whole folder. The implementation of "Composite Datatypes" in Galaxy does not work for us, because the file structure varies between vendors, measurements and types of analysis (NMR, LCMS). Thus, we cannot organise the files in the folder into a dataset.

How can we handle whole folders in Galaxy?

We tried to zip the folders and pass the zip file to the tool, but that is a complicated solution that takes a lot of time, because the data is very large.

Is there a possibility to import / export whole folders into Galaxy in order to process them further with our tools?

jj-umn commented 8 years ago

@korseby University of Minnesota has users create a Dataset Collection of the files and then use that collection as input to a workflow. See page 47 in https://github.com/galaxyproteomics/abrf2016/blob/master/ABRF_2016_SW4_Galaxy_for_Multi-Omics.pdf

lecorguille commented 8 years ago

@jj-umn Dataset collections, as you use them, group a bunch of individual, independent datasets. In @korseby's case, a single measurement is actually contained in several files.

korseby commented 8 years ago

The PDF is very interesting (especially the multi-omics approach). However, lecorguille is right. The MGF files you are using are single files and not folders. We need to import RAW folders directly into Galaxy. The folders contain several files. The number of files differs between vendors and measurements, which makes it very difficult to create a Dataset Collection from them.

bgruening commented 8 years ago

@korseby I guess the correct approach is to define a composite datatype for each vendor. I also thought of doing this here for our proteomics and metabolomics efforts, but we decided to convert the RAW data (mostly proprietary stuff) directly after obtaining it and to load an open-standard format into Galaxy, mostly because the conversion needs a Windows machine and because of all these problems with the RAW data. But composite datatype(s) will work as well, I guess.

lecorguille commented 8 years ago

Sorry @korseby, we had a parallel chat in another channel.

And as I said, we can't ask the end user to upload the data file by file and group/compose it file by file. @nsoranzo proposed using the API and BioBlend to deal with that.

I understand that as follows: provide a web interface where users can browse their project folders (or link it to a repository such as MetaboLights) and do the job (upload data, set up composite datasets) through the API. Or ...?

korseby commented 8 years ago

@bgruening Problem is that the folder structure differs for each vendor and type of measurement. For example, sometimes there is an extra file with metadata, sometimes not... How can we handle this in Galaxy? As far as I understand it, the user would have to choose all the files by hand instead of 'just' choosing a directory...

@lecorguille Don't know about the API behind Galaxy, but in HTML5 there is an option to choose more than one file at once. Choosing a folder would mean choosing all the files inside that folder. Galaxy would then automatically create some kind of "Unspecified Composite Datatype".

bgruening commented 8 years ago

@korseby would it help if the composite datatype upload form were smart enough to take X files and sort them automatically into your composite datatypes, e.g. depending on the file ending?

korseby commented 8 years ago

@bgruening Not the file ending, but rather the filename(s) themselves (at least for Bruker .d folders). But, yes. Since msconvert can do it, we can do it too. The only problem is to figure out the structure and the "fuzziness" (dependencies) within the folders.

bgruening commented 8 years ago

@korseby my idea would be to enhance add_composite_file() with an additional attribute that gives a hint about the actual filename/ending that is expected. This can be a regex, for example, or just .wiff or an entire name. I would then create one datatype for every vendor and define this hint here: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/datatypes/proteomics.py#L33

On the client side, the upload tool needs to be enhanced to use this hint automatically to sort the chosen files into the correct composite file.
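As a rough illustration of that sorting step, here is a minimal standalone sketch in plain Python. It is not Galaxy code: the `HINTS` mapping, the slot names and the `sort_into_slots` helper are hypothetical, and the hint attribute on add_composite_file() does not exist yet. Each hint is a regex on the file's base name, and each chosen file goes to the first free slot whose hint matches.

```python
import re

# Hypothetical hints: composite-file slot -> regex on the uploaded base name.
# Slot names and patterns are illustrative only, not an existing Galaxy API.
HINTS = {
    "wiff": re.compile(r".*\.wiff$", re.IGNORECASE),
    "wiff_scan": re.compile(r".*\.wiff\.scan$", re.IGNORECASE),
}


def sort_into_slots(filenames, hints):
    """Assign each filename to the first still-empty slot whose regex matches.

    Returns (assigned, unmatched, missing):
      assigned  - dict slot -> filename
      unmatched - filenames that fit no free slot
      missing   - slots that received no file
    """
    assigned = {}
    unmatched = []
    for name in filenames:
        for slot, pattern in hints.items():
            if slot not in assigned and pattern.fullmatch(name):
                assigned[slot] = name
                break
        else:
            # No free slot matched this file.
            unmatched.append(name)
    missing = [slot for slot in hints if slot not in assigned]
    return assigned, unmatched, missing
```

The upload form could then show `unmatched` and `missing` to the user instead of silently guessing.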

korseby commented 8 years ago

@bgruening Thanks. Sounds like a good place to start. I'll try to add an entry for Bruker .d folders and test it here.

@lecorguille Can you update the client side so that we are able to choose more than one file?

sneumann commented 8 years ago

Hi, just as a concrete example, ftp://ftp.ebi.ac.uk/pub/databases/metabolights/studies/public/MTBLS2/

The user would expect to be able to select all her *.d directories in the file browser and upload them to Galaxy.

The design has to be able to cope with the fact that each *.d has a set of fixed filenames, i.e. all will have an "analysis.baf" file, only distinguished by which directory it is contained in.

Yours, Steffen
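To make the constraint about identical filenames concrete, here is a minimal sketch (not Galaxy code) of how a multi-folder selection could be split into one group per .d folder, so that the several "analysis.baf" files stay apart. It assumes the client can send paths relative to the selected folder. The function name and all example paths except "analysis.baf" are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_run_folder(relative_paths, suffix=".d"):
    """Group file paths by the nearest enclosing folder ending in `suffix`.

    Keys are the run folders; values are file paths relative to that folder,
    so sub-folders inside a run folder are preserved. Files that are not
    inside any such folder are skipped.
    """
    groups = defaultdict(list)
    for raw in relative_paths:
        path = PurePosixPath(raw)
        parents = path.parts[:-1]
        # Search from the innermost folder outwards.
        for depth in range(len(parents), 0, -1):
            if parents[depth - 1].endswith(suffix):
                run = str(PurePosixPath(*parents[:depth]))
                inner = str(PurePosixPath(*path.parts[depth:]))
                groups[run].append(inner)
                break
    return dict(groups)
```

Each resulting group could then be treated as one composite dataset, with the folder name serving as the dataset name.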

bgruening commented 8 years ago

@sneumann what I proposed would work for one folder at a time. Do you think this is an acceptable way forward?

sneumann commented 8 years ago

For me, as an interim solution and proof of concept, yes. For the users in the long term, having to upload folders for 100 samples, no. I guess we should aim at something that works as smoothly as the msconvertGUI, which allows selecting many raw files/directories for processing. Yours, Steffen

bgruening commented 8 years ago

Let's go with the proof of concept first and then scale up :) Thanks for your input!

korseby commented 8 years ago

Just for demonstration, I added an initial Composite Datatype description here:

https://github.com/phnmnl/docker-pwiz/blob/master/metabolomics.py

This file does not work yet, nor has it been tested. Some problems with it: some files have fixed names, others don't. How do we handle the files that have different names? The .d folder also contains a .m sub-folder. How do we handle this?

Obviously, you can't expect the user to upload all these 34 files by hand...
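One possible way to think about the two questions above, as a sketch only: describe the folder with glob patterns on paths relative to the .d folder, and mark each pattern as required or optional. Fixed names become literal patterns, varying names become wildcards, and the .m sub-folder is covered by a pattern containing a slash. The `RULES` list and `check_layout` helper are hypothetical and the patterns are not a real Bruker specification; only "analysis.baf" comes from this thread.

```python
from fnmatch import fnmatchcase

# Hypothetical layout rules: (glob on the path relative to the .d folder, required?)
# Note that fnmatch's "*" also matches "/", so "*.xml" matches files in sub-folders too.
RULES = [
    ("analysis.baf", True),   # fixed name, must be present
    ("*.m/*", False),         # anything inside a .m sub-folder, optional
    ("*.xml", False),         # metadata files with varying names, optional
]


def check_layout(relative_paths, rules):
    """Return (missing, unexpected) for a folder listing.

    missing    - required patterns that no path matches
    unexpected - paths that match none of the patterns
    """
    missing = [
        pattern
        for pattern, required in rules
        if required and not any(fnmatchcase(p, pattern) for p in relative_paths)
    ]
    unexpected = [
        p
        for p in relative_paths
        if not any(fnmatchcase(p, pattern) for pattern, _ in rules)
    ]
    return missing, unexpected
```

With such a check the user would only pick the folder; the 34 files would be matched automatically and only the exceptions reported.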