MatthewRalston opened this issue 8 years ago
A related idea we've been thinking about is a data library that is entirely just a view onto another data source. For example, an S3 bucket containing an index file with metadata about datasets, which could populate a library on the fly and then allow on-demand access to the data therein.
@martenson is currently the main person working on data libraries, and @afgane has been working on ideas for data federation that are likely relevant (this would presumably use the objectstore as an abstraction layer).
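To make the index-file idea concrete, here is a purely illustrative sketch of what such a bucket-resident index might contain. None of these field names exist in Galaxy; they are just one plausible shape for the metadata a view-style library would need:

```python
# Purely illustrative: one possible shape for a hypothetical index file
# stored in the S3 bucket itself. None of these field names exist in
# Galaxy; they are just the metadata a "view" library would plausibly need.
LIBRARY_INDEX = {
    "library_name": "Shared Reference Data",
    "datasets": [
        {
            "name": "hg38.fa",
            "key": "genomes/hg38.fa",   # object key inside the bucket
            "file_type": "fasta",       # datatype to assign without sniffing
            "dbkey": "hg38",
            "size": 3209286105,         # bytes, shown in the UI pre-fetch
        },
    ],
}

# A view-style library would iterate LIBRARY_INDEX["datasets"] to populate
# folders on the fly, deferring any actual download until import time.
```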
Where in the codebase is the data library code? The object store could be too bulky for what we need, no? We might just need some of the configuration parameters for boto access, and then some logic in the data libraries class on what to do with a URL instead of a link to a local file. Does this sound right?
@MatthewRalston
frontend: https://github.com/galaxyproject/galaxy/tree/dev/client/galaxy/scripts/mvc/library
API: https://github.com/galaxyproject/galaxy/tree/dev/lib/galaxy/webapps/galaxy/api (look at libraries, folders, and lda_datasets)
other library controllers: https://github.com/galaxyproject/galaxy/tree/dev/lib/galaxy/webapps/galaxy/controllers
permissions: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/security/__init__.py
model: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/model/__init__.py
Thanks @martenson
@afgane can you provide some direction regarding the object store and some of the locations in the galaxy codebase?
@martenson I would like to begin working on this. In bioblend's library interface, there is a method called "upload_from_galaxy_filesystem" which has an option called "link_data_only". I assume that under the hood, Galaxy stores the file's path instead of copying the data into Galaxy and then storing the location of the copy. It looks like Galaxy then just passes the path of that dataset to the first tool that uses it. I am guessing that a dataset hosted remotely will need to be "fetched" when a user tries to import it into their history. Maybe behind the scenes remote datasets can be differentiated from locally available ones, and if they are located in an S3 bucket they can be fetched with the admin's objectstore credentials? Does this seem valid given the data library security layer?
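For reference, a call to that bioblend method looks roughly like this. The URL, API key, library id, and path are placeholders; note that this path-based upload is admin-only and must be enabled in Galaxy's config:

```python
from bioblend.galaxy import GalaxyInstance

# Placeholders throughout: the URL, API key, library id, and file path
# below are made up for illustration.
gi = GalaxyInstance(url="https://galaxy.example.org", key="YOUR_API_KEY")
gi.libraries.upload_from_galaxy_filesystem(
    library_id="f2db41e1fa331b3e",              # placeholder library id
    filesystem_paths="/data/shared/reads.fastq",
    link_data_only="link_to_files",             # record the path, don't copy
)
```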
@MatthewRalston All of the objectstore code is in https://github.com/galaxyproject/galaxy/tree/dev/lib/galaxy/objectstore
I would peruse the current S3ObjectStore first, as we'll want to use the same caching mechanism for the library feature.
'link_data_only' actually refers to symlinking the data on the local filesystem, and is unrelated.
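For anyone following along, the caching being referred to is essentially "check a local cache directory, pull from S3 on a miss". Here is a minimal sketch of that pattern using boto (the library Galaxy used at the time); this shows the shape of the mechanism, not the actual S3ObjectStore code:

```python
import os
import boto

def pull_into_cache(bucket_name, key_name, cache_root, access_key, secret_key):
    """Check-cache-then-fetch, the pattern the S3ObjectStore uses.
    Illustrative sketch only, not Galaxy's actual implementation."""
    cached = os.path.join(cache_root, key_name)
    if not os.path.exists(cached):              # cache miss: fetch from S3
        os.makedirs(os.path.dirname(cached), exist_ok=True)
        conn = boto.connect_s3(aws_access_key_id=access_key,
                               aws_secret_access_key=secret_key)
        key = conn.get_bucket(bucket_name).get_key(key_name)
        key.get_contents_to_filename(cached)
    return cached                               # tools read the local copy
```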
@markiskander re objectstore: check out https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/objectstore/__init__.py
re lib import: I am not too familiar with bioblend; however, I think you will be better off implementing this on the Galaxy side, not the bioblend side.
Galaxy stores the linked data info in the `dataset` table, in the `external_filename` column, but this is only used in libraries, I think.
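Roughly, the path resolution looks like the sketch below (simplified, not the actual code from lib/galaxy/model/__init__.py):

```python
# Simplified sketch of how a dataset's on-disk path is resolved; the real
# logic lives in Galaxy's model, this is just the shape of it.
class DatasetSketch:
    def __init__(self, external_filename=None, object_store_path=None):
        # external_filename is set for "link only" library uploads.
        self.external_filename = external_filename
        self._object_store_path = object_store_path

    def get_file_name(self):
        if self.external_filename:
            return self.external_filename      # linked data: original path
        return self._object_store_path         # normal data: managed path
```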
Logic for when to fetch and how long to cache/store locally should probably be fairly smart (Galaxy needs to detect metadata on its datasets, etc.).
What are the 'admin's objectstore credentials' ?
@martenson the objectstore credentials are part of galaxy.ini as `os_access_key` & `os_secret_key`. I was envisioning access to files first with boto or GET requests a la CloudMan. Fetching should occur when a user requests the file to be imported. How do you all envision dataset retrieval for the user? How would this affect the db? Would this be something that happens in the background (a yellow dataset during the fetching operation) for the user, or would it look more like the "Get Data" tool, with a dialog during the download?
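For concreteness, reading those keys and opening a boto connection might look like the sketch below. This assumes the options sit under [app:main] in galaxy.ini, as other object store settings did at the time; the bucket name and paths are placeholders:

```python
import boto
from configparser import ConfigParser

# Assumes os_access_key / os_secret_key live under [app:main] in
# galaxy.ini, alongside the other object store options of that era.
config = ConfigParser(interpolation=None)     # galaxy.ini uses %()s vars
config.read("config/galaxy.ini")
access_key = config.get("app:main", "os_access_key")
secret_key = config.get("app:main", "os_secret_key")

conn = boto.connect_s3(aws_access_key_id=access_key,
                       aws_secret_access_key=secret_key)
bucket = conn.get_bucket("my-library-bucket")  # placeholder bucket name
```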
@MatthewRalston Did you check out the S3ObjectStore and the caching there? I think looking at that, and maybe tinkering with it a bit, will answer a lot of your questions and help guide this effort.
If library datasets were never directly usable for anything, then I'd say we could just store a link and only fetch once the dataset was created as an HDA; but since they are, we need a more generic caching mechanism.
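A hypothetical sketch of that deferred-fetch idea follows; the class and its methods do not exist in Galaxy and just illustrate "store the link, fetch on first import into a history":

```python
import hashlib
import os
import urllib.request

class RemoteDatasetCache:
    """Hypothetical helper, not Galaxy code: bytes are only pulled the
    first time a library dataset is imported into a history."""

    def __init__(self, cache_root):
        self.cache_root = cache_root

    def local_path(self, uri):
        # Hash the URI into a stable cache file name.
        return os.path.join(self.cache_root,
                            hashlib.sha256(uri.encode()).hexdigest())

    def ensure_fetched(self, uri):
        path = self.local_path(uri)
        if not os.path.exists(path):            # first import: fetch now
            os.makedirs(self.cache_root, exist_ok=True)
            urllib.request.urlretrieve(uri, path)
        return path                             # hand this path to the HDA
```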
Why is the data fetched? To determine the data type and other meta information about the file? Is it possible to give Galaxy this information somehow, so the file doesn't need to be fetched?
Envision a small instance running the Galaxy front end. When it fetches a large FASTQ file, the instance may not have enough space. Ideally, Galaxy would be given all the necessary meta information so it doesn't need to download the file at all.
Another example: Galaxy runs on a local (in-house) computer, and all analyses are performed in AWS. In this case, the data should never be downloaded, only the results of tools.
@golharam Your vision is very valid and this is something we would like to have. However, Galaxy currently does not support 'accepting' metadata from sources other than its own metadata generator script, I think (so that would be the first thing to change).
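To illustrate what 'accepting' metadata could mean, here is a hypothetical payload a client might one day send so Galaxy never has to download the file. No such API exists; every field name here is invented:

```python
# Hypothetical payload only; no such Galaxy API exists. If the client
# already supplies everything the metadata generator script would compute,
# the file never needs to be fetched at registration time.
remote_dataset_metadata = {
    "uri": "s3://my-bucket/reads/sample1.fastq.gz",  # placeholder location
    "name": "sample1.fastq.gz",
    "file_type": "fastqsanger",      # datatype Galaxy would otherwise sniff
    "file_size": 1234567890,         # bytes, for quotas and UI display
    "dbkey": "hg38",
    "md5": "d41d8cd98f00b204e9800998ecf8427e",  # verify on eventual fetch
}
```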
I actually talked about this with @afgane again somewhat recently as well, for being able to have 'views' to other galaxies from, say, a cloud instance. It's definitely something we're interested in, though it's not an immediate priority (so, if you wanted to take a stab at this @golharam, that'd be super cool and I'd be happy to try to figure out a plan with you).
I'll see what I can do. Can you point me to the relevant sections of code in Galaxy I should start looking at?
@golharam sorry that we missed your last question. Do you still plan to work on this? What do you need?
I have an idea for a feature for Galaxy and bioblend: namely, I'd like to have access to data libraries with files located in S3. Similar to the ObjectStore, but for data library files instead of datasets. Currently, the data library and bioblend support "links" to files, where a data library can simply "point" to a file's location on a local disk and users can pull the datasets into their current histories to begin an analysis. It would be interesting to have similar functionality for remote file locations.
AFAIK the current implementation of the data libraries can retrieve publicly available files from S3 or an FTP site with a certain bioblend method, but there is no set of methods to tell Galaxy to instead retrieve the files only when needed and to store the URL in the meantime. Of course there is no guarantee that the files would actually be there, but the same is true for the "links" to files mentioned above. Perhaps there is some way to manage this edge case.
I'd like to begin work on this feature for the Galaxy community, but I don't know where to start in the codebase. @dannon, who is the expert on data libraries that I should talk to so I can begin working on this feature?