MatthewRalston opened this issue 8 years ago
A related idea we've been thinking about is a data library that is entirely just a view onto another data source. For example, an S3 bucket containing an index file with metadata about datasets, which could populate a library on the fly and then allow on-demand access to the data therein.
@martenson is currently the main person working on data libraries, and @afgane has been working on ideas for data federation that are likely relevant (this would presumably use the objectstore as an abstraction layer).
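To make the index-file idea concrete, here is a purely illustrative sketch of what such a bucket-resident index might contain. None of these field names exist in Galaxy; they are just one plausible shape for the metadata a view-style library would need:

```python
# Purely illustrative: one possible shape for a hypothetical index file
# stored in the S3 bucket itself. None of these field names exist in
# Galaxy; they are just the metadata a "view" library would plausibly need.
LIBRARY_INDEX = {
    "library_name": "Shared Reference Data",
    "datasets": [
        {
            "name": "hg38.fa",
            "key": "genomes/hg38.fa",   # object key inside the bucket
            "file_type": "fasta",       # datatype to assign without sniffing
            "dbkey": "hg38",
            "size": 3209286105,         # bytes, shown in the UI pre-fetch
        },
    ],
}

# A view-style library would iterate LIBRARY_INDEX["datasets"] to populate
# folders on the fly, deferring any actual download until import time.
```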
Where in the codebase is the data library code? The object store could be too bulky for what we need, no? We might just need some of the configuration parameters for boto access, and then some logic in the data libraries class on what to do with a URL instead of a link to a local file. Does this sound right?
@MatthewRalston
frontend: https://github.com/galaxyproject/galaxy/tree/dev/client/galaxy/scripts/mvc/library
API: https://github.com/galaxyproject/galaxy/tree/dev/lib/galaxy/webapps/galaxy/api (look at libraries, folders, and lda_datasets)
other library controllers: https://github.com/galaxyproject/galaxy/tree/dev/lib/galaxy/webapps/galaxy/controllers
permissions: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/security/__init__.py
model: https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/model/__init__.py
Thanks @martenson
@afgane can you provide some direction regarding the object store and some of the locations in the galaxy codebase?
@martenson I would like to begin working on this. In bioblend's library interface, there is a method called "upload_from_galaxy_filesystem" which has an option called "link_data_only". I assume that under the hood, Galaxy stores the file's path instead of copying the data into Galaxy and then storing the location of the copy. It looks like Galaxy then just passes the path of that dataset to the first tool that uses it. I am guessing that a dataset hosted remotely will need to be "fetched" when a user tries to import it into their history. Maybe behind the scenes remote datasets can be differentiated from locally available ones, and if they are located in an S3 bucket they can be fetched with the admin's objectstore credentials? Does this seem valid given the data library security layer?
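For reference, a call to that bioblend method looks roughly like this. The URL, API key, library id, and path are placeholders; note that this path-based upload is admin-only and must be enabled in Galaxy's config:

```python
from bioblend.galaxy import GalaxyInstance

# Placeholders throughout: the URL, API key, library id, and file path
# below are made up for illustration.
gi = GalaxyInstance(url="https://galaxy.example.org", key="YOUR_API_KEY")
gi.libraries.upload_from_galaxy_filesystem(
    library_id="f2db41e1fa331b3e",              # placeholder library id
    filesystem_paths="/data/shared/reads.fastq",
    link_data_only="link_to_files",             # record the path, don't copy
)
```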
@MatthewRalston All of the objectstore code is in https://github.com/galaxyproject/galaxy/tree/dev/lib/galaxy/objectstore
I would peruse the current S3ObjectStore first, as we'll want to use the same caching mechanism for the library feature.
'link_data_only' actually refers to symlinking the data on the local filesystem, and is unrelated.
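For anyone following along, the caching being referred to is essentially "check a local cache directory, pull from S3 on a miss". Here is a minimal sketch of that pattern using boto (the library Galaxy used at the time); this shows the shape of the mechanism, not the actual S3ObjectStore code:

```python
import os
import boto

def pull_into_cache(bucket_name, key_name, cache_root, access_key, secret_key):
    """Check-cache-then-fetch, the pattern the S3ObjectStore uses.
    Illustrative sketch only, not Galaxy's actual implementation."""
    cached = os.path.join(cache_root, key_name)
    if not os.path.exists(cached):              # cache miss: fetch from S3
        os.makedirs(os.path.dirname(cached), exist_ok=True)
        conn = boto.connect_s3(aws_access_key_id=access_key,
                               aws_secret_access_key=secret_key)
        key = conn.get_bucket(bucket_name).get_key(key_name)
        key.get_contents_to_filename(cached)
    return cached                               # tools read the local copy
```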
@markiskander re objectstore: check out https://github.com/galaxyproject/galaxy/blob/dev/lib/galaxy/objectstore/__init__.py
re lib import: I am not too familiar with bioblend; however, I think you will be better off implementing this on the Galaxy side, not the bioblend side.
Galaxy stores the linked data info in the `dataset` table, in the `external_filename` column, but this is only used in libraries, I think.
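Roughly, the path resolution looks like the sketch below (simplified, not the actual code from lib/galaxy/model/__init__.py):

```python
# Simplified sketch of how a dataset's on-disk path is resolved; the real
# logic lives in Galaxy's model, this is just the shape of it.
class DatasetSketch:
    def __init__(self, external_filename=None, object_store_path=None):
        # external_filename is set for "link only" library uploads.
        self.external_filename = external_filename
        self._object_store_path = object_store_path

    def get_file_name(self):
        if self.external_filename:
            return self.external_filename      # linked data: original path
        return self._object_store_path         # normal data: managed path
```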
Logic for when to fetch and how long to cache/store locally should probably be fairly smart (Galaxy needs to detect metadata on its datasets, etc.).
What are the 'admin's objectstore credentials' ?
@martenson the objectstore credentials are part of galaxy.ini as `os_access_key` & `os_secret_key`. I was envisioning access to files first with boto or GET requests a la CloudMan. Fetching should occur when a user requests the file to be imported. How do you all envision dataset retrieval for the user? How would this affect the db? Would this be something that happens in the background (a yellow dataset during the fetching operation) for the user, or would it look more like the "Get Data" tool, with a dialog during the download?
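For concreteness, reading those keys and opening a boto connection might look like the sketch below. This assumes the options sit under [app:main] in galaxy.ini, as other object store settings did at the time; the bucket name and paths are placeholders:

```python
import boto
from configparser import ConfigParser

# Assumes os_access_key / os_secret_key live under [app:main] in
# galaxy.ini, alongside the other object store options of that era.
config = ConfigParser(interpolation=None)     # galaxy.ini uses %()s vars
config.read("config/galaxy.ini")
access_key = config.get("app:main", "os_access_key")
secret_key = config.get("app:main", "os_secret_key")

conn = boto.connect_s3(aws_access_key_id=access_key,
                       aws_secret_access_key=secret_key)
bucket = conn.get_bucket("my-library-bucket")  # placeholder bucket name
```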
@MatthewRalston Did you check out the S3ObjectStore and the caching there? I think looking at that, and maybe tinkering with it a bit, will answer a lot of your questions and help guide this effort.
If library datasets were never directly usable for anything, then I'd say we could just store a link and only fetch once the dataset was created as an HDA; but since they are, we need a more generic caching mechanism.
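A hypothetical sketch of that deferred-fetch idea follows; the class and its methods do not exist in Galaxy and just illustrate "store the link, fetch on first import into a history":

```python
import hashlib
import os
import urllib.request

class RemoteDatasetCache:
    """Hypothetical helper, not Galaxy code: bytes are only pulled the
    first time a library dataset is imported into a history."""

    def __init__(self, cache_root):
        self.cache_root = cache_root

    def local_path(self, uri):
        # Hash the URI into a stable cache file name.
        return os.path.join(self.cache_root,
                            hashlib.sha256(uri.encode()).hexdigest())

    def ensure_fetched(self, uri):
        path = self.local_path(uri)
        if not os.path.exists(path):            # first import: fetch now
            os.makedirs(self.cache_root, exist_ok=True)
            urllib.request.urlretrieve(uri, path)
        return path                             # hand this path to the HDA
```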
Why is the data fetched? To determine the data type and other meta information about the file? Is it possible to give Galaxy this information somehow, so the file doesn't need to be fetched?
Envision a small instance running the Galaxy front end. When it fetches a large FASTQ file, the instance may not have enough space. Ideally, Galaxy would be given all the necessary meta information so it doesn't need to download the file at all.
Another example: Galaxy runs on a local (in-house) computer, and all analyses are performed in AWS. In this case, the data should never be downloaded, only the results of tools.
@golharam Your vision is very valid and this is something we would like to have. However, Galaxy currently does not support 'accepting' metadata from sources other than its own metadata generator script, I think (so that would be the first thing to change).
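To illustrate what 'accepting' metadata could mean, here is a hypothetical payload a client might one day send so Galaxy never has to download the file. No such API exists; every field name here is invented:

```python
# Hypothetical payload only; no such Galaxy API exists. If the client
# already supplies everything the metadata generator script would compute,
# the file never needs to be fetched at registration time.
remote_dataset_metadata = {
    "uri": "s3://my-bucket/reads/sample1.fastq.gz",  # placeholder location
    "name": "sample1.fastq.gz",
    "file_type": "fastqsanger",      # datatype Galaxy would otherwise sniff
    "file_size": 1234567890,         # bytes, for quotas and UI display
    "dbkey": "hg38",
    "md5": "d41d8cd98f00b204e9800998ecf8427e",  # verify on eventual fetch
}
```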
I actually talked about this with @afgane again somewhat recently as well, for being able to have 'views' to other galaxies from, say, a cloud instance. It's definitely something we're interested in, though it's not an immediate priority (so, if you wanted to take a stab at this @golharam, that'd be super cool and I'd be happy to try to figure out a plan with you).
I'll see what I can do. Can you point me to the relevant sections of code in Galaxy I should start looking at?
@golharam sorry that we missed your last question. Do you still plan to work on this? What do you need?
I have an idea for a feature for Galaxy and bioblend: namely, I'd like to have access to data libraries with files located in S3. Similar to the ObjectStore, but for data library files instead of datasets. Currently, the data library and bioblend support "links" to files, where a data library can simply "point" to a file's location on a local disk and users can pull the datasets into their current histories to begin an analysis. It would be interesting to have similar functionality for remote file locations.
AFAIK the current implementation of the data libraries can retrieve publicly available files from S3 or an FTP site with a certain bioblend method, but there is no set of methods to tell Galaxy to instead retrieve the files only when needed and to store the URL in the meantime. Of course there is no guarantee that the files would actually be there, but the same is true for the "links" to files mentioned above. Perhaps there is some way to manage this edge case.
I'd like to begin work on this feature for the Galaxy community, but I don't know where to start in the codebase. @dannon, who is the expert on data libraries that I should talk to so I can begin working on this feature?