Reproducibility and Large Dataset Collections BoF

jmchilton commented 7 years ago

Several of us met at GCC2017 to discuss this topic, with a focus on two different representations of collections of data - the Hyper-browser representation and the stock representation.

I made the following list of topics that I wanted to track and potentially follow up on. These weren't really meeting notes, so I apologize if I'm missing particular contributions to the discussion. Feel free to jump in and fill in details and new discussion.

Galaxy should export collections with URIs.
Galaxy should be able to provide a tabular view of dataset collections to users and tools - potentially using URIs.
Galaxy should allow collecting and tracking more metadata ("facts" not "opinions").
Import sample sheets - collect metadata and collect the source.
Phillip wrote a bundled export tool available on the tool shed.
More tags in collections - definitely URI.
Libraries need collections!!!
Send analysis rep.; also use in publication. It is harder to track hundreds of datasets for reproducibility.
We don't track where the data comes from in many cases.
Best practice collections paper - there are so many ways to use them, and the problems differ in some ways from those in older reproducibility-focused papers. Biologists should see Galaxy as the protocol for managing large collections.
Encourage URI access - UI hints.
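
To make the sample sheet item above a bit more concrete, here is a rough Python sketch of what "collect metadata, collect source" could look like. This is not Galaxy code: the `parse_sample_sheet` function, the `name` and `uri` column names, and the example.org URLs are all assumptions made up for illustration. Every column other than `name` and `uri` is kept verbatim as metadata ("facts"), and the URI records where the data came from.

```python
import csv
import io
from urllib.parse import urlparse


def parse_sample_sheet(text):
    """Parse a tab-separated sample sheet into element records.

    Hypothetical layout: a header row with 'name' and 'uri' columns,
    plus any number of extra columns that are stored as metadata.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    elements = []
    for row in reader:
        uri = row["uri"]
        # Require a scheme (http, https, ftp, ...) so the source is trackable.
        if not urlparse(uri).scheme:
            raise ValueError(f"not a URI: {uri!r}")
        # Everything that is not the name or the URI is kept as metadata.
        metadata = {k: v for k, v in row.items() if k not in ("name", "uri")}
        elements.append({"name": row["name"], "uri": uri, "metadata": metadata})
    return elements


# Illustrative sheet; the URLs are placeholders, not real datasets.
SHEET = (
    "name\turi\ttissue\n"
    "sample1\thttps://example.org/data/sample1.bed\tliver\n"
    "sample2\thttps://example.org/data/sample2.bed\tkidney\n"
)

elements = parse_sample_sheet(SHEET)
```

Each record keeps the source URI next to the metadata, so an export of the collection could carry both along.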

I'll keep this issue open until the conversation dies down and then maybe link out to concrete action issues.

pvanheus commented 7 years ago

Could you please expand on the references to URIs in this?

sandve commented 7 years ago

I will be happy to follow up on these ideas when I am back from vacation in early August! I also believe several other people from the Oslo group will be happy to join in. As mentioned at the BoF, we have quite a lot of experience with how such information and representations are useful in various analytical settings. We know less about the experience gained with the current Dataset List solution in Galaxy, and about the Galaxy team's main plans in this direction. As said at the conference, we would be very happy to try to contribute towards the best possible solution!

Regarding the question about URIs, the idea is to have a standard way of representing a multiplicity of datasets, where each dataset would be represented by a URI that could e.g. be the URL of a bed file.
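
As a minimal sketch of that idea (in Python, with made-up names; this is neither the HyperBrowser format nor a Galaxy API), a collection could be a list of named references, each holding one URI, which is then written out as a two-column table. `DatasetRef`, `to_tabular`, and the example.org URLs are hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class DatasetRef:
    """One dataset in a collection, identified by a URI (hypothetical type)."""

    name: str
    uri: str

    def __post_init__(self):
        # Reject strings without a scheme, e.g. bare local file names.
        if not urlparse(self.uri).scheme:
            raise ValueError(f"not a URI: {self.uri!r}")


def to_tabular(refs):
    """Render a list of DatasetRef as tab-separated text with a header row."""
    lines = ["name\turi"]
    lines.extend(f"{ref.name}\t{ref.uri}" for ref in refs)
    return "\n".join(lines) + "\n"


# Placeholder URLs standing in for e.g. the URL of a bed file.
collection = [
    DatasetRef("peaks_a", "https://example.org/tracks/peaks_a.bed"),
    DatasetRef("peaks_b", "https://example.org/tracks/peaks_b.bed"),
]
table = to_tabular(collection)
```

A plain table like this would be readable by both users and tools, which is the appeal of the URI-based representation.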

galaxyproject / galaxy

Reproducibility and Large Dataset Collections BoF #4265