biocommons / biocommons.seqrepo

non-redundant, compressed, journalled, file-based storage for biological sequences
Apache License 2.0

Support custom/federated data #132

Closed jsstevenson closed 9 months ago

jsstevenson commented 10 months ago

Is your feature request related to a problem? Please describe.

Our group has been using SeqRepo and other Biocommons/related tools to develop VRS mappings for MaveDB submissions. Part of this has involved adding individual experiment target sequences to SeqRepo so that they can be reused later during the VRS translation process.

Now, there are some methods to support this, but I think it's an underdeveloped use case relative to how we've been using it otherwise (i.e., syncing against a set of main snapshot sequences maintained at biocommons.org). At minimum, this process might be a little under-documented, but there's probably room for more explicit data management tooling. If I, for example, wanted to roll back to a previous checkpoint of data, I think I'd need to do so manually. Ditto for removing a specific chunk of added data (not just the most recent set of additions).
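To make the rollback/removal idea concrete, here is a minimal sketch of what "remove a specific chunk of added data" could look like. It is a toy in-memory model, not SeqRepo's API: the class `BatchedSequenceStore`, its methods, and the batch names are all hypothetical, and the truncated SHA-512 digest is only a stand-in for a real sequence identifier.

```python
import hashlib


class BatchedSequenceStore:
    """Hypothetical toy store that remembers which batch first added each sequence."""

    def __init__(self):
        self._seqs = {}     # digest -> sequence
        self._batches = {}  # batch name -> set of digests first added by that batch

    def add(self, batch, seq):
        """Store seq under a content digest and credit it to batch if it is new."""
        digest = hashlib.sha512(seq.encode("ascii")).hexdigest()[:24]
        if digest not in self._seqs:
            # Non-redundant storage: only the first batch to add a sequence owns it.
            self._seqs[digest] = seq
            self._batches.setdefault(batch, set()).add(digest)
        return digest

    def remove_batch(self, batch):
        """Drop every sequence that this batch introduced, whatever was added after it."""
        for digest in self._batches.pop(batch, set()):
            del self._seqs[digest]

    def __contains__(self, digest):
        return digest in self._seqs


store = BatchedSequenceStore()
ref = store.add("main-snapshot", "ACGT")
target = store.add("mavedb-targets", "GGCC")
store.add("mavedb-targets", "ACGT")  # already present, so not owned by this batch
store.remove_batch("mavedb-targets")
```

After the last call, `ref` is still in the store and `target` is not. A real implementation would also need to decide what happens when an earlier batch is removed while a later batch still relies on one of its sequences; this sketch ignores that case.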

Broadly, though, @ahwagner has suggested there could be an interest in other branches of main snapshots (e.g. the Japanese reference genome), or perhaps a set of custom sequences used internally at a lab. A user might want to be able to select which reference genomes are stored in their local seqrepo and access all of them simultaneously.
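As a rough illustration of "access all of them simultaneously", the sketch below resolves an alias against several named sources in priority order. Everything here is an assumption made for illustration: `FederatedLookup` is a hypothetical name, the sources are plain dicts rather than SeqRepo instances, and the aliases and sequences are placeholders.

```python
class FederatedLookup:
    """Hypothetical read-only view over several alias -> sequence mappings."""

    def __init__(self, sources):
        # sources: mapping of source name -> mapping of alias -> sequence.
        # Insertion order doubles as lookup priority.
        self._sources = dict(sources)

    def fetch(self, alias, start=None, end=None):
        """Return (source name, subsequence) from the first source that knows alias."""
        for name, source in self._sources.items():
            if alias in source:
                return name, source[alias][start:end]
        raise KeyError(alias)


federated = FederatedLookup({
    "main-snapshot": {"ref-chr1": "ACGTACGT"},  # placeholder data
    "lab-custom": {"target-1": "GGCCTT"},       # placeholder data
})
source_name, subseq = federated.fetch("target-1", 0, 3)
```

Returning the source name alongside the sequence keeps provenance visible, which matters once the same alias could exist in more than one source.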

Describe the solution you'd like

My very naive solution would include

Describe alternatives you've considered

A lot of this is possible with manual scripting on top of the existing library. We've done this already for our current MaveDB project, but, assuming we aren't the only ones interested in this kind of use case, it might be better to solidify these functions and build them into the core library.

Additional context

This is quite vague and aspirational. Happy to hear input from others.

jsstevenson commented 9 months ago

I think a lot of this is more in the "could be documented more" bucket than "needs new code". Closing this issue now, may try to come up with more specific subtopics later.

reece commented 9 months ago

Okay. Also note that this is already on the roadmap for #61 and #136.

jsstevenson commented 9 months ago

@reece right, I progressively realized a lot of this was redundant and/or unnecessary. Thanks!