VariantEffect / mavedb-api

MaveDB API
GNU Affero General Public License v3.0

Moving to non-sequential accession numbers #232

Open afrubin opened 2 weeks ago

afrubin commented 2 weeks ago

Many months ago, @jstone-uw raised the idea of moving away from the sequentially-generated accession numbers to something more like a UUID. At the time this seemed like a big departure so we didn't do it, but in hindsight it may solve a couple of problems for us.

First (and more importantly), this would allow us to give a user their final accession while the dataset is still unpublished, since the accession would no longer depend on which other datasets are added in the meantime (and would not need to change at publication). I think this might be the most elegant way to support users who want to include MaveDB accessions in their manuscripts without needing to make the data public yet.

Second, as part of our curation/harmonization push we will want to revisit a few of the existing datasets (like urn:mavedb:00000001) that do not follow the established conventions of one target per experiment set and one experiment per assay. There may be other examples of this "one experiment set per paper" style of data deposition. It feels weird to move experiments out of these sets into new experiment sets that are sequentially numbered.

As for implementation, my initial idea for this is to change the experiment set portion of the MaveDB accession number to be non-sequential and keep the experiment and score set addons if possible, since I really like how this supports discoverability.

However, those suffixes are assigned sequentially when the dataset is made public, so they can't be known ahead of publication, which creates a complication. We could instead use a different random string for each experiment that gets added to the experiment set, and likewise for score sets. Do folks think this retains the utility or adds confusion?
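To make the two options concrete, here is a rough Python sketch, not a proposal for the actual implementation. The token alphabet, the token lengths, the function names, and the `-a-1` style suffix layout are all assumptions for illustration.

```python
import secrets
import string

# Hypothetical token alphabet and length; neither has been decided.
ALPHABET = string.ascii_lowercase + string.digits


def random_token(length: int = 8) -> str:
    """Return a random token that does not depend on any other dataset."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def sequential_child_urn(set_token: str, experiment_index: int, score_set_index: int) -> str:
    """Option A: random experiment set token, sequential add-ons.

    The add-ons are only known once sibling records are published,
    which is the complication described above.
    """
    letter = string.ascii_lowercase[experiment_index]  # 0 -> 'a', 1 -> 'b', ...
    return f"urn:mavedb:{set_token}-{letter}-{score_set_index}"


def random_child_urn(set_token: str, experiment_token: str, score_set_token: str) -> str:
    """Option B: every level gets its own random token, fixed at creation time."""
    return f"urn:mavedb:{set_token}-{experiment_token}-{score_set_token}"


def parent_urn(urn: str) -> str:
    """Drop the last hyphen-delimited component to get the parent record."""
    return urn.rsplit("-", 1)[0]


# Example with fixed tokens so the output is predictable:
# sequential_child_urn("k3x9q2ab", 0, 1) -> "urn:mavedb:k3x9q2ab-a-1"
# random_child_urn("k3x9q2ab", "p7mz", "4tqc") -> "urn:mavedb:k3x9q2ab-p7mz-4tqc"
```

In both layouts the parent can still be recovered by trimming the last component, so the prefix-based discoverability is kept. What Option B gives up is the ability to guess sibling accessions by counting.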

For the strings themselves, we could use UUIDs (as we already use them for temp accessions) or we could consider a more human-readable option, like selecting random words from the friendly words list or some other source. There are several packages that do this, or we could implement our own basic solution. I think a human-readable accession would be easier to share, since there are plenty of circumstances where people will end up typing the accession in.
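As a rough illustration of the two string styles, here is a minimal Python sketch. The `WORDS` list is a tiny stand-in (a real word list would be far larger), and the `urn:mavedb:` prefix on the new style, the hyphen separator, and all function names are assumptions.

```python
import secrets
import uuid

# Tiny placeholder list for illustration only.
WORDS = ["amber", "birch", "cedar", "delta", "ember", "fjord", "grove", "haven"]


def uuid_accession() -> str:
    """UUID-style accession: effectively collision-free, but hard to type."""
    return f"urn:mavedb:{uuid.uuid4()}"


def word_accession(n_words: int = 3, existing=frozenset(), max_attempts: int = 100) -> str:
    """Word-style accession: easier to type, but needs a uniqueness check."""
    for _ in range(max_attempts):
        words = [secrets.choice(WORDS) for _ in range(n_words)]
        candidate = "urn:mavedb:" + "-".join(words)
        if candidate not in existing:
            return candidate
    raise RuntimeError("could not find an unused accession; name space may be too small")


def name_space_size(n_words: int = 3) -> int:
    """Number of distinct word accessions available for a given length."""
    return len(WORDS) ** n_words
```

With the 8-word placeholder list and three words there are only 512 combinations, so a real word list and the check against existing accessions both matter. If the hyphen-delimited experiment and score set add-ons are kept, the words would need a different separator to keep the accession parseable.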

We would obviously need to create new-style accessions for all existing datasets and add some kind of redirect, but my hope is that this would be fairly straightforward.
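The redirect could be as simple as a lookup from old-style to new-style accessions. A minimal Python sketch follows, assuming an in-memory mapping; in practice this would presumably be a database table, and the new accession shown here is made up.

```python
from typing import Dict, Tuple

# Hypothetical mapping; the new-style value is a placeholder, not a real accession.
LEGACY_TO_NEW: Dict[str, str] = {
    "urn:mavedb:00000001": "urn:mavedb:example-new-accession",
}


def resolve_accession(urn: str) -> Tuple[str, bool]:
    """Return (canonical accession, whether a redirect was applied)."""
    new_urn = LEGACY_TO_NEW.get(urn)
    if new_urn is not None:
        return new_urn, True
    # Already new-style, or unknown: hand it back unchanged.
    return urn, False
```

An API handler could call `resolve_accession` first and issue an HTTP redirect when the second value is true. This sketch only handles exact matches, so each legacy experiment and score set accession would need its own entry.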

This is a big change (from the users' perspective at least) so please share your thoughts in this thread.