MassBank / MassBank-web

The web server application and directly connected components for a MassBank web server

Accession code restrictions #11

Closed tsufz closed 1 year ago

tsufz commented 8 years ago

If the DB is refactored anyway, could we consider ditching the accession code restrictions? Of course, we need some conventions, such as the first 3 letters or so for the institution. But who cares whether the rest is an alphanumeric code of any length? I would like to use our internal chemical codes, such as W305, to make managing the codes easier.

meowcat commented 8 years ago

Full support. I would go extreme and raise the limit to 64 alphanumeric characters or so, and give the first 8 characters to describe the institution. There is really no point in limiting ourselves to 8 characters.
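To make the proposal concrete, here is a minimal sketch of a validator for such a scheme. Everything in it is an assumption for illustration: the names `ACCESSION_RE`, `is_valid_accession` and `split_accession` are hypothetical, the institution part is taken to be exactly 8 characters, and the alphabet is restricted to upper-case letters and digits. It is not an agreed MassBank format.

```python
import re

# Hypothetical layout: 8 characters for the institution, then 1 to 56 more,
# for at most 64 characters in total. Upper case and digits only (assumed).
ACCESSION_RE = re.compile(r"[A-Z0-9]{8}[A-Z0-9]{1,56}")


def is_valid_accession(code: str) -> bool:
    """Return True if the whole string matches the assumed layout."""
    return ACCESSION_RE.fullmatch(code) is not None


def split_accession(code: str) -> tuple[str, str]:
    """Split a valid code into (institution part, free part)."""
    if not is_valid_accession(code):
        raise ValueError(f"not a valid accession under this sketch: {code!r}")
    return code[:8], code[8:]


if __name__ == "__main__":
    # "INSTABCD" is a made-up institution prefix; "W305" is the internal
    # code mentioned earlier in this thread.
    print(split_accession("INSTABCDW305"))
```

With a layout like this, an internal code such as W305 could be carried in the free part unchanged.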

The current scheme is very limiting, even just in terms of how much space we can address. I actually thought about this problem a while ago in connection with a discussion with the MoNA guys. My idea for using the space as well as possible was more or less the opposite of what @tsufz proposes (it works within the boundaries of the existing scheme, see link below); however, expanding the accession code length is the better solution.

https://docs.google.com/document/d/1JfFC-Evay8mlMaeYNdMjM14Nij7K7cUQpX4ZjZAqffo/edit

tsufz commented 8 years ago

Interesting discussion in the paper. Why not hash from any (new) tags? If I renew a record and use the same parameters for the input tags, the hash stays the same. The tags are in my compound lists and should not change. We could set a minimum of information that characterises a single record (compound name, instrument, CE, etc.) and give the option to use a limited set of optional labels (internal ID, etc.). The accession could then be hashed from those tags. I would rather use existing tags than generate more and more new tags that have to be curated by the users and the DB systems.
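A rough sketch of what hashing from tags could look like, purely to illustrate the idea: the function name `tag_hash_accession`, the tag names, the prefix "XX" and the choice of SHA-256 are all assumptions, not anything MassBank or RMassBank provides. Note that a truncated hex digest uses only 16 symbols per position and can collide, which is the concern raised in the replies.

```python
import hashlib


def tag_hash_accession(prefix: str, tags: dict, length: int = 8) -> str:
    """Derive a code of `length` characters from a prefix and record tags.

    The tags are serialised in sorted key order, so the same tags always
    give the same code regardless of the order they were supplied in.
    """
    suffix_len = length - len(prefix)
    if suffix_len < 1:
        raise ValueError("prefix leaves no room for a hashed suffix")
    canonical = "\n".join(f"{key}={tags[key]}" for key in sorted(tags))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest().upper()
    # Truncation is what makes collisions possible in practice.
    return prefix + digest[:suffix_len]


if __name__ == "__main__":
    # Illustrative tag values only.
    record = {"compound": "Caffeine", "instrument": "QTOF", "ce": "20"}
    print(tag_hash_accession("XX", record))
```

Re-running this with the same tags reproduces the same code, which is the property described above; it does nothing to guarantee uniqueness across different records.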

Some technical issues: So far the accession id is stored in a Char(8) field and hence larger codes are truncated (16bit says hi). A solution within the 8 digits may be the easiest way to go. Special characters are forbidden as the accession is used to build urls.

meowcat commented 8 years ago

No hashing because there is no collision-free hashing and we can not afford a collision (this would mean that two records have the same accession code -> accession code must be unique in the database.)

> Special characters are forbidden as the accession is used to build urls.

And filenames - therefore also no upper/lower case distinction. So up to now only 36^8 possibilities.
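For scale, the 36^8 figure can be worked out directly. The sketch below also shows what remains per contributor if a 3-character institution prefix is reserved, as suggested at the top of this thread; the prefix length is an assumption for illustration, not the actual MassBank rule.

```python
ALPHABET_SIZE = 36  # 26 case-insensitive letters + 10 digits


def code_space(length: int) -> int:
    """Number of distinct codes of exactly `length` characters."""
    return ALPHABET_SIZE ** length


total = code_space(8)           # all 8-character codes
per_prefix = code_space(8 - 3)  # codes left after an assumed 3-char prefix

if __name__ == "__main__":
    print(f"{total:,}")       # 2,821,109,907,456
    print(f"{per_prefix:,}")  # 60,466,176
```

So the whole space is about 2.8 trillion codes, but a single prefix under this assumption only covers about 60 million.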

> So far the accession id is stored in a Char(8) field and hence larger codes are truncated (16bit says hi). A solution within the 8 digits may be the easiest way to go

True, it is the solution we can currently do without any changes on the MassBank side (even though it doesn't give you all the freedom you would like). However if any database structure changes are being considered, this would be an absolute number 1 priority one.

meowcat commented 8 years ago

> No hashing because there is no collision-free hashing and we can not afford a collision (this would mean that two records have the same accession code -> accession code must be unique in the database.)

Also, the issue is that we would be hashing a relatively densely populated space into a relatively limited hashing space, which would make clashes even more frequent - the odds of NOT having at least one clash will rapidly sink.
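The effect can be estimated with the standard birthday-problem calculation. This sketch assumes hashes are spread uniformly over the available codes; the function name `collision_probability` and the example numbers (10,000 records hashed into 36^5 codes) are illustrative assumptions, not figures from MassBank.

```python
import math


def collision_probability(n_records: int, space_size: int) -> float:
    """Probability that at least two of n uniformly random codes coincide.

    Uses the exact product prod(1 - i/N) for i in 0..n-1, evaluated in
    log space to stay numerically stable.
    """
    if n_records > space_size:
        return 1.0  # pigeonhole: a clash is certain
    log_no_clash = sum(
        math.log1p(-i / space_size) for i in range(n_records)
    )
    return -math.expm1(log_no_clash)


if __name__ == "__main__":
    space = 36 ** 5  # e.g. 5 free characters after a 3-character prefix
    for n in (100, 1_000, 10_000, 100_000):
        print(n, round(collision_probability(n, space), 4))
```

Under these assumptions, a clash is already more likely than not at around ten thousand records, long before the space is anywhere near full.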

tsufz commented 8 years ago

> hashes

Okay, I got it.

> However if any database structure changes are being considered, this would be an absolute number 1 priority one

If we introduce the SPLASH in indexing, we need to touch the DB architecture anyway. In that case, it is no big issue to enhance the indexes with some more interesting and characterising tags (e.g. InChIKey, CSID, CID, etc.). As a follow-up, it should also be possible to enhance the API with those new tags?

m-arita commented 8 years ago

Hi all,

Hashing the mass spectra is ongoing and is almost resolved. Steffen and many others (including me) agree on using a hashing method (SPLASH) like the InChI key for mass spectra.

For this reason I agree to use any kind of IDs as long as we use the same hashing method for metadata sharing.

Best,

Masanori Arita

Masanori Arita (arita@nig.ac.jp) National Institute of Genetics Yata 1111, Mishima City, 411-8540 Shizuoka, Japan Tel: +81-(0)-55-981-9449

sneumann commented 7 years ago

If the accession number scheme changes, make sure to notify http://www.ebi.ac.uk/miriam/main/collections/MIR:00000273 of the change.

tsufz commented 6 years ago

I am running into the limits of the accession code and of how RMassBank handles it. I would prefer it if we could open up the accession code to a prefix followed by an alphanumeric hash or something else, e.g. specific random numbers that are fixed to each code in order to avoid collisions. The existing system creates a lot of work when providing records with multiple collision energies where the experiments were done in different measurements. This is really annoying, and we need this for different projects.
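One way to read the "prefix plus fixed random part" idea is sketched below: draw a random suffix once, check it against the codes already in use, and store the result with the record so it never changes. The function name `new_accession`, the prefix "XX" and the in-memory `existing` set are assumptions for illustration; this is not how RMassBank assigns codes.

```python
import secrets
import string

ALPHABET = string.ascii_uppercase + string.digits  # URL- and filename-safe


def new_accession(prefix: str, existing: set, total_length: int = 8,
                  max_tries: int = 1000) -> str:
    """Return an unused code `prefix + random suffix` and record it.

    `existing` stands in for whatever registry of assigned codes is
    available; the uniqueness check is what rules out collisions.
    """
    suffix_len = total_length - len(prefix)
    if suffix_len < 1:
        raise ValueError("prefix leaves no room for a random suffix")
    for _ in range(max_tries):
        suffix = "".join(secrets.choice(ALPHABET) for _ in range(suffix_len))
        code = prefix + suffix
        if code not in existing:
            existing.add(code)
            return code
    raise RuntimeError("could not find a free code; space may be exhausted")


if __name__ == "__main__":
    assigned = set()
    print([new_accession("XX", assigned) for _ in range(3)])
```

Because each code is stored once it is assigned, spectra of the same compound from different measurements can receive codes independently without any arithmetic on a running number.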

tsufz commented 3 years ago

This issue is also being discussed in https://github.com/MassBank/MassBank-data/issues/84. I suggest continuing the discussion in https://github.com/MassBank/MassBank-data/issues/84 and then using this issue for the implementation.

meier-rene commented 1 year ago

I think this has been addressed with the new accession scheme.