Closed tsufz closed 1 year ago
Full support. I would go extreme and raise the limit to 64 alphanumeric characters or so, and give the first 8 characters to describe the institution. There is really no point in limiting ourselves to 8 characters.
The current scheme is very limiting, even just in terms of how much space we can address. I thought about this problem a while ago in connection with a discussion with the MoNA guys. My idea for using the space as well as possible was more or less the opposite of what @tsufz proposes, since it works within the boundaries of the existing scheme (see link below). However, expanding the accession code length is the better solution.
https://docs.google.com/document/d/1JfFC-Evay8mlMaeYNdMjM14Nij7K7cUQpX4ZjZAqffo/edit
Interesting discussion in the paper. Why not hash from any (new) tags? If I renew a record and use the same parameters for the input tags, the hash is the same. The tags are in my compound lists and should not change. We can set a minimum of information which characterises a single record (compound name, instrument, CE etc.) and give the option to use a limited set of optional labels (internal ID, etc.). With those tags, the accession could be hashed. I would rather use existing tags than generate more and more new tags which have to be curated by the users and the DB systems.
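To make the proposal concrete, here is a minimal sketch of deriving a code from record tags. It is only an illustration under assumptions: the tag names, the choice of SHA-256, and the reduction to 8 base-36 characters are all hypothetical and not part of MassBank or RMassBank.

```python
import hashlib

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def to_base36(n: int, length: int) -> str:
    """Encode a non-negative integer as a fixed-width base-36 string."""
    digits = []
    for _ in range(length):
        n, r = divmod(n, 36)
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))


def accession_from_tags(tags: dict, length: int = 8) -> str:
    """Hash a set of record tags into a fixed-length alphanumeric code.

    The tags are sorted by key so that the same tags always give the
    same code, regardless of the order in which they were entered.
    """
    canonical = "\n".join(f"{key}={tags[key]}" for key in sorted(tags))
    digest = hashlib.sha256(canonical.encode("utf-8")).digest()
    # Reduce the 256-bit digest to the space of `length` base-36 characters.
    value = int.from_bytes(digest, "big") % (36 ** length)
    return to_base36(value, length)


# Hypothetical tag set; the keys are made up for this example.
record = {"name": "Caffeine", "instrument": "QTOF", "ce": "20"}
code = accession_from_tags(record)
```

The same tags always reproduce the same code, which is the property asked for here. Nothing in this sketch prevents two different tag sets from mapping to the same code.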
Some technical issues: So far the accession id is stored in a Char(8) field and hence larger codes are truncated (16bit says hi). A solution within the 8 digits may be the easiest way to go. Special characters are forbidden as the accession is used to build urls.
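A small sketch of why the Char(8) field matters. It assumes the column silently keeps only the first 8 characters; how a real database behaves depends on the engine and its settings. The function name and the example codes are hypothetical.

```python
def store_char8(accession: str) -> str:
    """Mimic a CHAR(8) column that keeps only the first 8 characters."""
    return accession[:8]


# Two distinct 11-character codes (made up for illustration).
first = store_char8("UFZ00012345")
second = store_char8("UFZ00012399")

# After truncation both are stored as the same value, so uniqueness is lost.
clash = first == second
```

Under that assumption, any scheme with longer codes needs the field widened first, otherwise distinct records can end up with the same stored accession.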
No hashing because there is no collision-free hashing and we can not afford a collision (this would mean that two records have the same accession code -> accession code must be unique in the database.)
> Special characters are forbidden as the accession is used to build urls.
And filenames - therefore also no upper/lower case distinction. So up to now only 36^8 possibilities.
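The constraints mentioned so far (8 characters, alphanumeric only, no upper/lower case distinction) can be sketched as a check like the one below. The function names are hypothetical, and this is not how MassBank itself validates codes.

```python
import re

# Exactly 8 characters from A-Z and 0-9, after case folding.
ACCESSION_RE = re.compile(r"[A-Z0-9]{8}")


def normalize(accession: str) -> str:
    """Fold case, since URLs and filenames may not distinguish it."""
    return accession.upper()


def is_valid(accession: str) -> bool:
    """Check a code against the 8-character alphanumeric constraint."""
    return ACCESSION_RE.fullmatch(normalize(accession)) is not None


# 36 symbols (26 letters + 10 digits) in each of 8 positions.
ADDRESS_SPACE = 36 ** 8  # 2,821,109,907,456


def codes_per_prefix(prefix_length: int, total_length: int = 8) -> int:
    """Codes left per institution if the first characters are a fixed prefix."""
    return 36 ** (total_length - prefix_length)
```

For example, `codes_per_prefix(3)` gives the number of codes available to one institution if three of the eight characters are reserved as a prefix.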
> So far the accession id is stored in a Char(8) field and hence larger codes are truncated (16bit says hi). A solution within the 8 digits may be the easiest way to go
True, it is the solution we can currently do without any changes on the MassBank side (even though it doesn't give you all the freedom you would like). However if any database structure changes are being considered, this would be an absolute number 1 priority one.
> No hashing because there is no collision-free hashing and we can not afford a collision (this would mean that two records have the same accession code -> accession code must be unique in the database.)
Also, the issue is that we would be hashing a relatively densely populated space into a relatively limited hashing space, which would make clashes even more frequent - the odds of NOT having at least one clash will rapidly sink.
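The clash argument can be put into numbers with the usual birthday approximation. This is a rough sketch that assumes the codes behave like uniformly random draws from the available space, which real hashed tags only approximate; the function name is hypothetical.

```python
import math


def p_no_collision(n_records: int, space: int) -> float:
    """Approximate probability that n random codes are all distinct.

    Uses the birthday approximation exp(-n(n-1) / (2 * space)),
    which is reasonable as long as n is much smaller than `space`.
    """
    return math.exp(-n_records * (n_records - 1) / (2 * space))


FULL_SPACE = 36 ** 8  # all 8-character alphanumeric codes

for n in (10_000, 100_000, 1_000_000, 10_000_000):
    print(n, p_no_collision(n, FULL_SPACE))
```

With the full 36^8 space this estimate drops to roughly one half at about two million records, and it falls off much sooner if part of the code is reserved for a prefix and only the remaining characters are hashed.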
> hashes
Okay, I got it.
> However if any database structure changes are being considered, this would be an absolute number 1 priority one
If we introduce the SPLASH in indexing, we need to touch the DB architecture anyway. In that case, it is no big issue to enhance the indexes with some more interesting and characterising tags (e.g. InChIKey, CSID, CID, etc.). As a follow-up, it should also be possible to enhance the API with those new tags?
Hi all,
Hashing the mass spectra is ongoing and is almost resolved. Steffen and many others (including me) agree on using a hashing method (SPLASH) like the InChI key for mass spectra.
For this reason I agree to use any kind of IDs as long as we use the same hashing method for metadata sharing.
Best,
Masanori Arita
2016-02-11 21:43 GMT+09:00 meowcat notifications@github.com:
— Reply to this email directly or view it on GitHub https://github.com/MassBank/MassBank-web/issues/11#issuecomment-182845127 .
Masanori Arita (arita@nig.ac.jp) National Institute of Genetics Yata 1111, Mishima City, 411-8540 Shizuoka, Japan Tel: +81-(0)-55-981-9449
If the accession number scheme changes, make sure to notify http://www.ebi.ac.uk/miriam/main/collections/MIR:00000273 of the change.
I have come up against the limits of the accession code and how RMassBank handles it. I would prefer it if we could open up the accession code to a prefix followed by an alphanumeric hash or something else, e.g. specific random numbers that are fixed with each code in order to avoid collisions. The existing system creates a lot of work when providing records with multiple collision energies where the experiments were done in different measurements. This is really annoying, and we need it for different projects.
This issue is also discussed in https://github.com/MassBank/MassBank-data/issues/84. I suggest going ahead with the discussion in https://github.com/MassBank/MassBank-data/issues/84 and then using this issue for the implementation.
I think this has been addressed with the new accession scheme.
If the DB is refactored anyway, could we think about ditching the accession code restrictions? Of course, we need some conventions, such as the first 3 letters or so for the institution. But who cares if an alphanumeric code of any length follows after that? I would like to use our internal chemical codes such as W305 to make management of the codes easier.
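As a sketch of what such a convention could look like: a fixed institution prefix followed by a free-length alphanumeric local code. The 3-letter prefix, the `UFZ` example, and the function name are assumptions for illustration only, not the scheme MassBank adopted.

```python
import re

# Hypothetical convention: 3 letters for the institution,
# then one or more alphanumeric characters chosen by the contributor.
PATTERN = re.compile(r"(?P<institution>[A-Z]{3})(?P<local>[A-Z0-9]+)")


def split_accession(accession: str) -> tuple:
    """Split a code into (institution prefix, local code) or raise ValueError."""
    match = PATTERN.fullmatch(accession.upper())
    if match is None:
        raise ValueError(f"not a valid accession: {accession!r}")
    return match.group("institution"), match.group("local")


# An internal chemical code such as W305 becomes the local part.
institution, local = split_accession("UFZW305")
```

Keeping the character set alphanumeric preserves the URL and filename constraints raised earlier in the thread, while the length of the local part is left open.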