aces / Loris

LORIS is a web-accessible database solution for longitudinal multi-site studies.
GNU General Public License v3.0
143 stars 172 forks source link

[Core] Cleaning up our use of various identifiers #6626

Open johnsaigle opened 4 years ago

johnsaigle commented 4 years ago

Describe the solution you'd like CandIDs are hard-coded to be exactly 6 digit integers between 10000 and 999999.

There should be a configuration option that allows a project to specify a greater size if they wish.

Additional context This issue arose for the UK Biobank project which imports existing IDs from an external dataset. These IDs are 7 digits in length and therefore incompatible with the hard-coded CandID length in LORIS. Extensive overrides were made to the codebase in order to compensate.

6625 is one example.

ridz1208 commented 4 years ago

@johnsaigle is there more than 999999 candidates or is it just for compatibility ?

johnsaigle commented 4 years ago

Compatibility. The IDs come in as example 4356001 and it's a read-only dataset.

driusan commented 4 years ago

Then those aren't LORIS CandIDs, those are other types of identifiers.

johnsaigle commented 4 years ago

I copied this issue from an email with @xlecours and @samirdas. Samir asked that we implement this feature so I volunteered to create an issue, so I'm just the messenger in this case.

I do think it's a good idea for this to be flexible though. For most projects they can use the default 6 but as we move into supporting more open datasets, e.g. in PREVENT-AD and UK Biobank, I'm sure this will come up again.

ridz1208 commented 4 years ago

@johnsaigle I would say it's optional if # of candidates < 999999 and mandatory otherwise...

So in this case it seems like that number can be either a linked external identifier or even PSCID if the project decides so... I dont think the CandID should be "imported" from another system

johnsaigle commented 4 years ago

I see your point that it could be just a PSCID or ExternalID generally. I didn't do the overrides but maybe Xavier could comment on why the IDs were used as CandIDs

xlecours commented 4 years ago

I think it was because we wanted to avoid the semi-random-colision-free candid generation that would have taken a lot of time to find a suitable CandId for the last participants. We decided to take a "know" value; the UKB id.

driusan commented 4 years ago

If there's less than a million but close to a million you can use avoid collisions by using an incrementing number and randomizing the order in your importer. It would be much less work than trying to change the behaviour of CandIDs in LORIS, which is used as a foreign key all over the place.

johnsaigle commented 4 years ago

That's fair. I didn't realize it was locked to 6 integers in the DB too. That would make this a much more serious change.

johnsaigle commented 4 years ago

Discussed in the LORIS meeting today - we're going to have a meeting to sort out ID stuff generally in late June: