OHDSI / CommonDataModel

Definition and DDLs for the OMOP Common Data Model (CDM)
https://ohdsi.github.io/CommonDataModel
888 stars 451 forks source link

[Proposal] Increase varchar(255) limit or change column data type to 'text'/'varchar' #571

Closed clemensrieder closed 1 year ago

clemensrieder commented 1 year ago

I stumbled upon this using the CPT4.jar which would extract the concept_name-values and update the CONCEPT.csv for the CPT-4 vocabulary. It will prune every concept_name to a maximum of 255 characters which fits the .ddl-files. I don't know if this is because of legacy versions of the OMOP CDM or any other database definition language but it would make sense to increase

a) either the varchar(255) to something more reasonable or b) change the data type to text or varchar (without the brackets) so you don't have to specify any maximum length

varchar can hold up to ~ 65k bytes. UTF-8 characters use 1-4 bytes. So in the worst case, ~16k characters could be stored in one field. The varchar(255) is currently present in 17 columns, varchar(250) in 6 columns, varchar(1000) in 1 column, varchar(50) in 54 columns, etc.

One would also have to update the CPT4.jar script to adjust to a new concept_name-length.

Thank you for your consideration!

cgreich commented 1 year ago

The point of the length restriction is to make the concept_names (or other fields in the OMOP CDM) functional: A description should be a description (not a textbook paragraph), and developers should be able to design their screens without having to accommodate absurdly long strings.

If you have longer descriptions you can put them into CONCEPT_SYNONYM. There you have 1000 characters. Everything above - not sure what the use case would be.

clemensrieder commented 1 year ago

The point of the length restriction is to make the concept_names (or other fields in the OMOP CDM) functional: A description should be a description (not a textbook paragraph), and developers should be able to design their screens without having to accommodate absurdly long strings.

If you have longer descriptions you can put them into CONCEPT_SYNONYM. There you have 1000 characters. Everything above - not sure what the use case would be.

I understand your point of view that information should be as precise as possible but in this case, I'm not so sure if it is really helpful to artificially prune an official vocabulary (CPT-4).

a) Convincing the CPT-4 maintainers to be more precise and shorten their texts seems like a futile task to me. b) Searching on the database level becomes less precise because information is removed by the pruning, therefore increasing the chance of more cumbersome queries.

So I would strongly advocate for an increase of the character limit for CONCEPT.concept_name (e.g. varchar(500)) and subsequent modification of the CPT4.jar while leaving all the other constraints as is, following your argumentation.

cgreich commented 1 year ago

I understand you want to create a clean database where nothing is dropped. But the use cases is probably not there. Look: You have 1000 characters in the synonym if you absolutely have to get the full name. But the full name tends not to be that useful. Take CPT4 725120, pruned at 1000 characters:

Assessment of and care planning for a patient with cognitive impairment, requiring an independent historian, in the office or other outpatient, home or domiciliary or rest home, with all of the following required elements: Cognition-focused evaluation including a pertinent history and examination, Medical decision making of moderate or high complexity, Functional assessment (eg, basic and instrumental activities of daily living), including decision-making capacity, Use of standardized instruments for staging of dementia (eg, functional assessment staging test [FAST], clinical dementia rating [CDR]), Medication reconciliation and review for high-risk medications, Evaluation for neuropsychiatric and behavioral symptoms, including depression, including use of standardized screening instrument(s), Evaluation of safety (eg, home), including motor vehicle operation, Identification of caregiver(s), caregiver knowledge, caregiver needs, social supports, and the willingness of caregiver to t...

This is a long list of permitted assessments paid by the government. I would claim everything after "Assessment of and care planning for a patient with cognitive impairment..." is useless from an analytical point of view. In fact, most if not all of these ultra-long codes are assessments or reports, followed by a precise list of what can be claimed.

If we had the result of the assessment - that would be something. But this? Somebody sat down with the patient and filled out a long sheet of paper.

clairblacketer commented 1 year ago

Hi @clemensrieder we need to keep the varchar(255) for our downstream tools that cannot render unlimited concept names as well as not allowing the vocabulary to hog too much memory. However, if you would like to change it in your instance you are more than welcome to do so. Most methods should still work as long as the concept_name stays a varchar.