chorus-ai / Chorus_SOP

ChoRUS centralized SOP documentation site
https://chorus-ai.github.io/Chorus_SOP/
Apache License 2.0

[SOP Document] Multimodal Data Linkage #28

Closed: jshoughtaling closed this issue 4 months ago

jshoughtaling commented 6 months ago

Work to be done for next draft of Linking SOP - TO DO PRIOR TO 16 May

Here's a link to the working branch with running additions: https://github.com/chorus-ai/Chorus_SOP/tree/review-multimodal-linkage

del42 commented 6 months ago

@jshoughtaling Where is the doc draft? Is it a Google Doc?

jshoughtaling commented 6 months ago

@del42 - It's in the /sop-website/docs/Multimodal-Linkage directory on the branch linked above:

https://github.com/chorus-ai/Chorus_SOP/tree/review-multimodal-linkage/sop-website/docs/Multimodal-Linkage

There you'll find the PNG diagram, and the more detailed SOP document that accompanies it.

del42 commented 6 months ago

@jshoughtaling I see it. But how do I add the draft? Should I just give it to you, or should I modify the .mdx file directly?

jshoughtaling commented 6 months ago

@del42 - I created some guidelines in the main README. Feel free to suggest any additions there if they're still not clear.

Briefly:

wa6gz commented 6 months ago

Added a short audience message and spelled out some of the links to other SOPs past and present for file formatting, though more could probably be done here.

jshoughtaling commented 4 months ago

Feedback to be addressed:

  1. Under "What this SOP does not do," there's a reference to data linking after sites have submitted data. Out of curiosity, is there specific, planned linking after submission? Is this referring to potential post-submission Privacy-Preserving Record Linkage?
  2. For what it's worth, I disagree with the recommendation to conflate file_id and procedure_occurrence_id values, which introduces an unnecessary and potentially misleading coupling of different concepts. If they are conflated, then at a minimum I would hope that tools and code are not developed in a way that exploits this value-equivalence across two differently purposed variables/columns. Unfortunately, if it's recommended, then many people will probably write code that requires the value-equivalence between file and procedure occurrence identifiers.
  3. Is the separation of blocks of values for image file identifiers vs. waveform file identifiers necessary? Or is that an artifact of assuming that file id assignments of one might not be "aware" of the other? Put differently, is it required that all file ids draw from a common range of global file identifiers?
  4. In our databases, including our main OMOP instances, we have needed to convert many IDs to bigint and were concerned about the potential of some existing programs or OHDSI tools truncating them. I'd suggest assuming bigint for new initiatives to avoid the accidental development of tools that use narrower integer representations, which will break when the scale of data inevitably grows and requires bigint. As I mentioned, we're already seeing it.
  5. I don't think "Be an integer" should be a sub-item of "IF you are using file_id value as procedure_occurrence_id." If it's a procedure_occurrence_id value, then it has to be an integer anyway. Since all other OMOP ids are integers, and since you're thinking of creating an OMOP extension specification, I'd suggest requiring it to be an integer generally.
  6. Will the intended "real-world" idea of how to optimally group files be specified? Providing guidance on how to group files optimally would be beneficial. It ensures uniformity and helps sites understand the best practices for data organization, which is crucial for downstream processing.
  7. Regarding procedure_concept_ids for Imaging Procedure and Monitoring Procedure, if sites aren't already mapping to more granular concepts under those two broad ancestor concepts, I recommend assuming more granular mapping with respect to how code is written and new tools are developed -- i.e., even if most sites are mapping to those two general concepts, they should be approached through the concept hierarchy, guaranteeing appropriate rollup of more granular mappings, from the beginning.
  8. Regarding procedure_source_value, I'd suggest specifying a concatenation pattern or, alternatively, a metadata standard of specifying the concatenation pattern so that generic code can be written. Standardizing the concatenation pattern can enhance consistency and facilitate the development of generic code, making the system more robust and interoperable.
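
To make points 4 and 8 in the list above more concrete, here is a minimal Python sketch, not part of the SOP. It shows (a) a check that flags identifiers too large for a signed 32-bit integer, and (b) a declared concatenation pattern for procedure_source_value that generic code can both build and parse. The field names (`modality`, `site_code`, `local_id`), the `|` delimiter, and all function names are hypothetical, chosen only for illustration.

```python
from typing import Dict, List

INT32_MAX = 2**31 - 1  # largest signed 32-bit integer
INT64_MAX = 2**63 - 1  # largest signed 64-bit integer (bigint)


def id_width(value: int) -> str:
    """Return the narrowest signed integer type that can hold a non-negative ID."""
    if not isinstance(value, int) or isinstance(value, bool) or value < 0:
        raise ValueError(f"ID must be a non-negative integer, got {value!r}")
    if value <= INT32_MAX:
        return "int32"
    if value <= INT64_MAX:
        return "bigint"
    raise ValueError(f"ID {value} does not fit in a bigint")


# Hypothetical pattern: the ordered fields and delimiter are assumptions,
# standing in for whatever pattern (or pattern metadata) the SOP specifies.
SOURCE_VALUE_FIELDS: List[str] = ["modality", "site_code", "local_id"]
SOURCE_VALUE_DELIMITER = "|"


def build_source_value(parts: Dict[str, str]) -> str:
    """Concatenate the declared fields, in order, into one source value."""
    values = [str(parts[name]) for name in SOURCE_VALUE_FIELDS]
    for value in values:
        if SOURCE_VALUE_DELIMITER in value:
            raise ValueError(f"Field value {value!r} contains the delimiter")
    return SOURCE_VALUE_DELIMITER.join(values)


def parse_source_value(source_value: str) -> Dict[str, str]:
    """Split a source value back into its declared fields."""
    values = source_value.split(SOURCE_VALUE_DELIMITER)
    if len(values) != len(SOURCE_VALUE_FIELDS):
        raise ValueError(f"Expected {len(SOURCE_VALUE_FIELDS)} fields, got {len(values)}")
    return dict(zip(SOURCE_VALUE_FIELDS, values))
```

Because the pattern lives in two module-level constants, a site or tool could swap in a different field order or delimiter without touching the build and parse logic, which is the kind of generic code point 8 asks for. The width check is only a guard for pipelines that might silently narrow identifiers; it does not replace declaring the columns as bigint.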