OHDSI / OncologyWG

Oncology Working Group Repository
https://ohdsi.github.io/OncologyWG
Apache License 2.0

Concatenation strings & labels for group staging terms do not follow convention #637

Open gkennos opened 6 months ago

gkennos commented 6 months ago

Convention for group stage terms is to use roman numerals to distinguish from arabic numerals for TNM staging.

Currently all Cancer Modifier group stages are labelled using 1,2,3,4 etc but should be I, II, III, IV...

e.g. https://athena.ohdsi.org/search-terms/terms/1635814

cgreich commented 6 months ago

@gkennos: Good point. But in sources the convention is weak, and roman and arabic numerals are mixed up. We explicitly cleaned that up when we put it in. The decision to use arabic was very pragmatic: Search is impossible with the roman notation (try getting stage II and not stage III). However, we should make that very clear. If we had Athena development resources, we could teach the tool to accept both.

gkennos commented 6 months ago

Not sure if the search has been updated since that decision, but it works for me? here

Screenshot 2024-03-15 at 4 46 01 pm

cgreich commented 6 months ago

Well, your screenshot is nicely cutting off the right margin with the vocabulary names. All those you listed are imported vocabularies. We are not fixing, say, SNOMED. In the Cancer Modifier vocabulary, which we authored ourselves, all stages are Arabic.

gkennos commented 6 months ago

Apologies, I was trying to crop out my user name, but see below for completeness (or in the provided link above).

The screenshot was only attached to show that search is functional, not to make any point of the included terms, but the cancer modifiers are definitely arabic numerals so I am not sure what you mean? here

Screenshot 2024-03-15 at 11 16 01 pm

cgreich commented 6 months ago

What's wrong with your user name? :)

I see what you mean. Well, Athena is trying to be smart. Looks like in this case it is. But it generally struggles to create a reasonable list, and we have been futzing around with the various weights for partial words, upper and lower case, edit distance, all that. It lacks Google's background information, so the search can only tinker with the search string.

But if you go to Atlas, which uses simple SQL, you get all the stage IIIs. Same is true for any other application that doesn't have a smart search engine.
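A minimal sketch of that substring behaviour, assuming a plain `LIKE '%term%'` filter (the exact query Atlas issues is not shown in this thread). The in-memory table and the concept names in it are hypothetical stand-ins, not real vocabulary content:

```python
import sqlite3

# Hypothetical miniature of a concept table; names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE concept (concept_name TEXT)")
names = ["Stage I", "Stage II", "Stage III", "Stage IV",
         "Stage 1", "Stage 2", "Stage 3", "Stage 4"]
conn.executemany("INSERT INTO concept VALUES (?)", [(n,) for n in names])


def like_search(term):
    """Return concept names containing `term`, as a plain LIKE filter would."""
    rows = conn.execute(
        "SELECT concept_name FROM concept "
        "WHERE concept_name LIKE ? ORDER BY concept_name",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]


roman_hits = like_search("Stage II")   # also picks up "Stage III"
arabic_hits = like_search("Stage 2")   # only the Arabic "Stage 2" label

print(roman_hits)
print(arabic_hits)
```

With substage labels such as "Stage 2A" in the table, the Arabic search would still return those as well; the sketch only shows that "II" is a prefix of "III" while "2" is not a prefix of "3".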

What do you have against the Arabic numbers? What use case is suffering?

gkennos commented 6 months ago

Nothing - just habit of cropping out anything that is my full legal name if not necessary.

The use-case that is suffering IMO, is the one of someone who is creating a new mapping and searches for 'Stage IV', as that is the term in the actual vocabulary, but cannot find it because it's been changed to 'Stage 4' as a technical workaround known only to a few people.

I understand that when you say both of those terms out loud, they are both 'stage four' of course. I would, however, argue there is a semantic difference, as the roman numerals are the ones expressly used by the AJCC standard (which the concatenation rules explicitly state is used as the source), and this is done to some degree for the purposes of differentiating those terms from other common staging measures for clarity, as 'stage' is quite an overloaded term.

If it is required as a technical workaround, would the more appropriate way to handle it be to either insert 'Stage 4' as a synonym or create a non-standard term that maps to the standard (vocabulary-specified) term?
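To make the two options concrete, here is a rough sketch using plain dictionaries. The concept IDs, the labels, and the function names are all hypothetical; the real vocabulary tables would hold this information, and nothing here reflects how Athena actually resolves terms:

```python
# Hypothetical IDs and labels, for illustration only.
STANDARD = {9001: "Stage IV"}          # standard concept with the Roman label

# Option A: the Arabic spelling is stored as a synonym of the standard concept.
SYNONYMS = {"stage 4": 9001}

# Option B: the Arabic spelling is its own non-standard concept
# that maps to the standard one.
NON_STANDARD = {9002: "Stage 4"}
MAPS_TO = {9002: 9001}


def _find_by_name(concepts, term):
    """Return the ID of the concept whose name equals `term`, ignoring case."""
    key = term.strip().lower()
    for concept_id, name in concepts.items():
        if name.lower() == key:
            return concept_id
    return None


def resolve_via_synonym(term):
    """Option A: try the standard name first, then the synonym list."""
    hit = _find_by_name(STANDARD, term)
    if hit is not None:
        return hit
    return SYNONYMS.get(term.strip().lower())


def resolve_via_mapping(term):
    """Option B: try the standard name first, then follow the mapping."""
    hit = _find_by_name(STANDARD, term)
    if hit is not None:
        return hit
    non_standard_id = _find_by_name(NON_STANDARD, term)
    return MAPS_TO.get(non_standard_id)


print(resolve_via_synonym("Stage 4"), resolve_via_mapping("Stage 4"))
```

Either way, a search for 'Stage 4' or 'Stage IV' ends up at the same standard concept; the difference is whether the Arabic spelling exists as a concept in its own right.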

cgreich commented 6 months ago

All understood and agreed, @gkennos. But the Arabic/Roman craziness is something we inherited from the existing vocabularies. So the trouble you describe, not finding what you are looking for, is already there. For example, there are 798 records containing "stage IV" across all vocabularies except Cancer Modifier, and 207 containing "stage 4". It's true that in oncology the Roman dominates over the Arabic, but even in NAACCR you have 6 "stage 4" in addition to 100 "stage IV". Now what?

In Cancer Modifier, we decided to at least have it one way. And yes, it might be surprising to someone used to the Roman as it is used more. But if your search returns no result for "stage IV" you will probably figure it out (e.g. trying just "stage").

Your idea with the synonym is good. However, then we may as well go Roman (and deal with not being able to distinguish between Roman I, II, III, IV, VI and VII when searching for "I"), because if we put them into the synonyms they will turn up in the search results of Athena. We could also add special code to the Athena search engine to do automatic search term modification for Roman numbers. To have it documented would also be a win, but where would that documentation help? Nobody reads the manual before conducting a search.
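As a rough idea of what such automatic search term modification could look like, the sketch below rewrites a Roman group stage in the query to the Arabic form before searching. It is not Athena code: the function name, the restriction to stages I through IV, and the optional A to D substage suffix are all assumptions made for illustration.

```python
import re

# Assumed range: group stages I to IV only.
ROMAN_TO_ARABIC = {"I": "1", "II": "2", "III": "3", "IV": "4"}

# "stage", whitespace, a Roman numeral, then an optional substage letter.
# Longer numerals come first so "III" is not read as "II" plus a stray "I".
STAGE_PATTERN = re.compile(r"\b(stage)\s+(IV|III|II|I)([A-D]?)\b", re.IGNORECASE)


def normalize_stage_query(query):
    """Rewrite 'stage <Roman>' to 'stage <Arabic>'; leave other text alone."""
    def replace(match):
        word, roman, suffix = match.groups()
        return f"{word} {ROMAN_TO_ARABIC[roman.upper()]}{suffix.upper()}"
    return STAGE_PATTERN.sub(replace, query)


print(normalize_stage_query("Stage IV"))
print(normalize_stage_query("stage IIIA"))
```

Because the numeral is only rewritten when it directly follows the word "stage", a query for a bare "I" is left untouched, so this does not solve the ambiguity for people who search on the numeral alone.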

Not sure what the best solution is other than letting people figure it out.