Open antsh3k opened 3 months ago
I've got these ones locally from June and they seem to still produce the release date reliably:
SnomedCT_InternationalRF2_PRODUCTION_20240201T120000Z
SnomedCT_InternationalRF2_PRODUCTION_20240601T120000Z
SnomedCT_UKClinicalRF2_PRODUCTION_20240410T000001Z
SnomedCT_UKClinicalRefsetsRF2_PRODUCTION_20240410T000001Z
SnomedCT_UKDrugRF2_PRODUCTION_20240508T000001Z
SnomedCT_UKEditionRF2_PRODUCTION_20240410T000001Z
SnomedCT_UKEditionRF2_PRODUCTION_20240508T000001Z
SnomedCT_Release_AU1000036_20240630T120000Z
Which versions does the new naming convention start with? And what does it look like?
You are right, nothing has changed. I think what I was alluding to was using the folder one level up; I may simply have been pointing at the wrong level. In that case we need to throw an error, because a wrong path can currently pass through without raising one.
For example, for the following names, this convention does not work.
uk_sct2cl_38.2.0_20240605000001Z
uk_sct2cl_32.6.0_20211027000001Z
So the above code is brittle anyway. Would
data_path.split('_')[3]
make more sense, perhaps with different split indices tried out?
Yeah, I think we should be able to match the folder basename with regex and pull the third group:
^SnomedCT_([A-Za-z0-9]+)_([A-Za-z0-9]+)_(\d{8}T\d{6}Z$)
If there's no match, we can raise an exception. If there's a match, we can take the first 8 characters of the third group.
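A minimal sketch of that idea, in case it helps the discussion. The function name `get_release_date` and the choice of `ValueError` are assumptions for illustration, not what `preprocess_snomed.py` currently does; the pattern is the one quoted above.

```python
import os
import re

# Pattern as proposed in this comment (the trailing $ sits inside group 3,
# which does not change what the pattern matches).
RELEASE_DIR_RE = re.compile(
    r"^SnomedCT_([A-Za-z0-9]+)_([A-Za-z0-9]+)_(\d{8}T\d{6}Z$)"
)


def get_release_date(data_path: str) -> str:
    """Hypothetical helper: return the YYYYMMDD release date of a folder."""
    # normpath drops any trailing separator so basename is the folder name.
    basename = os.path.basename(os.path.normpath(data_path))
    match = RELEASE_DIR_RE.match(basename)
    if match is None:
        # Fail loudly instead of letting an unexpected name pass through.
        raise ValueError(
            f"Unrecognised SNOMED CT release folder name: {basename!r}"
        )
    # Group 3 is e.g. '20240201T120000Z'; the first 8 characters are the date.
    return match.group(3)[:8]
```

With this pattern, the `SnomedCT_...` folders listed earlier return a date, while the `uk_sct2cl_...` names raise, so those would need their own pattern if they are meant to be supported.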
EDIT:
Just to comment on the splitting - it wouldn't necessarily catch a weirdly named folder. It would work for anything with at least 3 `_`s.
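To illustrate the point about splitting, here is a quick sketch. The helper `split_field` and the folder name `my_old_backup_folder_v2` are made up for the example; the other two names are taken from this thread.

```python
def split_field(name: str) -> str:
    """Hypothetical helper mirroring data_path.split('_')[3]."""
    return name.split("_")[3]


# A standard release name and a uk_sct2cl name both yield a field,
# but the two fields have different shapes (with and without the 'T').
international = split_field("SnomedCT_InternationalRF2_PRODUCTION_20240201T120000Z")
uk_clinical = split_field("uk_sct2cl_38.2.0_20240605000001Z")

# An unrelated folder with enough underscores passes through silently.
unrelated = split_field("my_old_backup_folder_v2")  # made-up name

# Only names with fewer than 3 underscores fail, and with an IndexError
# rather than a descriptive message.
try:
    split_field("SnomedCT_UKEditionRF2")
    too_short_failed = False
except IndexError:
    too_short_failed = True
```

So splitting alone would still need a check on the extracted field before it is treated as a date.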
Due to changes in the naming convention of SNOMED CT release files, the release date no longer falls within this exact character range.
https://github.com/CogStack/MedCAT/blob/96706c8a88ad0cf2df1a143311a5904d6d8a78ec/medcat/utils/preprocess_snomed.py#L80
Needs to be checked with different releases and extensions before merging.