provide changes from v6 to v8 [openaire]

sergiocontrino commented 7 months ago

Check changes [mdr v6](https://zenodo.org/records/5554961) vs [mdr v8](https://zenodo.org/records/8368709)

These appear to be new fields in version 8, see [a quick comparison from zenodo papers](https://docs.google.com/spreadsheets/d/1WvhrlLMBorOJpIfvgmR9KLCcz5WvN6a5xROshhIqm0k/edit?usp=sharing)

- A.16 Study Countries (0..n)
- A.17 Study Sites (0..n)
- A.18 Study Start Time (0..1)
- A.20 Study Conditions (0..n)
- A.21 Study ICD (0..n)
- A.22 Study People (0..n)
- A.23 Study Organisations (0..n)
- A.24 Study IEC (0..n)
- A.25 Study IEC level (1)

Also changes to contributors, to be investigated:

| Creators and Contributors V8 | Creators and Contributors V6 |
| --- | --- |
| C.3 Object People (0..n) | C.1 Creators (1...n) |
| C.4. Study Organisations (0..n) | C.2 Contributors (0...n) |

There is actually a log of changes for json files between version 6 and 7, which seems to confirm the above list.

6->7 (json files) [wiki](https://ecrin-mdr.online/index.php/JSON_files_v6_to_v7_changes)

- Addition of country and location attributes
- For ‘topic’ records – both study_topics and object_topics – the original controlled terminology (CT) code and controlled terminology code have been restored to the schema
- Inclusion of study start time
- study contributors are part of the study record; direct object contributors are part of the object data

sergiocontrino commented 7 months ago

from the mdr documentation:

Changes, version 7-8

Last edited: 21/09/2023

The most recent revision of the schema, to version 8, took place in the context of a general re-write of the entire MDR. As such it was a good opportunity to introduce some relatively substantial changes to the schema, some of which had been ‘pending’ for a long time. Several of these changes involved adding additional data points, and some involved splitting data up, partly to reduce the volume of very large tables in the system, partly to clarify the purpose of some data points. It is hoped that this will be the last major revision of the schema.

General changes

Splitting Contributor Types

Following DataCite, we have always used ‘Contributors’ as a general term for all contributor types, whether they be individuals (e.g. study leads or authors) or organisations (e.g. study sponsors or funders), and stored them all together. The difficulty with this is not only that it can produce some very large contributor tables, but also that the data stored for each of the two contributor types is different. In particular, an organisation contributor record has empty name, ORCID and affiliation fields, so such records can waste database and system resources and are more difficult to read. In addition, while a few contributor types can apply to both organisations and people, in general the contributor data for organisations and people is collected and processed separately.

The decision was therefore taken to split the storage of these two types of contributor into two separate tables, allowing each, but especially the organisation contributor type, to be simplified. In particular, the boolean field that indicated whether the contributor was an individual or not is no longer needed. Though not strictly necessary, it was also decided to split the two contributor types in the schema as well, because data is likely to be presented as two distinct categories of contributor rather than all together. Having said that, if data usage required it, it would be straightforward to recombine the data back into generic contributor records, as and when necessary.

The way in which people and organisation data is returned within JSON statements was also simplified. All the changes apply to the contributor data for both studies and objects.
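
The recombination mentioned above could look something like the following minimal sketch. The class and field names (`PersonContributor`, `OrgContributor`, `is_individual` and so on) are assumptions made for illustration; they are not taken from the MDR schema or code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PersonContributor:
    # Hypothetical shape of a record in the people table
    contrib_type: str
    full_name: str
    orcid: Optional[str] = None
    affiliation: Optional[str] = None


@dataclass
class OrgContributor:
    # Hypothetical shape of a record in the organisations table
    contrib_type: str
    org_name: str


def to_generic(people, orgs):
    """Merge both lists into generic records carrying an is_individual flag."""
    generic = []
    for p in people:
        generic.append({
            "contrib_type": p.contrib_type,
            "is_individual": True,
            "person_full_name": p.full_name,
            "orcid": p.orcid,
            "affiliation": p.affiliation,
            "organisation_name": None,
        })
    for o in orgs:
        # Organisation rows leave the person-only fields empty,
        # which is the redundancy the split was meant to remove.
        generic.append({
            "contrib_type": o.contrib_type,
            "is_individual": False,
            "person_full_name": None,
            "orcid": None,
            "affiliation": None,
            "organisation_name": o.org_name,
        })
    return generic
```

The empty person fields on every organisation row illustrate why the combined table was wasteful in the first place.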

Changing the Date String format

Dates and date ranges in the MDR are stored and can be represented by 6 numbers (year, month and day for each of start and end dates) but they have also always been represented by a string, because in many contexts (e.g. the date an identifier was applied) a date is simply a single information point, is not used for filtering or sorting, and is easier to read on screen as a conventional date string.

The problem is that there are several such ‘conventional date strings’, which are culture dependent. Some are also notoriously ambiguous. Since its inception the MDR chose the date string format used by ClinicalTrials.gov as its ‘standard’ and has presented all dates, when used as strings, in this format. It is, however, a rather unusual format, in the form YYYY MMM dd, i.e. a 4 digit year followed by a three character month abbreviation and then the day number, e.g. ‘2017 Jun 6’. It is similar to the ISO-8601 standard (YYYY-MM-dd) but with the month abbreviation replacing the month number.

In early 2023 ClinicalTrials.gov announced that they were changing to a new date representation for the new versions of their web site and their API. It is the ISO-8601 format. This is welcome news and makes date processing from this source easier, but the YYYY-MM-dd format is still relatively unfamiliar to most people as a string representation of a date. The opportunity was therefore taken to change the MDR’s string representation to the European convention of dd MMM YYYY, e.g. ‘6 Jun 2017’. It was felt that this would be much more natural and readable for most users.

This change applies to both study and object data, wherever a string representation of a date is used. It is not a change in the schema per se, just a change in the way some of the data is presented.
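
As an illustration of the format change, the sketch below converts an old-style string to the new one and builds the new string from stored year, month and day numbers. The function names are hypothetical, and only complete dates are handled; how the MDR treats partial dates is not described here.

```python
# English three-letter month abbreviations, as used in both string formats
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def old_to_new_date_string(old: str) -> str:
    """Convert 'YYYY MMM dd' (e.g. '2017 Jun 6') to 'dd MMM YYYY'."""
    year, month, day = old.split()
    if month not in MONTHS:
        raise ValueError(f"unrecognised month abbreviation: {month!r}")
    return f"{int(day)} {month} {int(year)}"


def numbers_to_date_string(year: int, month: int, day: int) -> str:
    """Build the 'dd MMM YYYY' string from stored year, month and day numbers."""
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return f"{day} {MONTHS[month - 1]} {year}"
```

Using a fixed tuple of month abbreviations, rather than locale-dependent date parsing, keeps the output the same regardless of the machine's language settings.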

Simplification of Topic data

Topic data had gone through various changes in recent revisions. One additional simplification was included this time, when the mesh_coded boolean field was removed from all topic records. It was felt superfluous as the presence or absence of a MESH code would indicate whether coding had been possible or not. In addition, the structures used to return topic data in JSON statements were modified, so that related fields are more clearly grouped.

The changes should have been made as part of the previous revision, so this is just completing the process begun there. It affects both study and object topic data. In addition, as described below, the study topics dealing with ‘conditions’ were split off into a new table.

Study data changes

Addition of IEC Data

Perhaps the biggest single change to the schema on this revision - certainly in terms of data volumes - was the addition of inclusion and exclusion criteria. Like the addition of geographic data in the previous revision, this was triggered by an external request, and in particular the involvement of ECRIN within the Horizon Europe EOSC 4 Cancer project. The MDR now not only ingests inclusion and exclusion criteria (IEC), it also processes them to try to identify - so far as possible - the individual criteria statements. The process is not perfect because of the huge variability in the ways in which IEC are expressed in trial registry entries (they are usually cut and pasted from protocol documents) but about 80% of the IEC statements are correctly identified. The rest may be over-fragmented or not split properly from neighbouring criteria.

Each IEC record includes, as well as the criterion text and a reference to the parent study, a sequence number to indicate its position in the overall list of criteria, a ‘type’ code that indicates if it is an inclusion or exclusion criterion (or header, or supplementary statement), any leading character or set of characters, e.g. a number, letter or bullet, an indication of how splitting occurred, e.g. on a carriage return, or on a leading character within a longer string, and an indication of the criterion’s indentation level, as criteria often include nested sets of sub-criteria. In addition the study record now includes a numerical field called iec_level, which indicates the granularity of the inclusion and exclusion criteria (considered separately) present in the source data.
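
To make the record structure more concrete, here is a toy sketch of an IEC record and a naive splitter. It is not the MDR's actual processing, which is considerably more involved; the field names, the `"cr"` split code, the two-spaces-per-level indentation rule and the leader regex are all assumptions for illustration.

```python
import re
from dataclasses import dataclass

# Leading number or single letter followed by "." or ")", or a bullet character
LEADER_RE = re.compile(r"^\s*((?:\d+|[A-Za-z])[.)]|[-*\u2022])\s+")


@dataclass
class IecRecord:
    study_id: str
    seq_num: int       # position in the overall list of criteria
    iec_type: str      # e.g. "inclusion" or "exclusion" (hypothetical codes)
    leader: str        # leading number, letter or bullet ("" if none)
    split_type: str    # how the statement was separated; always "cr" here
    indent_level: int  # nesting depth, estimated from leading spaces
    text: str


def split_criteria(study_id: str, iec_type: str, raw: str) -> list:
    """Split a block of criteria text on line breaks into IecRecord objects."""
    records = []
    for line in raw.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        indent = (len(line) - len(line.lstrip(" "))) // 2
        match = LEADER_RE.match(line)
        leader = match.group(1) if match else ""
        text = line[match.end():] if match else line
        records.append(IecRecord(study_id, len(records) + 1, iec_type,
                                 leader, "cr", indent, text.strip()))
    return records
```

Splitting only on line breaks is the simplest case; criteria embedded in a single long string, which the text above also mentions, would need additional rules.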

The main difficulties with the IEC data have been the complexity of the processing required and the volume of the data. In total, about 9.5 million IEC records were obtained from the studies in the MDR. This data is not aggregated with the rest of the study data - the volumes involved would slow that process down considerably and lead to a huge final database. Instead the IEC data is aggregated from each source to a separate ‘iec’ database. In addition, the IEC data from different versions of the same study is left as it is, with no attempt to rationalise it into a single list. Whilst a human could easily do that, it is well beyond the capabilities of a machine without a deep semantic analysis of each criterion statement. If a study is in more than one trial registry, then the IEC data from the ‘most preferred’ study would be displayed.

Note that the Study JSON definition does not contain a reference to the IEC data. This reflects the fact that this data is not normally returned with the rest of the study data. Instead the intention is to return it as a set of records for each study, on demand. A JSON definition of the returned IEC data is provided under the study JSON definition - it is a simple flat structure that directly reflects that of the source records.

Splitting off of ‘Conditions’ topic data

The extensive study topic data in the MDR, sometimes put in as ‘key words’, sometimes pre-coded as MESH codes, sometimes listed as ‘interventions’, or ‘conditions’ in the source data, has always been heterogeneous. An important subset of this data (and for many WHO registries it is the only ‘topic data’ included) are the ‘conditions’ that a study is aimed at treating, or is in some way about. Experience has also shown that ‘conditions’ are important to users - it is often the studies related to a particular diagnosis that they are interested in finding. To help split increasingly large topic tables into more manageable sizes, and to make searching for studies related to specific conditions easier, it was decided to split the conditions topic data from the rest, both storing it in a separate table and identifying it as a separate data point within the metadata schema.

The fundamental problem with condition related data is that it is expressed in a variety of ways, using a variety of different controlled vocabularies, or none at all, and with a very wide range of granularity. A listed condition may be as wide as ‘cancer’, or as narrow as ‘Early-Stage Breast Cancer with Tumor PIK3CA Genotype’. This makes searching for a particular diagnosis very problematic - it might be subsumed in the wider categories being used, or itself subsume some of the narrower categories. A straightforward matching of terms is therefore very likely to miss many relevant entries. To start to construct a mechanism whereby users can search at a consistent level, travelling up the hierarchy or tree of conditions if desired, it was decided to try to code condition data using ICD 11 stem codes. These are the approximately 5000 4 character codes used within ICD that provide a reasonable level of differentiation of medical conditions without overwhelming the user with all the possible variations and sub-types that have been recognised. Rather than using MESH coding, therefore, condition data is coded, where possible, using ICD 11, and an ICD code and term are included as fields in the study conditions table. Because mapping tables to ICD 11 are relatively rare, many condition terms have to be mapped to ICD 11 codes manually. This process has begun but is likely to take several months before it reaches reasonable (75% - 80%) levels of coverage.

Introduction of ICD data

One of the issues that becomes clear once ICD coding is done is that a study may list 5 or 6 different relevant conditions, but collectively these map to only 1 or 2 different ICD stem codes. Because the ultimate aim is to allow a user to input a stem code, or group of such codes, as an efficient search parameter - perhaps using an ICD browser to identify the relevant codes - the system needs ‘ICD data’ that relates each study only to the distinct ICD codes that it is linked to. Thus, while the system retains all the original condition terms, which are available for text based searching, a separate study_icd table was also established, to support this type of searching in the future.

The study_icd table contains only study Ids and the linked ICD codes and terms. It is created after the end of the aggregation process, so that studies with multiple registry entries are all considered together. Note that the ICD codes within the study conditions data then become redundant and do not need to be returned as part of that list (though they remain within the source database tables). In effect the system is designed to return two types of conditions data - one listing all the condition terms that have been used, and a second, often smaller, listing of the corresponding ICD codes and terms. The levels of ICD coding will need to increase substantially, however, before this system becomes fully functional and useful as the basis of a search mechanism. The collection and organisation of the present data should be seen as part of the preparation for that step.
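
The derivation of distinct study-to-ICD links from the condition records can be sketched as below. The dictionary keys are assumed names, not the actual column names, and the codes used in the usage example are placeholders rather than real ICD 11 codes.

```python
def build_study_icd(study_conditions):
    """Return distinct (study, ICD code) rows from a list of condition records.

    Conditions without an ICD code are skipped; rows keep first-seen order.
    """
    seen = set()
    rows = []
    for cond in study_conditions:
        code = cond.get("icd_code")
        if not code:
            continue  # condition term could not (yet) be coded
        key = (cond["study_id"], code)
        if key not in seen:
            seen.add(key)
            rows.append({
                "study_id": cond["study_id"],
                "icd_code": code,
                "icd_term": cond.get("icd_term"),
            })
    return rows


# Placeholder codes "XX01" / "XX02" are NOT real ICD 11 stem codes
conditions = [
    {"study_id": "S1", "term": "condition a", "icd_code": "XX01", "icd_term": "Group 1"},
    {"study_id": "S1", "term": "condition b", "icd_code": "XX01", "icd_term": "Group 1"},
    {"study_id": "S1", "term": "condition c", "icd_code": "XX02", "icd_term": "Group 2"},
    {"study_id": "S1", "term": "condition d", "icd_code": None, "icd_term": None},
]
study_icd = build_study_icd(conditions)
```

Here four listed condition terms reduce to two study_icd rows, which mirrors the 'many conditions, few stem codes' situation described above.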

Simplification of Country and Site data

The way in which Country and Site data is returned within JSON strings was simplified, and made ‘flatter’.

Data Object data changes

Simplification of object title data

This relates to the problem that while studies have one, and often two or more real titles, one of which is selected as the ‘display title’, most data objects, with the exception of journal papers, do not have a title at all. The system has therefore always created a ‘display title’ for objects, essentially what is shown in the system as a header for object related data. In most cases the manufactured display title is an amalgam of the study name and the object type: Study X :: Protocol, Study Y :: Trial Registry Entry. In cases where an object does have a name, the study name is still used to differentiate the object from other similarly named objects and to clearly place it with the source study, e.g. Study A :: Protocol 001, or Study B :: 2021 Results Summary. Ironically, for journal articles, which do have a real title, the display_title is a full citation, as generated by the MDR from the title, the author data, and the journal / issue data, in a fixed format. This is because journal papers are normally cited and viewed as citations rather than just using the paper’s title, and are also often found in source data in a citation format. Sample collection objects also have a collection name, but this is non-unique and so variable in form that it is a poor basis for the display_title. Instead such a title is constructed using the sample id and the biobank name.
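
The ‘Study X :: Protocol’ pattern for ordinary objects can be sketched as follows. The function name and parameters are hypothetical, and the special cases for journal articles (full citation) and sample collections (sample id plus biobank name) are deliberately not covered.

```python
from typing import Optional


def object_display_title(study_name: str, object_type: str,
                         object_name: Optional[str] = None) -> str:
    """Build a display title of the form '<study name> :: <object name or type>'.

    If the object has a real name it is used; otherwise the object type is.
    """
    suffix = object_name if object_name else object_type
    return f"{study_name} :: {suffix}"


untitled = object_display_title("Study X", "Protocol")
named = object_display_title("Study A", "Protocol", "Protocol 001")
```

In this sketch `untitled` is `"Study X :: Protocol"` and `named` is `"Study A :: Protocol 001"`, matching the examples given in the text.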

There is also an object_titles table that traditionally has collected details of both real and constructed names for data objects. For most objects, however (i.e. with the exceptions of journal articles, sample collections, and a small minority of documents), these object titles are a fiction and present no additional data over the display_title.

The decision was therefore to drop the purely constructed study name + object type ‘titles’ from the object_titles table and use it only to store real object titles, whether or not those titles are part of the display_title. That greatly reduces the size of this table in most cases, and makes its purpose clearer. The titles data can therefore be presented to a user with meaningful content rather than manufactured titles.

sergiocontrino commented 7 months ago

Changes, version 6-7

Last edited: 21/09/2023

Addition of country and location attributes

For studies, the country or countries where participants were recruited is now included. For a few sources, chiefly the EU CTR, the status of the study in that country (e.g., ‘ongoing’ or ‘completed’) may also be given.

In addition, where the data exists in the source material - for the moment only within ClinicalTrials.gov data and ISRCTN - the clinical sites for the study are also listed, including the city and country of the site and the status as of the most recent data harvesting.

Internally within the system integer Geonames ids are used for countries and cities (see https://geocode.xyz/). For display and within the schema the city and country names are also included. Facilities listed within locations are ROR coded wherever possible.

Changes for topic records

For ‘topic’ records – both study_topics and object_topics – the original controlled terminology (CT) code and controlled terminology code have been restored to the schema (these were never removed from the data). In most cases the CT will be MESH (code = 14) but in some cases MedDRA and ICD codes, and very occasionally a few other CTs, are used. Returning these datapoints to the schema simply allows them to be displayed if and when required.

Inclusion of study start time

The year and month the study started have always been part of the data extracted, where they exist in the source material. This information has now been added to the schema and so is now available for export to the MDR UI and other systems.

Clarification of Study and Object Contributors

Within the MDR databases there has always been a clear distinction between study contributors (study leads, sponsors, funders, etc.) and object contributors (chiefly authors of papers). They are extracted and stored separately. When exporting the data objects as JSON files, however, this situation has been muddled. Study contributors were not exported at all, and instead were ‘given’ to linked data objects, as contributors to the generation of the object. In particular, study organisational contributors were given to all data objects, including journal papers (that have their own authors), and both organisational and individual study contributors were given to non-article data objects, which in almost all cases do not have any contributors specified in the source data. This was not unreasonable - study contributors do contribute, indirectly, to the generation of all data objects, and it allowed the object data to more easily match the expectations of the DataCite schema - but it is now beginning to cause some issues. There has therefore been a change, to a simpler organisation of the exported data that more accurately reflects the data sources. The reasons for the change include:

- With the advent of the RMS, and perhaps other mechanisms for capturing object metadata directly, there is an opportunity to identify the real contributors to any particular object, rather than assume they were created in some way by the whole study management team.
- At a time when there is increased interest in attributing data generation and rewarding its re-use, capturing accurate data on object authorship, beyond paper authorship, is increasingly important.
- It is confusing for potential external users of the data, who, from previously published schema descriptions, were unable to see that separate study contributor data existed. Without a good understanding of how the object contributor data was constructed the risk was that they would misinterpret it.
- There is not yet any significant use of this data for searching or filtering purposes. This re-organisation needs to take place now rather than after filtering / searching systems are constructed that use contributor data. Having separate study and object contributor data, rather than having it all in one place, may make such mechanisms a little more complex, but it makes the data itself much more accurate.

The contributor data is therefore now exported as it is stored in the MDR database - study contributors are part of the study record and exported within it, while only direct object contributors - almost entirely the authors of journal articles - are included within the object data. Please note that it is still possible, if desired or required by an external schema, to export the data with the study contributors added to the study’s objects. It is just no longer the default.
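
The default export behaviour, and the optional mode in which study contributors are also added to the study's objects, might be sketched like this. The function, the flag name and the dictionary layout are all assumptions for illustration; the actual MDR export code and JSON structure may differ.

```python
import copy


def export_records(study, objects, add_study_contributors_to_objects=False):
    """Return (study, objects) copies ready for export.

    By default, study contributors stay in the study record only and each
    object keeps just its own direct contributors. With the flag set, the
    study contributors are additionally appended to every linked object.
    """
    study_out = copy.deepcopy(study)
    objects_out = []
    for obj in objects:
        obj_out = copy.deepcopy(obj)
        if add_study_contributors_to_objects:
            own = obj_out.get("contributors", [])
            inherited = copy.deepcopy(study.get("contributors", []))
            obj_out["contributors"] = own + inherited
        objects_out.append(obj_out)
    return study_out, objects_out
```

Deep copies are used so that the optional mode never alters the stored records, which keeps the separately stored study and object contributor data intact.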

Note also that study topic data is both retained within the study information, to allow studies to be filtered using topic keywords, and still added to object data in exported JSON files, for objects other than journal papers, to allow potential searches like ‘find all datasets / or protocols on topic X’ to take place more directly.
