The DataStream data downloaded in the Export has wrong JSON format and missing data

bardram commented 1 week ago

When exporting data, in the data-streams.json file I wold expect to get correctly formatted JSON according to the CARP Core domain model. But - alas - there are issues.

Here is an example of some JSON exported:

data-streams.json

```json { "id": 4961, "data_stream_id": 631, "snapshot": { "syncPoint": { "synchronizedOn": "1970-01-01T00:00:00Z", "relativeClockSpeed": 1.0, "sensorTimestampAtSyncPoint": 0 }, "triggerIds": [ 4 ], "measurements": [ { "data": { "__type": "dk.cachet.carp.freememory", "freeVirtualMemory": 2999164928, "freePhysicalMemory": 164835328 }, "sensorStartTime": 1729864247872632 }, { "data": { "__type": "dk.cachet.carp.freememory", "freeVirtualMemory": 2952347648, "freePhysicalMemory": 117329920 }, "sensorStartTime": 1729864307857247 }, { "data": { "__type": "dk.cachet.carp.freememory", "freeVirtualMemory": 2957910016, "freePhysicalMemory": 124141568 }, "sensorStartTime": 1729864367870355 }, { "data": { "__type": "dk.cachet.carp.freememory", "freeVirtualMemory": 2950742016, "freePhysicalMemory": 113905664 }, "sensorStartTime": 1729864427889803 } ] }, "first_sequence_id": 29, "last_sequence_id": 32, "created_by": "20e2093c-e510-4e79-9385-478a19dc4723", "created_at": "2024-10-25T13:54:43.080504Z", "updated_by": "20e2093c-e510-4e79-9385-478a19dc4723", "updated_at": "2024-10-25T13:54:43.081204Z" }, { "id": 4605, "data_stream_id": 655, "snapshot": { "syncPoint": { "synchronizedOn": "1970-01-01T00:00:00Z", "relativeClockSpeed": 1.0, "sensorTimestampAtSyncPoint": 0 }, "triggerIds": [ 9 ], "measurements": [ { "data": { "__type": "dk.cachet.carp.triggeredtask", "control": "Start", "taskName": "Task #17", "triggerId": 9, "destinationDeviceRoleName": "Polar HR Sensor" }, "sensorStartTime": 1729776088592930 } ] }, "first_sequence_id": 1, "last_sequence_id": 1, "created_by": "20e2093c-e510-4e79-9385-478a19dc4723", "created_at": "2024-10-24T13:31:24.977530Z", "updated_by": "20e2093c-e510-4e79-9385-478a19dc4723", "updated_at": "2024-10-24T13:31:24.979868Z" }, ... ```

Compared to the CARP Core Domain data model there are several issues with this JSON:

The JSON uses a mixture of camelCase and snake_case serialization. Default JSON in CARP Core is always camelCase. For example, first_sequence_id should be firstSequenceId (see DataStreamSequence).
The data_stream_id (which should be dataStreamId) is a JSON object (and not a number). See DataStreamId for details.
The JSON contains data that is not part of the CARP Core domain model. For example id, last_sequence_id, created_by, etc.
The snapshot is not part of the DataStreamSequence domain model. However, the SyncPoint is but is embedded in this weird "snapshot" thing.

bardram commented 1 week ago

@yuanchen233 - I think the main problem here is that data is exported from the database directly rather than from the DataStreamService. We have talked about this before - the approach of bypassing the domain logic in CAPR Core, and I really do not like this. It goes against everything that DDD recommends.

To quote Working_with_models:

In domain-driven design, an object's creation is often separated from the object itself.

A repository, for instance, is an object with methods for retrieving domain objects from a data store (e.g. a database). Similarly, a factory is an object with methods for directly creating domain objects.

I am sure @Whathecode will agree ;-)

Whathecode commented 1 week ago

Yup. The endpoints on that service are severely lacking, however, but this should be a prompt to add/work on those.

yuanchen233 commented 1 week ago

The JSON uses a mixture of camelCase and snake_case serialization. Default JSON in CARP Core is always camelCase. For example, first_sequence_id should be firstSequenceId (see DataStreamSequence).

This is caused by JsonNaming is set to SnakeCaseStrategy, and all column generated from Auditable is using the snake naming schema. It is unsure to me if those extra field is still needed.

The data_stream_id (which should be dataStreamId) is a JSON object (and not a number). See DataStreamId for details.

The JSON contains data that is not part of the CARP Core domain model. For example id, last_sequence_id, created_by, etc.

The snapshot is not part of the DataStreamSequence domain model. However, the SyncPoint is but is embedded in this weird "snapshot" thing.

This is cuased by webservices' wired design of dataStream database table/entity, we have this 3 table and the notion of snapshot inherited from old Java project, and also plan to change it at #137

I think the main problem here is that data is exported from the database directly rather than from the DataStreamService. We have talked about this before - the approach of bypassing the domain logic in CAPR Core, and I really do not like this. It goes against everything that DDD recommends.

Agreed. Currently, the export endpoint ( the export button on the portal ) performs a database dump. It does not utilize the API of each ApplicationService to retrieve data, and this should be changed. This change was proposed but postponed due to the need to maintain a consistent format for the exported data of active studies.

Yup. The endpoints on that service are severely lacking, however, but this should be a prompt to add/work on those.

Most of the issues are caused by the current design of database, given that they works, retriving data through core endpoints like DataStreamServicesRequest.GetDataStream should work properly without exposing those extra information, despite the fact that the database do need some re-design.

In short, I think there are a few things we could/should do, some of them might be off topic here:

[ ] Use core dataStream endpoint to retrieve data, not export
[ ] Export feature in webservice should also use core endpoint to retrieve data, instead of a database dump
[ ] Re-design dataStream database, stop using this added snapshot
[ ] Revisit Auditable and decided if it's still needed for the current implementation of webservices
[ ] At somepoint, a re-design of all database is also desired. Currently all data is saved as snapshots which is created using core Json serializer. This is not very efficient and does not utilize any advantage a database provide, and the process time is way too long when the number of instances go up. (e.g. ReruitmentSnapshot keeps a list of all participantGroup, when we have 20000 participantGroups and a new participantGroup is created, it takes huge amount of time to build the Recruitment object, updata it and convert it back to Json and save it)

If we decide to re-design the database later, some word of suggestion from @Whathecode would be very helpful and much appreciated.

yuanchen233 commented 4 days ago

Current webservices' dataStream database implementation: DataStream_UML_Final

cph-cachet / carp-webservices-spring

The DataStream data downloaded in the Export has wrong JSON format and missing data #178