The Overview page currently fetches data only from the latest resource on the latest endpoint. Ideally it should consider all active resources.
Hiccup 1
Datasets with multiple endpoints tend to provide their data in one of two ways:
Each new endpoint is an updated version of an older endpoint, so the most recent endpoint should hold the data for every entity.
Endpoints supply data independently of one another. For example, endpoint 1 has the first 50 entities, while endpoint 2 has entities 51 to 100.
This is problematic because in the first situation we don't care about older endpoints, but in the second situation we do.
How do we solve this
We should only show the most recent outstanding issues for any entity.
How do we implement this
Currently the database has no concept of a 'most recent outstanding issue', so we need to work this out ourselves.
** Can we ask infra to add a resource date to each issue, or even a date to the dataset.resource table?
To do this, we should first get a list of the most recent issues for each entity, by either...
Querying the database directly using something like this
This seems pretty good on the face of it, though it has some flaws that would need to be addressed:
It doesn't correctly pick up issues from entries that never became entities.
It might struggle to execute when there are lots of issues. (We could potentially ask infra to add resources and resource_organisation to the dataset databases so we can do the joins there.)
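The query referred to above is not reproduced in this note. As a rough illustration only, a "latest issue per entity and field" query could look something like the sketch below. The table and column names (`issue`, `resource`, `start_date`, `issue_type`) are assumptions made for the example, not the real schema, and it assumes both tables live in the same database, which is not the case today. It shares both flaws listed above: issues with no entity are skipped, and the correlated subquery may be slow on large issue tables.

```python
import sqlite3

# In-memory database with a hypothetical, simplified schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE resource (resource TEXT PRIMARY KEY, start_date TEXT);
CREATE TABLE issue (entity INTEGER, field TEXT, issue_type TEXT, resource TEXT);
INSERT INTO resource VALUES ('res-old', '2024-01-01'), ('res-new', '2024-06-01');
INSERT INTO issue VALUES
  (1, 'start-date', 'invalid date', 'res-old'),
  (1, 'start-date', 'invalid date', 'res-new'),
  (2, 'geometry', 'invalid geometry', 'res-old');
""")

# For each (entity, field), keep only the issues reported by the most
# recent resource that has any issue for that pair.
LATEST_ISSUES_SQL = """
SELECT i.entity, i.field, i.issue_type, i.resource, r.start_date
FROM issue i
JOIN resource r ON r.resource = i.resource
WHERE i.entity IS NOT NULL
  AND r.start_date = (
    SELECT MAX(r2.start_date)
    FROM issue i2
    JOIN resource r2 ON r2.resource = i2.resource
    WHERE i2.entity = i.entity AND i2.field = i.field
  )
ORDER BY i.entity, i.field
"""

latest_issues = conn.execute(LATEST_ISSUES_SQL).fetchall()
```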
Or we could instead do the joins manually in our code:
Fetch all issues for each active resource, one by one, into separate lists.
In our code, merge these lists, removing any duplicates but prioritizing those from more recent resources.
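A minimal sketch of the manual merge, assuming each active resource's issues have already been fetched along with that resource's start date. The `Issue` shape, the `merge_latest_issues` name and the choice to treat (entity, field, issue type) as the duplicate key are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    entity: int
    field: str
    issue_type: str
    resource: str


def merge_latest_issues(issues_by_resource):
    """Merge per-resource issue lists, preferring more recent resources.

    issues_by_resource: list of (resource_start_date, [Issue, ...]) pairs,
    with dates as ISO strings so they sort chronologically.
    """
    merged = {}
    # Walk resources oldest first so later resources overwrite earlier ones.
    for _start_date, issues in sorted(issues_by_resource, key=lambda pair: pair[0]):
        for issue in issues:
            merged[(issue.entity, issue.field, issue.issue_type)] = issue
    return merged


old = [
    Issue(1, "start-date", "invalid date", "res-old"),
    Issue(2, "geometry", "invalid geometry", "res-old"),
]
new = [Issue(1, "start-date", "invalid date", "res-new")]
merged = merge_latest_issues([("2024-06-01", new), ("2024-01-01", old)])
```

Here entity 1's issue is taken from `res-new`, while entity 2's issue survives from `res-old` because no newer resource reports it.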
This gets us the most recent issues, but some issues that have since been fixed will remain, so we need to filter them out:
For each of the most recent issues (by entity and field) that we now have:
If the issue comes from the most recent resource, we know it's still an issue.
Otherwise, check whether a fact for that entity and field was supplied after the issue. If there was, we know the issue has been fixed and it can be discarded.
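The filtering rule above can be sketched as follows. This assumes we can look up, for each entity and field, the date of the latest fact supplied. The dictionary shapes and the `filter_outstanding` name are hypothetical.

```python
def filter_outstanding(issues, latest_resource, latest_fact_dates):
    """Drop issues that a later fact shows to have been fixed.

    issues: dicts with 'entity', 'field', 'resource' and 'resource_date' keys.
    latest_fact_dates: maps (entity, field) to the ISO date of the newest fact.
    """
    outstanding = []
    for issue in issues:
        # Issues on the most recent resource are always still outstanding.
        if issue["resource"] == latest_resource:
            outstanding.append(issue)
            continue
        fact_date = latest_fact_dates.get((issue["entity"], issue["field"]))
        # A fact supplied after the issue means the value has been fixed.
        if fact_date is not None and fact_date > issue["resource_date"]:
            continue
        outstanding.append(issue)
    return outstanding


issues = [
    {"entity": 1, "field": "start-date", "resource": "res-new", "resource_date": "2024-06-01"},
    {"entity": 2, "field": "geometry", "resource": "res-old", "resource_date": "2024-01-01"},
    {"entity": 3, "field": "name", "resource": "res-old", "resource_date": "2024-01-01"},
]
latest_fact_dates = {(2, "geometry"): "2024-06-01", (3, "name"): "2024-01-01"}
remaining = filter_outstanding(issues, "res-new", latest_fact_dates)
```

In this example entity 2's issue is discarded because a newer fact exists for that field, while entities 1 and 3 are kept.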
Final hurdle
Sometimes we generate issues from entries that don't make it to entities, for example because no reference was provided. In these cases there is no way to know whether the issue has been resolved in a later resource.
This is going to need some more thought. We can't simply show all of these issues regardless of which endpoint they were provided on: even if the provider fixes the problem in a more recent endpoint, the issue will still exist in the old endpoint, and our platform would display it.
I suggest that, for now, we only generate tasks from these issues when they are present in the most recent endpoint.
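As a sketch of that interim rule, assuming each issue records the endpoint it came from and has no entity when the entry never became one (the field names and the `issues_to_tasks` helper are hypothetical):

```python
def issues_to_tasks(issues, latest_endpoint):
    """Keep entity issues as they are, but only keep non-entity issues
    that appear on the most recent endpoint."""
    tasks = []
    for issue in issues:
        is_non_entity = issue.get("entity") is None
        if is_non_entity and issue["endpoint"] != latest_endpoint:
            # Cannot tell whether this was fixed later, so skip it for now.
            continue
        tasks.append(issue)
    return tasks


issues = [
    {"entity": None, "issue_type": "missing reference", "endpoint": "ep-old"},
    {"entity": None, "issue_type": "missing reference", "endpoint": "ep-new"},
    {"entity": 7, "issue_type": "invalid date", "endpoint": "ep-old"},
]
tasks = issues_to_tasks(issues, latest_endpoint="ep-new")
```

Entity issues are passed through untouched here on the assumption that the earlier filtering steps have already dealt with them.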
Future suggestion (to put on the back burner): if such an issue exists in an older endpoint, we could highlight it as a 'potential issue/task'. This wouldn't change the status of the dataset from 'live' to 'needs fixing', but it might still be displayed somewhere on the site. (However, this would require some design work.)
Documenting this
Once work is about to start, we should add an entry to our system design decision log.