dblana opened this issue 3 years ago
It might be obvious, but ideally the spatial resolution of the data should be datazones, not larger. Postcodes would work as well, as we can aggregate postcode-level data up to datazones by linking each postcode to its respective datazone. Any spatial resolution coarser than datazones means we would have to aggregate the Social Distancing Score for Grampian (SDS-G) itself to that coarser resolution.
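As a minimal sketch of the postcode-to-datazone aggregation described above (in Python/pandas rather than the project's R; the postcode, datazone codes and column names here are illustrative only, not real data):

```python
import pandas as pd

# Hypothetical postcode-to-datazone lookup (in practice this would come from
# the official postcode directory) -- codes below are made up for illustration
lookup = pd.DataFrame({
    "postcode": ["AB10 1AA", "AB10 1AB", "AB11 5QN"],
    "datazone": ["S01000001", "S01000001", "S01000002"],
})

# Hypothetical postcode-level counts
cases = pd.DataFrame({
    "postcode": ["AB10 1AA", "AB10 1AB", "AB11 5QN"],
    "positive_tests": [3, 2, 5],
})

# Attach each postcode's datazone, then sum counts within datazones
by_datazone = (
    cases.merge(lookup, on="postcode", how="left")
         .groupby("datazone", as_index=False)["positive_tests"]
         .sum()
)
```

The reverse (datazone to postcode) is not possible without extra assumptions, which is why anything coarser than datazones forces us to aggregate the SDS-G instead.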
The spatial data should reflect the residency of the people infected with Covid-19.
From the questions as they stand:
1. Testing
2. Mortality
@dblana - do you know if the NRS death certificate data provides a date of death as well as a date of registration of death? There is sometimes a lag between the two.
Yes, the date of death should be provided too in the NRS data. I don't know about the location of the testing centre... We'll need to ask @gosler01 - at the moment I have data mostly on positive tests conducted at the hospital. But we do have permission to access all tests (positive and negative, in the hospital and the community). And yes, we will be able to link deaths to test data.
Besides testing and mortality data, we also need Jess's Grampian Social Distancing Score (SDS-G) data. At the moment, @JessButler doesn't save any data into output csv files (her code is here), but before I ask her to do that (or even better: fork --> branch --> add the code we want myself --> pull request), we need to make sure we know what we need.
The dataframe "grampian_index" contains most of the information we want (datazone, population, SIMD, the three measures that make up the SDS-G, the SDS-G itself and deciles). It doesn't contain any spatial information about the datazones (i.e. where they are with respect to each other) - that's in aberdeen_sf, aberdeenshire_sf and moray_sf, from the Grampian datazone shapefiles.
What format does the spatial info need to be in for the stats analysis (#9) @BScheliga @Zeiou?
After our discussion with Graham on Monday about the data that's available, one of the things that struck me was that we may have access to 3 COVID test modalities:

1. PCR tests
2. Antibody tests
3. Lateral flow tests
I would argue that we need to focus on just one of those to avoid comparing apples & oranges. I propose that we look only at the PCR tests. Firstly, PCR has been the most commonly used method throughout the pandemic and is the standard method of confirmation for symptomatic cases. Antibody testing would be interesting, but I don't think it has been used in the same way, and its availability is quite limited. Likewise, lateral flow tests have been applied under specific circumstances, mostly as repeat tests for workers. They also have some quality issues, with potential for false positives and false negatives, which means positive lateral flow results are generally followed up by confirmatory PCR tests.
Any thoughts?
@dblana The spatial information we need is like this figure. It requires knowing each area's borders. In this figure, area B1 shares a common border with B2 and B3. For the areal unit method, if two areas share a common border the entry is 1, otherwise it is 0. We will need to create an n*n matrix W, with (i,j)th entry wij denoting the spatial closeness between regions Bi and Bj. In this example, w12 = w13 = 1 (and by symmetry w21 = w31 = 1), and the remaining entries are 0, since those pairs of areas don't share a border.
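To make the B1/B2/B3 example concrete, here is a minimal sketch of that binary contiguity matrix W (in Python/NumPy rather than the project's R, purely for illustration), with the diagonal set to 0 since an area is not its own neighbour:

```python
import numpy as np

# Areas B1, B2, B3 from the figure: B1 borders B2 and B3, while B2 and B3
# do not border each other. wij = 1 if areas i and j share a border, else 0,
# and wii = 0 (an area cannot be a neighbour of itself).
W = np.array([
    [0, 1, 1],  # B1: neighbours B2 and B3
    [1, 0, 0],  # B2: neighbours B1 only
    [1, 0, 0],  # B3: neighbours B1 only
])

# Two sanity checks that hold for any binary contiguity matrix:
assert (W == W.T).all()          # sharing a border is symmetric
assert (np.diag(W) == 0).all()   # no self-neighbours
```

In practice this matrix is not typed by hand; it is derived from the datazone polygons (e.g. with spdep's `poly2nb` in R, as in the snippet further down).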
@Zeiou thank you so much, this is such a great explanation! We probably need to add this as a task (issue) then: calculate this matrix based on the data from @JessButler
@will-ball I agree re: tests. Let's only look at PCR.
@dblana I forgot to say that since wij in the W matrix denotes the spatial closeness between regions Bi and Bj, we always set wii = 0, as an area cannot be a neighbour of itself.
Here is the code for the shared-border method to create the W matrix for a model:

```r
library(spdep)  # provides poly2nb() and nb2listw()

# Build the neighbours list from the polygons (areas sharing a border),
# then convert it to a binary ("B") spatial weights list
W.nb <- poly2nb(sp.dat, row.names = rownames(sp.dat@data))
W.list <- nb2listw(W.nb, style = "B")
```

sp.dat@data is the data you use when you plot the map using leaflet. It is also the result of merging the .shp file with the data file. I hope it helps.
Thanks @Zeiou! I don't completely follow what sp.dat@data is, but that will hopefully become clear when we start writing the code. I'm hoping to start that as soon as we decide on the folder structure (see #16)!
I hope this quick and rough introduction to geographic information system (GIS) files, or more specifically the ESRI shapefile format (.shp), helps a bit. But first, two definitions: 1) in a GIS context, a feature can be a polygon, line, or point; and 2) an attribute is information linked to a feature.
E.g. in our context: geometry of a datazone = feature; name or SDS-G of a datazone = attributes.
"A Shapefile consists minimally of a main file, an index file, and a dBASE table. In the main file, the geometry for a feature is stored as a shape comprising a set of vector coordinates. This main file is a direct access, variable-record-length file in which each record describes a shape with a list of its vertices. In the index file, each record contains the offset of the corresponding main file record from the beginning of the main file. Attributes are held in a dBASE format file. The dBASE table contains feature attributes with one record per feature. Attribute records in the dBASE file must be in the same order as records in the main file. Each attribute record has a one-to-one relationship with the associated shape record." [1]
.shp -- Main file (mandatory); a direct access, variable-record-length file in which each record describes a shape with a list of its vertices. [1]
.shx -- Index file (mandatory). In the index file, each record contains the offset of the corresponding main file record from the beginning of the main file. The index file (.shx) contains a 100-byte header followed by 8-byte, fixed-length records.[1]
.dbf -- dBASE Table file (mandatory); a constrained form of DBF that contains feature attributes with one record per feature. The one-to-one relationship between geometry and attributes is based on record number. Attribute records in the dBASE file must be in the same order as records in the main file. [1]
The files above are all part of SG_DataZoneBdry_2011.zip; we just load them all from the zip file. The dBASE table file holds the attribute table, which links the datazone names to each feature (datazone).
One important thing here is: "Each attribute record has a one-to-one relationship with the associated shape record." In other words, we need a wide table with our data (Datazone_name, SDS-G etc.), where each variable has its own column. Our data table can't have more rows than we have datazones. Once our data is formatted accordingly, we can join it to the attribute table of the datazone shapefile using the datazone names as the unique identifier.
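As a minimal sketch of that one-to-one join (in Python/pandas for illustration; in the project this would be done against the shapefile's attribute table in R, and the codes and column names below are hypothetical):

```python
import pandas as pd

# Stand-in for the shapefile's attribute table: one record per feature (datazone)
attributes = pd.DataFrame({
    "DataZone": ["S01000001", "S01000002"],
    "Name": ["Zone A", "Zone B"],
})

# Our wide data table: one row per datazone, one column per variable
sds = pd.DataFrame({
    "DataZone": ["S01000001", "S01000002"],
    "SDS_G": [0.42, 0.87],
})

# Join on the datazone identifier; validate="one_to_one" raises an error if
# either table has duplicate datazones, which would break the 1:1 link
# between attribute records and shape records
joined = attributes.merge(sds, on="DataZone", how="left", validate="one_to_one")
```

The `validate` check is the code-level expression of the constraint quoted above: the data table must not have more rows than there are datazones.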
I hope that makes sense and answers the question.
Also, the one-to-one relationship means you cannot plot more than one variable per map (unless you use bivariate choropleth maps ...)
@dblana in @Zeiou's case, "sp.dat@data" refers to the attribute table (the dBASE table) of the shapefile ("sp.dat"); a single variable column in that table is normally called a "field". If I am not mistaken.
Wow, that is not so short and brief as I would have hoped.
Reference: [1] https://www.loc.gov/preservation/digital/formats/fdd/fdd000280.shtml
What data do we need to answer our research questions (#8)? We have permission from the North Node Privacy Advisory Committee to access data from a number of sources (e.g. TrackCare, Scottish Morbidity Record (SMR), National Records of Scotland (NRS) Deaths) as part of a COVID-19 modelling project funded by NHS Grampian Endowments.