[PR] Puerto Rico Department of Health testing data API

sacundim commented 4 years ago

State or US: Puerto Rico

Just saw an informal announcement from a biostatistician collaborating with Puerto Rico's Department of Health that they now have an API for downloading test data:

Tweet with the info: https://twitter.com/rafalab/status/1284292056697929730
API: https://bioportal.salud.gov.pr/api/administration/reports/minimal-info-unique-tests

Note that when I hit the latter URL with a regular browser I get an authorization failure message, but when I wget from it I get a 48.78M JSON data file:

$ wget https://bioportal.salud.gov.pr/api/administration/reports/minimal-info-unique-tests
--2020-07-17 18:20:34--  https://bioportal.salud.gov.pr/api/administration/reports/minimal-info-unique-tests
Resolving bioportal.salud.gov.pr (bioportal.salud.gov.pr)... 54.210.0.6
Connecting to bioportal.salud.gov.pr (bioportal.salud.gov.pr)|54.210.0.6|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/json]
Saving to: ‘minimal-info-unique-tests’

minimal-info-unique-t     [                    <=> ]  48.78M  4.06MB/s    in 13s     

2020-07-17 18:21:46 (3.79 MB/s) - ‘minimal-info-unique-tests’ saved [51154578]

$ cat minimal-info-unique-tests |jq |head -n 30
[
  {
    "collectedDate": "7/2/2020",
    "reportedDate": "7/3/2020",
    "ageRange": "60 to 69",
    "testType": "Molecular",
    "result": "Negative",
    "patientCity": "Ceiba",
    "createdAt": "07/07/2020 13:34"
  },
  {
    "collectedDate": "6/24/2020",
    "reportedDate": "6/25/2020",
    "ageRange": "60 to 69",
    "testType": "Molecular",
    "result": "Negative",
    "patientCity": "Bayamón",
    "createdAt": "06/26/2020 11:26"
  },
  {
    "collectedDate": "6/10/2020",
    "reportedDate": "6/16/2020",
    "ageRange": "50 to 59",
    "testType": "Molecular",
    "result": "Negative",
    "patientCity": "San Sebastián",
    "createdAt": "06/16/2020 17:07"
  },
  {
    "collectedDate": "5/26/2020",

The file is a single big JSON list with individual records for each test. I saw an earlier version of this data the other day (it was shared as an Excel file) and there are some data quality issues. Summary of the ones I'm aware of:

The result field is not normalized. I just treat the ones that match Positive anywhere inside as positive, and the rest as negative.
A couple thousand records (out of 275K+) are missing some of the date fields.
Some records have individually nonsensical date values, like dates from 1937.
A couple thousand records have collectedDate > reportedDate.

Here are the cleanups I ended up doing on it if it's of any help:

https://github.com/sacundim/covid-19-puerto-rico/blob/master/postgres/020-load-data.sql#L34
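For anyone who wants to try the same kind of cleanup outside of SQL, here is a minimal Python sketch of the rules listed above. It is not the linked SQL, just an illustration. The names `clean_record`, `parse_date` and `EARLIEST_PLAUSIBLE` are made up for this example, the 2020-01-01 cutoff is an assumption, and matching "Positive" case-insensitively is also an assumption.

```python
from datetime import date, datetime

# Assumption: anything dated before 2020 is a data entry error.
EARLIEST_PLAUSIBLE = date(2020, 1, 1)


def parse_date(value):
    """Parse an M/D/YYYY string; return None if missing or malformed."""
    if not value:
        return None
    try:
        return datetime.strptime(value, "%m/%d/%Y").date()
    except ValueError:
        return None


def clean_record(record, today=None):
    """Apply the rough cleanup rules from this comment to one test record."""
    today = today or date.today()
    collected = parse_date(record.get("collectedDate"))
    reported = parse_date(record.get("reportedDate"))

    def plausible(d):
        return d is not None and EARLIEST_PLAUSIBLE <= d <= today

    return {
        # Anything containing "positive" counts as positive, the rest as negative.
        "positive": "positive" in (record.get("result") or "").lower(),
        # Missing or nonsensical dates (e.g. 1937) become None.
        "collected": collected if plausible(collected) else None,
        "reported": reported if plausible(reported) else None,
        # Flag records where the sample was "collected" after it was "reported".
        "dates_out_of_order": (
            plausible(collected) and plausible(reported) and collected > reported
        ),
    }
```

Note that under this rule values such as "Presumptive Positive" count as positive and values such as "Not Tested" count as negative, so the resulting counts are only an approximation.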

muamichali commented 4 years ago

Wow, @sacundim! Thank you so much for this and all the helpful info you provide us about PR. We are going to follow this closely.

Nosferican commented 4 years ago

Agreed. I watched the governor's press conference yesterday, which included some of the work by @rafalab et al. Thanks for sharing the API announcement @sacundim!

Nosferican commented 4 years ago

It seems the license of the data falls under the EULA:

The data owned by the PRDoH's web sites and/or services may not be downloaded, republished, resold, duplicated, or "web scraped", in whole or in part, for any purpose other than the personal use permitted in these Terms of Use.

Which basically says it can only be used for personal use. Hopefully we can get the all clear from the copyright holders to incorporate the API in the project (ref: https://twitter.com/Nosferican/status/1284323770694619136).

space-buzzer commented 4 years ago

All the tests are tagged as Molecular, so I can't understand the result field:

Negative
Positive 2019-nCoV
COVID-19 Negative
COVID-19 Positive
Positive
Inconclusive
Invalid
Other
Not Valid
Not Tested
Presumptive Positive (does this mean that the test is inconclusive, or is it "presumptive positive" by the CDC's definition?)
Positive IgM Only
Positive IgM and IgG

So the "Molecular" test type does not mean PCR?

sacundim commented 4 years ago

I'm afraid I haven't seen any real documentation, @space-buzzer. The number of records in the spreadsheet I saw earlier this week was about 275K, which is consistent with all tests being PCR based on other information.

The approach I took, personally, is that the strange values like "Positive IgM Only" and "Positive IgM and IgG" are very infrequent so they don't make a big dent on the counts, which are only going to be an approximation anyway due to issues with date data. I personally assumed that they're positive PCR results.
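A quick way to check how rare those odd values really are is to tally the raw result strings. This is just a sketch: `load_records` and `tally_results` are hypothetical helper names, and the file name is the one wget saved earlier in this thread.

```python
import json
from collections import Counter


def load_records(path):
    """Load the downloaded file, which is a single JSON list of test records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def tally_results(records):
    """Count how often each raw `result` string occurs."""
    return Counter(r.get("result") for r in records)
```

Something like `tally_results(load_records("minimal-info-unique-tests")).most_common()` then lists the values from most to least frequent.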

Some context (that I don't fully understand): the data are extracted from the PRDoH's website ("Bio Portal"), which has a web form for labs to report test results. So I think these are probably a mix of data entry errors by users and bugs in that web UI that allowed weird values to be entered.

sacundim commented 4 years ago

Oops, I hit "Close issue" by accident.

sacundim commented 4 years ago

@Nosferican Do you have a link to that EULA?

Nosferican commented 4 years ago

@Nosferican Do you have a link to that EULA?

It's under the terms of service agreement when signing up for an account at the BioPortal.

sacundim commented 4 years ago

I am not a lawyer nor a professional translator, but I'll take a stab at this specific section:

Intellectual property

Most of the content of our web sites is public domain and doesn't include copyright or other intellectual property statements.

The government's information is normally public domain. Public domain information can be distributed and copied freely. However, you're authorized to use our contractor's computer software and the related database only for informational ends and not for direct or indirect commercial purposes. Also, other materials in the web sites and/or services that is not government material cannot [sic] be copyright protected and cannot be used for any direct or indirect commercial purpose. Any use not authorized in this document is forbidden. Any copy made of materials must conserve all copyrights and other warnings. Except as explicitly provided in these Terms of Use, you may not reproduce, modify, publish, transmit, show, produce, distribute, disseminate, disseminate [sic] or transmit any materials to third parties, nor participate in the transfer or sale, create derived works or in any manner exploit the contents of the web sites and/or services of the PRDoH or in any part of these without prior written consent by PRDoH.

I've put "[sic]" on some bits that I rendered literally but understand to be flat-out wrong. In particular, the part that says that "other materials in the web sites and/or services that is not government material cannot [sic] be copyright protected and cannot be used for any direct or indirect commercial purpose" is just contradictory, and I think it is intended to read "may."

I think one important piece of context is that this Bioportal web site is normally used as a web UI for:

Doctors to order tests from the PRDoH labs and view their results.
Private labs to submit their test results to PRDoH.

And they've apparently added a public API endpoint to publish the testing data that the system contains.

Original Spanish:

Propiedad intelectual

La mayor parte del contenido de nuestros sitios web es de dominio público y no incluye avisos de derechos de autor u otros avisos de propiedad intelectual.

La información del gobierno normalmente es de dominio público. La información de dominio público se puede distribuir y copiar libremente. Sin embargo, está autorizado a utilizar el software de la computadora de nuestro contratista y la base de datos relacionada solo con fines informativos y no con fines comerciales directos o indirectos. Además, otro material en los sitios web y / o servicios que no sea material del gobierno no puede estar protegido por derechos de autor por entidades privadas y no puede usarse para ningún propósito comercial directo o indirecto. Cualquier uso no autorizado en este documento está prohibido. Cualquier copia que haga de los materiales debe conservar todos los derechos de autor y otros avisos. Salvo lo dispuesto expresamente en estos Términos de Uso, no puede reproducir, modificar, publicar, transmitir, mostrar, realizar, distribuir, difundir, difundir o transmitir ningún material a terceros, ni participar en la transferencia o venta de, crear trabajos derivados o de cualquier manera explotar el contenido de los sitios web y / o servicios del DSPR o cualquier parte de estos sin el previo consentimiento por escrito del DSPR.

Nosferican commented 4 years ago

In this case, a simple written notice from the copyright holder / publisher saying the generic catch-all EULA doesn't apply to the dataset and that it can be used freely should suffice.

The generic boilerplate seems to suggest the API may not be that well maintained. However, if we can get a contact person / data steward for basic questions such as the ones @space-buzzer mentioned, that would be great. The data curation can occur upstream but that still requires some basic data lineage / governance information. @rafalab might be able to help with making that initial contact with the dept.

sacundim commented 4 years ago

Observations from playing a little bit with the API:

The createdAt field might be more reliable than reportedDate. With two downloads that I've done about 8 hours apart on 2020-07-17 and 2020-07-18, I'm noting some differences in aggregates for createdDate and reportedDate between 2020-07-09 and 2020-07-16 (inclusive). It looks like there is some lag, such that data with reportedDate in this range was still being added to the data set on the night of the 17th to the 18th.
The timezone of the createdAt field seems to be UTC.
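For reference, the comparison between two downloads can be sketched roughly like this. It is a simplified illustration, not the exact queries I ran: `counts_by` and `changed_counts` are hypothetical names, and the inputs are assumed to be the already-parsed JSON lists from each download.

```python
from collections import Counter


def counts_by(records, field):
    """Count records per distinct value of `field` (e.g. per reportedDate)."""
    return Counter(r.get(field) for r in records)


def changed_counts(old_records, new_records, field="reportedDate"):
    """Return {value: (old_count, new_count)} for values whose counts differ."""
    old = counts_by(old_records, field)
    new = counts_by(new_records, field)
    # key=str lets missing values (None) sort alongside date strings.
    keys = sorted(set(old) | set(new), key=str)
    return {
        k: (old.get(k, 0), new.get(k, 0))
        for k in keys
        if old.get(k, 0) != new.get(k, 0)
    }
```

Any date that shows up in the output with a higher count in the newer download is one that was still receiving records between the two snapshots.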

edusoccer1121 commented 4 years ago

Hi @Nosferican,

@gabynevada can help clarify some of the provided data; he is the lead maintainer of the public API.

Good day.

space-buzzer commented 4 years ago

I set a cron job to download it daily, but nothing will happen until we figure out:

Licensing issues
What kind of tests are we talking about?
Consistent data sanitizing strategies we can take
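For illustration, a daily snapshot job along these lines could be sketched as below. This is not the actual cron job: `snapshot_path` and `download_snapshot` are hypothetical names, the dated file naming is an assumption, and I have not verified that Python's urllib gets the same 200 response that wget got earlier in this thread (a regular browser got an authorization failure). The licensing question above still applies to anything downloaded this way.

```python
import shutil
import urllib.request
from datetime import date
from pathlib import Path

# Endpoint as posted at the top of this issue.
API_URL = "https://bioportal.salud.gov.pr/api/administration/reports/minimal-info-unique-tests"


def snapshot_path(directory, day):
    """Build a dated file name so each day's download is kept separately."""
    return Path(directory) / f"minimal-info-unique-tests-{day.isoformat()}.json"


def download_snapshot(directory, day=None):
    """Stream the API response to a dated file and return its path."""
    day = day or date.today()
    target = snapshot_path(directory, day)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(API_URL) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Keeping one file per day also makes it possible to compare snapshots later and see how the data changes as it gets cleaned upstream.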

gabynevada commented 4 years ago

Hey all, I'm the project manager for the BioPortal system.

I'll start inquiring on what we need on the legal side for this to proceed.

On the data side, we're in the process of cleaning it up as much as possible.

The data is directly reported from the island's testing facilities.
If it comes via interface, it comes in mostly as is. Later we run missing or incorrect value identification and send notifications to the testing facilities to fix their reported data.
We have teams cleaning up the data, but are focusing on the tests for the positive Covid-19 cases as those are of the most importance to us.
Data may change as it's cleaned/verified by our personnel or the testing facilities.

Some answers to the questions here:

Molecular in this case are PCR tests.
The test results vary depending on the kit being used.
Presumptive Positive in most cases means that it requires additional confirmation; it depends on the kit that was used.
The createdAt field is just a timestamp of when it was reported or uploaded into the system; it has no other correlation to the data.
The missing collectedDate and reportedDate fields are caused by testing facilities not reporting all required data. The system was implemented after the pandemic had begun and the prior process was mostly manual. We're still in the process of improving the data collection methods for the tests.

I'm available for any questions and I'll see what I can do to help in this process.

Regards

muamichali commented 4 years ago

Hey @gabynevada! Thank you so much for interacting with us here. I will send you an email so we can have a better way to find out what the status is on the legal side.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions!

Nosferican commented 4 years ago

Any updates on this? @gabynevada, any updates on the data quality issues?

gabynevada commented 4 years ago

@Nosferican The test results that did not match the test types were verified and cleaned. We're still working on additional interfaces with Laboratory Information Systems to get as much electronic information as possible.

We're working with the legal side on the Terms and Conditions on the API. It's taking longer than expected.

muamichali commented 3 years ago

I believe that @space-buzzer has found a way to upload the PR API info into our auxiliary dataset after some cleanup steps. I am closing this issue. @space-buzzer if there is public documentation of this work, can you please link it here?


[PR] Puerto Rico Department of Health testing data API #645
