Closed sacundim closed 3 years ago
Wow, @sacundim! Thank you so much for this and all the helpful info you provide us about PR. We are going to follow this closely.
Agreed. I watched the governor's press conference yesterday which included some of the work by @rafalab et al. Thanks for sharing the API announcement @sacundim!
It seems the license of the data falls under the EULA
Data owned by the PRDoH's web sites and/or services may not be downloaded, republished, resold, duplicated, or "web scraped", in whole or in part, for any purpose other than the personal use permitted in these Terms of Use.
Which basically says it can only be used for personal use. Hopefully we can get the all clear from the copyright holders to incorporate the API in the project (ref: https://twitter.com/Nosferican/status/1284323770694619136).
All the tests are tagged as Molecular, so I can't understand the result field:
So the "Molecular" test type does not mean PCR?
I'm afraid I haven't seen any real documentation, @space-buzzer. The number of records in the spreadsheet I saw earlier this week was about 275K, which is consistent with all tests being PCR based on other information.
The approach I took, personally, is that the strange values like "Positive IgM Only" and "Positive IgM and IgG" are very infrequent so they don't make a big dent on the counts, which are only going to be an approximation anyway due to issues with date data. I personally assumed that they're positive PCR results.
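One way to check how rare those odd values are is to tally the `result` field. A minimal Python sketch, assuming each record is a dict with a `result` key as in the API's JSON; the sample records and the set of "expected" values are made up for illustration and are not taken from the real data:

```python
from collections import Counter

# Hypothetical sample records; the real file is one large JSON list.
records = [
    {"result": "Negative"},
    {"result": "Negative"},
    {"result": "Positive"},
    {"result": "Positive IgM Only"},
]

# Assumption: these are the values one would expect for a PCR test.
EXPECTED = {"Positive", "Negative"}


def unusual_share(records):
    """Return (counts per result value, fraction of records with unexpected values)."""
    counts = Counter(r.get("result") for r in records)
    total = sum(counts.values())
    odd = sum(n for value, n in counts.items() if value not in EXPECTED)
    return counts, (odd / total if total else 0.0)


counts, share = unusual_share(records)
print(counts.most_common())
print(f"{share:.1%} of records have an unexpected result value")
```

If the share comes out very small on the real download, treating those rows as positives (or dropping them) should barely move the aggregate counts.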
Some context (that I don't fully understand): the data are extracted from the PRDoH's website ("Bio Portal") that has a web form for labs to report test results. So I think these are probably a mix of data entry errors by users and bugs in that web UI that allowed weird values to be entered.
Oops, I hit "Close issue" by accident.
@Nosferican Do you have a link to that EULA?
It's under the terms and services agreement when signing up for an account at the bioportal.
I am not a lawyer nor a professional translator, but I'll take a stab at this specific section:
Intellectual property
Most of the content of our web sites is public domain and doesn't include copyright or other intellectual property statements.
The government's information is normally public domain. Public domain information can be used and copied freely. However, you're authorized to use our contractor's computer software and the related database only for informational ends and not for direct or indirect commercial purposes. Also, other materials in the web sites and/or services that is not government material cannot [sic] be copyright protected and cannot be used for any direct or indirect commercial purpose. Any use not authorized in this document is forbidden. Any copy made of materials must conserve all copyrights and other warnings. Except as explicitly provided in these Terms of Use, you may not reproduce, modify, publish, transmit, show, produce, distribute, disseminate, disseminate [sic] or disseminate any materials to third parties, nor participate in the transfer or sale, create derived works or in any manner exploit the contents of the web sites and/or services of the PRDoH or in any part of these without prior written consent by PRDoH.
I've put "[sic]" on some bits that I rendered literally but understand to be flat out wrong. In particular, the part that says that "other materials in the web sites and/or services that is not government material cannot [sic] be copyright protected and cannot be used for any direct or indirect commercial purpose" is just contradictory and I think it is intended to read "may."
I think one important piece of context is that this Bioportal web site is normally used as a web UI for:
And they've apparently added a public API endpoint to publish the testing data that the system contains.
Original Spanish:
Propiedad intelectual
La mayor parte del contenido de nuestros sitios web es de dominio público y no incluye avisos de derechos de autor u otros avisos de propiedad intelectual.
La información del gobierno normalmente es de dominio público. La información de dominio público se puede distribuir y copiar libremente. Sin embargo, está autorizado a utilizar el software de la computadora de nuestro contratista y la base de datos relacionada solo con fines informativos y no con fines comerciales directos o indirectos. Además, otro material en los sitios web y / o servicios que no sea material del gobierno no puede estar protegido por derechos de autor por entidades privadas y no puede usarse para ningún propósito comercial directo o indirecto. Cualquier uso no autorizado en este documento está prohibido. Cualquier copia que haga de los materiales debe conservar todos los derechos de autor y otros avisos. Salvo lo dispuesto expresamente en estos Términos de Uso, no puede reproducir, modificar, publicar, transmitir, mostrar, realizar, distribuir, difundir, difundir o transmitir ningún material a terceros, ni participar en la transferencia o venta de, crear trabajos derivados o de cualquier manera explotar el contenido de los sitios web y / o servicios del DSPR o cualquier parte de estos sin el previo consentimiento por escrito del DSPR.
In this case, a simple written notice from the copyright holder / publisher saying the generic catch-all EULA doesn't apply to the dataset and that it can be used freely should suffice.
The generic boilerplate seems to suggest the API may not be that well maintained. However, if we can get a contact person / data steward for basic questions such as the ones @space-buzzer mentioned, that would be great. The data curation can occur upstream but that still requires some basic data lineage / governance information. @rafalab might be able to help with making that initial contact with the dept.
Observations from playing a little bit with the API:
- The `createdAt` field might be more reliable than `reportedDate`. With two downloads that I've done about 8 hours apart on 2020-07-17 and 2020-07-18, I'm noting some differences in aggregates for `createdDate` and `reportedDate` between 2020-07-09 and 2020-07-16 (inclusive). Looks like there is some lag such that data with `reportedDate` in this range was still being added to the data set on the night of the 17th to 18th.
- The `createdAt` field seems to be UTC.
Hi @Nosferican,
@gabynevada can help clarify some of the provided data; he is the lead maintainer of the public API.
Good day.
I set a `cron` job to download it daily, but nothing will happen until we figure out:
Hey all, I'm the project manager for the BioPortal system.
I'll start inquiring on what we need on the legal side for this to proceed.
On the data side, we're in the process of cleaning it up as much as possible.
Some answers to the questions here:
I'm available for any questions and I'll see what I can do to help in this process.
Regards
Hey @gabynevada! Thank you so much for interacting with us here. I will send you an email so we can have a better way to find out what is the status on the legal side.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions!
Any updates on this? @gabynevada, any updates on the data quality issues?
@Nosferican The test results that did not match the test types were verified and cleaned. We're still working on additional interfaces with Laboratory Information Systems to get as much electronic information as possible.
We're working with the legal side on the Terms and Conditions on the API. It's taking longer than expected.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions!
I believe that @space-buzzer has found a way to upload the PR API info into our auxiliary dataset after some cleanup steps. I am closing this issue. @space-buzzer if there is public documentation of this work, can you please link it here?
State or US: Puerto Rico
Just saw an informal announcement from a biostatistician collaborating with Puerto Rico's Department of Health that they now have an API for downloading test data:
Note that when I hit the latter URL with a regular browser I get an authorization failure message, but when I `wget` from it I get a 48.78M JSON data file:
The file is a single big JSON list with individual records for each test. I saw an earlier version of this data the other day (it was shared as an Excel file) and there are some data quality issues. Summary of the ones I'm aware of:
- The `result` field is not normalized. I just treat the ones that match `Positive` anywhere inside as positive, the rest as negative.
Here are the cleanups I ended up doing on it if it's of any help:
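For illustration only (this is not the cleanup code referred to above), the `result` rule described earlier, where anything containing `Positive` counts as positive and everything else as negative, could be sketched in Python roughly like this. The function name is hypothetical, and the case-sensitive match and the handling of missing values are assumptions:

```python
def is_positive(result):
    """True if the raw `result` string contains "Positive" anywhere."""
    # Assumption: case-sensitive match; missing or non-string values count as negative.
    return isinstance(result, str) and "Positive" in result


# Sample values as mentioned in this thread, plus a missing value.
samples = ["Positive", "Negative", "Positive IgM Only", "Positive IgM and IgG", None]
for value in samples:
    print(repr(value), "->", "positive" if is_positive(value) else "negative")
```

A substring match keeps the rule simple, but it would also count a hypothetical value such as "Not Positive" as positive, so it is worth tallying the distinct values first.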