Consistency of Identifiers (Surprise! Country/Region is now Country_Region...)

CSSEGISandData / COVID-19

Novel Coronavirus (COVID-19) Cases, provided by JHU CSSE

https://systems.jhu.edu/research/public-health/ncov/


Consistency of Identifiers (Surprise! Country/Region is now Country_Region...) #1326

Closed yetzt closed 4 years ago

yetzt commented 4 years ago

Dear whoever is at fault for this,

Please do not ever, under any circumstances, rename or otherwise change the identifiers in your data. Especially not as a surprise. Especially especially when so many people rely on the data being available.

People are writing software, relying and depending on the stability of APIs and data structures. You just annoyed and frustrated many people.

Please consider this plea

(And while we're talking: removing US States and replacing them with municipal data isn't considered very nice either. Expand your data, don't replace it. Create separate files. Open Data 101: Before you do things, ask yourself: will this break things downstream?)

glennparham commented 4 years ago

Are they going to be changing this back soon??

rben01 commented 4 years ago

The removal of aggregated US states is really frustrating me. Naively, it's easy enough to just group-by state and sum cases over all of its cities, but there is nothing guaranteeing that all of a state's cities are included in the data. I'd have more confidence if all three levels of aggregation were provided in the raw data: city level, state level, and country level case numbers.
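For illustration, the naive group-by described above might look like the following sketch. The column names `Admin2`, `Province_State` and `Confirmed`, the helper `sum_by_state`, and the sample rows are assumptions made up for this example, not the repository's documented schema. As the comment points out, the result is only as complete as the rows that happen to be in the file.

```python
import csv
import io
from collections import defaultdict

# Hypothetical county-level sample; names and numbers are invented.
SAMPLE = """Admin2,Province_State,Country_Region,Confirmed
King,Washington,US,10
Snohomish,Washington,US,5
Kings,New York,US,7
"""

def sum_by_state(csv_text, state_col="Province_State", value_col="Confirmed"):
    """Sum value_col over all rows that share the same state_col."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Treat an empty cell as zero rather than failing on int("").
        totals[row[state_col]] += int(row[value_col] or 0)
    return dict(totals)

state_totals = sum_by_state(SAMPLE)
print(state_totals)
```

A state whose counties are missing from the input would simply be undercounted here, with no warning, which is the concern raised above.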

danleonard-nj commented 4 years ago

Wow - this totally broke an ETL package that took 12+ hours to write. Talk about a weird move.

chrisdane commented 4 years ago

i don't agree with the op: sometimes it's unavoidable to change things so that "things will break downstream". i would argue the other way around: write things so that they don't break in such cases.
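One way to read "write things so that they don't break" is a reader that normalizes header names before using them. The sketch below is a minimal illustration under that assumption; `canonical` and `read_rows` are hypothetical helper names, and the approach only absorbs changes in separators or letter case, not genuine renames or restructured files.

```python
import csv
import io
import re

def canonical(name):
    """Lowercase a header and collapse runs of non-alphanumerics to '_'."""
    return re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")

def read_rows(csv_text):
    """Yield rows as dicts keyed by canonical header names."""
    reader = csv.reader(io.StringIO(csv_text))
    header = [canonical(h) for h in next(reader)]
    for values in reader:
        yield dict(zip(header, values))

# The two header spellings from this issue map to the same key.
old = "Country/Region,Confirmed\nGermany,1\n"
new = "Country_Region,Confirmed\nGermany,1\n"
old_rows = list(read_rows(old))
new_rows = list(read_rows(new))
print(old_rows == new_rows)
```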

yetzt commented 4 years ago

@chrisdane It's rarely unavoidable to suddenly break things. Whoever produces high-stakes data should use deprecation procedures to avoid scenarios like this. In this case it would have been simple to put the new format in a separate path and produce the current format from the new data. If, for example, the US suddenly used metric instead of imperial measures, loads of things would break, so they don't, or they use the alternative system in parallel.
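The "produce the current format from the new data" idea can be sketched on the publisher side as a small conversion step. This is an illustration only: the `NEW_TO_LEGACY` mapping and the `to_legacy` helper are hypothetical, and only the `Country/Region` and `Country_Region` pair is taken from this issue; the other column names are assumed.

```python
import csv
import io

# Hypothetical mapping from new header names back to the legacy ones.
NEW_TO_LEGACY = {
    "Country_Region": "Country/Region",
    "Province_State": "Province/State",  # assumed by analogy
}

def to_legacy(new_csv_text):
    """Re-emit new-format CSV text with the legacy header names."""
    reader = csv.DictReader(io.StringIO(new_csv_text))
    fields = reader.fieldnames
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Unmapped columns keep their name; column order is preserved.
    writer.writerow([NEW_TO_LEGACY.get(f, f) for f in fields])
    for row in reader:
        writer.writerow([row[f] for f in fields])
    return out.getvalue()

new_text = "Province_State,Country_Region,Confirmed\nBavaria,Germany,3\n"
legacy_text = to_legacy(new_text)
print(legacy_text)
```

Publishing the output of such a step at the old path, and the new layout at a separate path, would let existing consumers keep working during a deprecation period.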

yetzt commented 4 years ago

yet another surprise: the filenames for the timeline files were changed. this repository is not trustworthy. we can't rely on the stability of fieldnames or filenames, identifiers, date formats, ...

chrisdane commented 4 years ago

really don't understand your problem @yetzt. i can use the time_series*.csv files without any problems: https://github.com/chrisdane/COVID-19/tree/mybranch/r_plots#Germany. the renaming of these files was even pinned on their issue board. you're making an elephant out of a gnat (a mountain out of a molehill), i feel =/

chrisdane commented 4 years ago

it's even in their readme:

The Johns Hopkins University hereby disclaims any and all
representations and warranties with respect to the Website, 
including accuracy, fitness for use, and merchantability. 
Reliance on the Website for medical guidance or 
use of the Website in commerce is strictly prohibited.

yetzt commented 4 years ago

@chrisdane scrapers don't read pinned issues and the changes weren't even mentioned in #1250

read it carefully. nothing about removing files, nothing about changing field identifiers, all changes in relation to the time series and not the daily reports. and you ask what the stakes are: the most regarded data visualisation on corona in germany went down or displayed erroneous data as a result of these issues. not once, but multiple times. not a good thing to happen in a time of crisis where information is crucial.

as i mentioned in #1615 i've migrated my code away from this repository, since it does not adhere to any quality standards or best practices on open data. when we get past this, i will share my experience with the broader data journalism community.