covidatlas / li

Next-generation serverless crawler for COVID-19 data
Apache License 2.0

Notification for anyone who's ever opened a GitHub issue about our new reports. #284

Closed jzohrab closed 4 years ago

jzohrab commented 4 years ago

Hi there!

If I've tagged you in this issue, it's because you have opened up a GitHub issue in https://github.com/covidatlas/coronadatascraper/issues, so I assumed you were looking at the data files at https://coronadatascraper.com/#home. (Sorry if this is spam, or if I've otherwise already contacted you.)

We're changing our reports ... below is an email I sent to known consumers. Please read it, and let us know if you have any questions/concerns with the beta files, as shown below. Cheers, have a good one! z


Email

Hi all,

I'm part of the Covid Atlas team (formerly called Corona Data Scraper).

We're transitioning to a new project named Li, and have a beta release of the daily data files which will eventually replace the current files at https://coronadatascraper.com/#home. The new files are almost identical in format to the old, but there are some changes, so if you are currently using the files at coronadatascraper.com we'd like you to check out the betas and give us your feedback or concerns.

A summary of the new Li files and how records compare with those of Corona Data Scraper files is at https://github.com/covidatlas/li/blob/master/docs/reports.md. You can download a zip of the beta files at https://drive.google.com/file/d/1p-T35XkOBqDp4ixykHaGBF1V5UXxW9mm/view?usp=sharing. (Disregard the "preview error", and click "Download" to get the 11 MB zip.)

The betas are generated daily and stored at https://listaging-reportsbucket-1bjqfmfwopcdd.s3-us-west-1.amazonaws.com/beta/latest/. Append the report name to the URL to download it (e.g., https://listaging-reportsbucket-1bjqfmfwopcdd.s3-us-west-1.amazonaws.com/beta/latest/features.json).
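For anyone scripting the download, here is a minimal sketch of the "append the report name" step. The base URL is copied from the paragraph above; the helper name report_url is hypothetical, and the sketch only builds the URL, it does not fetch anything.

```python
from urllib.parse import urljoin

# Base URL of the beta reports, copied from the email text above.
BETA_BASE = "https://listaging-reportsbucket-1bjqfmfwopcdd.s3-us-west-1.amazonaws.com/beta/latest/"


def report_url(report_name: str, base: str = BETA_BASE) -> str:
    """Return the download URL for one report by appending its name to the base."""
    # Normalize so there is exactly one slash between the base and the report name.
    return urljoin(base.rstrip("/") + "/", report_name.lstrip("/"))


if __name__ == "__main__":
    print(report_url("features.json"))
```

The same helper should work for any other bucket that follows the same layout by passing a different base.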

If you have any feedback, let us know! You can reply to this email, open an issue in GitHub, ping us on the #consumers channel on Slack -- I'm @jzohrab in there -- or suggest an edit in the shared Google doc I'll be using to collate all feedback.

I'll address feedback as I have time. Based on that, we'll select a date for us to do the cutover to the new reports. We'll likely serve the new reports at the current URL, https://coronadatascraper.com/#home, as well as at new URLs. We'll sort all of that out as we go.

Thanks very much, hope you're all doing well!

Regards, Jeff


Heads up to @1ec5 @Akuukis @BartJohnson @DavidGeeraerts @DrLeoSpaceman @DreamITSoftware @FortDigital @HerbCaudill @JeremyKulcsar-DS @JimBudde @Jord-Holt @Karthi9934 @MVLBAct @Motiv8foru @NateBaldwinDesign @NickSto @PeterBloomingdale @Pranjalya @Rathna-K @S-Wallace-OH @ScottWOlson @Wikunia @a0s @abdoulsn @akkana @alangwilson @aledettaale @alexmill @anddon333 @anuragsodhi @appastair @arnt @arorapankaj @asarfati1985 @camjc @caticoa3 @catshark9 @cburkins @ch3ft0ny @chfritz @chschoenenberger @chunder @ciscorucinski @clausgp @cristipp @davegotz @debusklaneml @deepak3081996 @dhicks 👋

jzohrab commented 4 years ago

Additional heads ups:

@divisia @dkulp2 @dmedwards @dotysan @edend10 @edwlook @ekaterinakuzmina @elaborative @emclain @ersaurabhex @firebuggirl @fordmaxson @ghop02 @gitcnd @greg-minshall @gschmeckpeper @handcoding @hannahklauber @hannahleeCAN @heatxsink @hyperknot @inspectordanno @jacobmcgowan @jeremyruple @jgehrcke @jocooper43016 @joliss @joshuaellinger @jsomer @judepayne @kabeerAhmed09 @kb1ujs @kdn88 @kendonB @ktice @ladeane00 @lazd @ldtcooper @leon-wong1949 @loftusa @lori99-data @mark-otaris @martiL @michaelvacosta @microprediction @mightybyte @mikelehen @ms-jdow @ndom91 @ngolosov

jzohrab commented 4 years ago

Additional heads ups:

@nilslindemann @ntranisi @okamoun @oneviewdata @pablocarreraest @paul-em @piccolbo @pkaplan2524 @praging @prem121121 @pwrose @qgolsteyn @rafaelsabino @raysalem @razumovs @rcoenen @ret394 @reyemtm @rg3h @roboter202 @roboton @rtwfroody @ryanblock @sagarkulkarny @shaperilio @skent259 @slezakbs @sorny92 @ssljivar @stevenqzhang @strawberry-code @subzero79 @tautme @timsifive @tlampe0615 @trainh2o2 @travishaby @veyEskelson @vickyrathee @webstergl @weslawson @yoctozepto @zbraniecki

rg3h commented 4 years ago

Hi Jeff, I want to express how much I appreciate your team's effort on this. I look forward to learning more about the new report and data format.

I wrote several node/npm tools to process the data into json objects that are useful for my specific reports. Happy to share the process if you are interested in a video conference to discuss. Due to the recent errors I was seeing in the atlas reports as you transition, I started investigating JHU and other APIs (this api has some nice json objects: https://corona.lmao.ninja/docs/ and perhaps could be informative to your design) and writing processors for these data sources (JHU is in csv, separates the US, is missing some data, lacks a consistent id for each region, etc).

As a UX and frontend architect, once the data is ETL'd, I am working on an information-centric version, an analytics-centric version, and a mobile version, but they are not quite ready since I need to move to a new API (from atlas, JHU, or another one).

Happy to share my design/thoughts/current state and, again, thanks so much for your team's work! Rich (rich.gossweiler@gmail.com)

jzohrab commented 4 years ago

Hi @rg3h , thanks for all of the great feedback! Yes, that would be interesting to see how you're using the data.

Apologies for the errors, but we're starved for resources, and transitioning the reports is the right thing for us to be working on at the moment.

That's a good link, thanks. We have a primitive draft doc for an API (https://docs.google.com/document/d/1Rdcy0D9C2jZvpgdH4-cOxybOkgiiWTlCUn_RsnAUOKQ/edit?pli=1#) which we will likely use as the source for our reports eventually, but the reports took precedence.

Can you get on our Slack, and give me a shout? I'm @jzohrab in there. Cheers! z

rtwfroody commented 4 years ago

It looks like the new data only goes back to May 24. Is there a plan to get the old data into this new format as well?

jzohrab commented 4 years ago

Hi @rtwfroody - can you tell me where this is happening? This could be simply an error, we need to add more verification. Everything should be there. (fyi @joliss). Thanks for the heads up! z

rtwfroody commented 4 years ago

Looking at https://listaging-reportsbucket-1bjqfmfwopcdd.s3-us-west-1.amazonaws.com/beta/latest/timeseries-byLocation.json Data for the US and China start at 2020-05-24, although e.g. Zurich goes all the way back to 2020-02-26.
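A quick way to spot this kind of gap is to list the earliest date present for each location. The sketch below is only illustrative: it assumes each entry in timeseries-byLocation.json is an object with a "name" field and a "timeseries" object keyed by ISO dates, which is a guess at the layout rather than the documented schema. The sample is hand-made (dates taken from the comment above, case counts are placeholders).

```python
import json


def first_dates(locations, series_key="timeseries"):
    """Map each location name to the earliest date key in its time series."""
    result = {}
    for loc in locations:
        dates = loc.get(series_key) or {}
        if dates:
            # ISO 8601 dates (YYYY-MM-DD) sort correctly as plain strings.
            result[loc["name"]] = min(dates)
    return result


# Hand-made sample in the ASSUMED shape; not real report data.
sample = json.loads("""
[
  {"name": "Zurich", "timeseries": {"2020-02-26": {"cases": 1}, "2020-02-27": {"cases": 2}}},
  {"name": "United States", "timeseries": {"2020-05-24": {"cases": 10}}}
]
""")

if __name__ == "__main__":
    # Print locations ordered by their first available date.
    for name, start in sorted(first_dates(sample).items(), key=lambda kv: kv[1]):
        print(start, name)
```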

jzohrab commented 4 years ago

Thanks @rtwfroody , we'll check it out (fyi @joliss).

jzohrab commented 4 years ago

Hi all, @rtwfroody noted that the reports were missing data in issue 294. The data was missing in staging, but we've promoted the report generation to production and it looks like the data is more complete. The new reports are in the production bucket:

https://liproduction-reportsbucket-bhk8fnhv1s76.s3-us-west-1.amazonaws.com/beta/latest/{FILENAME}

where {FILENAME} is the report name, e.g. timeseries-byLocation.json.

I'm continuing working through issues. Cheers all!

dkulp2 commented 4 years ago

Perhaps related to #302, "tested" starts on 2020-07-09 for almost every US state.

library(readr)
library(dplyr)
# Count US states by the first date that has a non-missing "tested" value.
ts <- read_csv('https://liproduction-reportsbucket-bhk8fnhv1s76.s3-us-west-1.amazonaws.com/beta/latest/timeseries.csv', col_types=cols_only(level='c',city='c',county='c',state='c',country='c',population='d',date='D',cases='d',deaths='d',tested='d'))
filter(ts, country=='United States' & level=='state' & !is.na(tested)) %>% group_by(state) %>% summarize(first_date=first(date)) %>% group_by(first_date) %>% tally()
# A tibble: 6 x 2
  first_date     n
  <date>     <int>
1 2020-03-03     1
2 2020-03-21     1
3 2020-05-24     2
4 2020-05-25     1
5 2020-05-26     1
6 2020-07-09    46

This is using the production bucket referenced above. Is it a clue that running this on the staging bucket URL that you shared earlier shows the same 46 states all starting their testing counts on 2020-06-21?

Is the plan to provide the full history of test counts? Any ETA? Don't mean to be a pain, but since test counts are missing from the old scraper, too, I'm looking for some guidance. Thanks!
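For readers who don't use R, the same tally can be sketched with the Python standard library. The column names (level, state, country, date, tested) come from the read_csv call above; the function name and the inline sample rows are made up for illustration. One difference from the R version: this takes the minimum date per state instead of relying on row order.

```python
import csv
import io
from collections import Counter


def first_tested_tally(rows):
    """Count US states by the first date on which 'tested' is present."""
    first_date = {}
    for row in rows:
        if row["country"] != "United States" or row["level"] != "state":
            continue
        if row["tested"] in ("", "NA"):
            continue  # treat empty or NA cells as missing, like is.na() in R
        state, date = row["state"], row["date"]
        # Keep the earliest ISO date seen for each state.
        if state not in first_date or date < first_date[state]:
            first_date[state] = date
    return Counter(first_date.values())


# Hand-made sample rows with placeholder state names; not real report data.
SAMPLE_CSV = """level,state,country,date,tested
state,A,United States,2020-07-08,
state,A,United States,2020-07-09,100
state,B,United States,2020-07-09,200
state,B,United States,2020-07-10,250
state,C,United States,2020-03-21,5
country,,United States,2020-03-01,1
"""

if __name__ == "__main__":
    tally = first_tested_tally(csv.DictReader(io.StringIO(SAMPLE_CSV)))
    for date, n in sorted(tally.items()):
        print(date, n)
```

To run it against the real file, the rows would come from csv.DictReader over the downloaded timeseries.csv instead of the inline sample.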

jzohrab commented 4 years ago

Hi @dkulp2 , thanks very much, I'll look into this later today. Great to have more eyes on the data, appreciate the time! Perhaps the sources didn't report testing data? Not sure, need to check, opened #313 to record notes. Cheers, jz

zbraniecki commented 4 years ago

Not sure if I should start reporting new issues for Li yet, but today's https://liproduction-reportsbucket-bhk8fnhv1s76.s3-us-west-1.amazonaws.com/beta/latest/timeseries-byLocation.json is missing the country Poland.

EDIT by @jzohrab : this is fixed now, thanks!

jzohrab commented 4 years ago

Thanks @zbraniecki - yes please do report them in issues.

@dkulp2 - I've rectified the lack of "tested" items -- see issue #313. I've found a different issue which I'll link here.

jzohrab commented 4 years ago

Considering disabling or compressing timeseries-tidy.csv.gz generation (issue #323). Will update docs and here if I do so.

Update: We now have timeseries-tidy-small.csv and locations.csv, which will take the place of the old report (which always crashes). Disabling -tidy.csv.gz now.

jzohrab commented 4 years ago

Adding @acertas to ticket for visibility.

jzohrab commented 4 years ago

Adding @mikelehen from the CDS issue 1062 mentioned above. 👋

rtwfroody commented 4 years ago

What's the status of these reports? Are they still beta? Can I assume they're updated daily?

jzohrab commented 4 years ago

Still beta at the moment. (I need to add roll ups for countries, and port a few remaining scrapers from the earlier project — only a few of them.) But they are stable and are regenerating every couple of hours.

One potential schema change we may want to make: one person had requested that we change the name of the “timeseries” child element back to “date”. Then I would probably also change “timeseriesSource” to “dateSource”. Thoughts?
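Consumers who want their code to survive a rename like this could normalize key names on load. The following is a small sketch, not part of Li: rename_keys is a hypothetical helper, the mapping uses the names floated in this comment (the final names may differ), and the sample record is a placeholder.

```python
def rename_keys(record, mapping):
    """Return a copy of record with top-level keys renamed per mapping."""
    # Keys that are not in the mapping are kept unchanged.
    return {mapping.get(key, key): value for key, value in record.items()}


# Names as proposed in the comment above; treat them as provisional.
PROPOSED = {"timeseries": "date", "timeseriesSource": "dateSource"}

# Placeholder record in the old style, with empty payloads.
old_style = {"name": "Zurich", "timeseries": {}, "timeseriesSource": {}}
new_style = rename_keys(old_style, PROPOSED)

if __name__ == "__main__":
    print(sorted(new_style))
```

Applying the helper to a record that already uses the new names leaves it unchanged, so the same loading code can read both versions.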

With that change done, these would become the v1 release reports. I have a few changes I would make for v2, but we would continue to run v1.

Cheers! Jz


rg3h commented 4 years ago

Is it date or dateList? (side note: I find "date" and "dates" too similar so I usually name my lists to avoid subtle typos). If it is a list of dates, I prefer dateList to timeSeries.

There are two hard things in computer science: cache invalidation, naming things, and off-by-one errors. — Leon Bambrick


jzohrab commented 4 years ago

Hi all, a few notes:

  • we've added latest.json and latest.csv, which replace data.json (https://coronadatascraper.com/#data.json) and data.csv (https://coronadatascraper.com/#data.csv) from CDS (containing the latest data for each location). Those should show up in staging in a couple of hours, and when they're good I'll promote them to production.
  • I have a few small changes to make to timeseries-byLocation.json: I'll rename the "timeseries" child node back to "dates", in accordance with the old reports from CDS. Then "timeseriesSources" will be renamed to "dateSources". (Sorry @rg3h :-) )

When the above are done and verified in staging and beta production, I'll promote those reports to something like bucket-name/v1/, e.g. https://listaging-reportsbucket-1bjqfmfwopcdd.s3-us-west-1.amazonaws.com/v1/latest/timeseries-byLocation.json. Then I'll change the reports at coronadatascraper.com to use these new reports, which is a big step in the right direction for us.

In future we'll likely have some human-readable endpoint instead of that ugly bucket name. We will also have some kind of schema verification so that the data structure and format doesn't change for v1, and any further report changes will go into a future version v2.

rg3h commented 4 years ago

No worries -- I thought you were referring to a variable name in the json, not the filename. I used to ETL the data into covidByLocation and covidByDate, but now find that covidByLocation is sufficient. I've added some support data and converted it to row-major arrays to reduce the size and increase the network speed, storage, and parsing. This is helpful on a mobile device, but eventually I will have a tighter more versatile level-of-detail API (getWorld, getContinentList, getCountryList, getRegionList...) probably with a dateRange parameter. This will generalize to additional data layer integration as well (epidemiology models, traffic patterns, data analytics tools, comments, etc).

When we talk, I can go through some of the ETL (dealing with disputed borders, country name changes, creating IDs for everything, etc). Thanks again for your awesome work!
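The row-major conversion mentioned in the comment above could look roughly like the sketch below. This is an illustration of the general idea, not Rich's actual code: the function names, the "columns"/"rows" layout, and the sample records are all assumptions. The saving comes from writing each field name once instead of once per record.

```python
import json


def to_row_major(records, columns):
    """Pack a list of dicts into one header list plus one value row per record."""
    rows = [[rec.get(col) for col in columns] for rec in records]
    return {"columns": list(columns), "rows": rows}


def from_row_major(packed):
    """Rebuild the list of dicts from the packed form."""
    cols = packed["columns"]
    return [dict(zip(cols, row)) for row in packed["rows"]]


# Placeholder records; the values are not real report data.
records = [
    {"date": "2020-07-01", "cases": 1, "deaths": 0},
    {"date": "2020-07-02", "cases": 2, "deaths": 0},
    {"date": "2020-07-03", "cases": 3, "deaths": 1},
]
packed = to_row_major(records, ["date", "cases", "deaths"])

if __name__ == "__main__":
    # Compare serialized sizes of the two layouts.
    print(len(json.dumps(records)), len(json.dumps(packed)))
```

The gap between the two sizes grows with the number of records, since the per-record key overhead disappears in the packed form.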

jzohrab commented 4 years ago

A note to those playing along at home: in staging, I've moved the "beta" reports to "v1", eg:

https://listaging-reportsbucket-1bjqfmfwopcdd.s3-us-west-1.amazonaws.com/v1/latest/timeseries-byLocation.json.
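If you have saved "beta" URLs in scripts, switching them to "v1" is a one-segment path change. A minimal sketch, assuming the only difference between the two layouts is that path segment (the helper name switch_version is hypothetical):

```python
from urllib.parse import urlsplit, urlunsplit


def switch_version(url: str, old: str = "beta", new: str = "v1") -> str:
    """Replace the first path segment equal to `old` with `new`."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    if old in segments:
        # Only whole segments match, so a file name containing "beta" is untouched.
        segments[segments.index(old)] = new
    return urlunsplit(parts._replace(path="/".join(segments)))


if __name__ == "__main__":
    beta = "https://listaging-reportsbucket-1bjqfmfwopcdd.s3-us-west-1.amazonaws.com/beta/latest/timeseries-byLocation.json"
    print(switch_version(beta))
```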

I'm working through some front-end issues before officially launching this to production. Cheers! z

rg3h commented 4 years ago

thanks again for your hard work! Rich


jzohrab commented 4 years ago

🎉 🎉 🎉 and 🦄 .

The v1 reports have been launched and are available at https://covidatlas.com/data.

The side links on https://coronadatascraper.com/#home also give some (but not all) of the reports.

Cheers all, I'm going to close this issue out. If you note any problems, please open new issues. And if you happen to know any JS devs who are available and interested in contributing, please let me know in Slack, because I could use some assistance in closing existing issues out! :-)

Cheers and regards to all, jz

arorapankaj commented 4 years ago

@jzohrab : Are you planning to note on the CoronaDataScraper website that the data has moved to covidatlas.com?

jzohrab commented 4 years ago

Hi @arorapankaj - Good idea :-) We'll likely eventually just forward those links to point to the new reports. We'll probably also shut down that site eventually, it's not quite accurate now.