covidatlas / coronadatascraper

COVID-19 Coronavirus data scraped from government and curated data sources.
https://coronadatascraper.com
BSD 2-Clause "Simplified" License

move coronavirus-data-sources to main repo #208

Closed hyperknot closed 4 years ago

hyperknot commented 4 years ago

What do you think about including it in the main repo instead of submoduling it? I believe the data sources repo is not that big, especially if we move to country-level-ids. It rarely changes, and including it in the main repo would speed up the development process.

qgolsteyn commented 4 years ago

I would support this change. @lazd what do you think?

lazd commented 4 years ago

I could be into this. Two things:

  1. Note that we also use another submodule within coronavirus-data-sources, which could complicate things.
  2. We ideally want to move population data into the scrapers themselves

hyperknot commented 4 years ago

2. We ideally want to move population data into the scrapers themselves

I disagree with this part. Population data will be a tiny JSON file, much better to manage centrally. I have it globally down to state level; for county level we can use https://eric.clst.org/tech/usgeojson/ to cover all counties.

lazd commented 4 years ago

@hyperknot ok, I'll defer to you on that. Note that we have a CSV with population that I pulled from census data. Eric's GeoJSON, which we are already using for county-level GeoJSON, does not include population data.

hyperknot commented 4 years ago

Oh, then that whole webpage is nothing more than shp2geojson on the shapefiles? Not impressive.

OK, for county level that CSV file is perfect, I believe. For the other ones, I'll provide a JSON.

lazd commented 4 years ago

@hyperknot can we roll that CSV into your repo so it can be delivered in the same manner as the state/country-level data?

hyperknot commented 4 years ago

@lazd yes, I was thinking of that. Making the counties into GeoJSON + the CSV into JSON. I'm going to submit a PR for the state level ones, that one comes after.

hyperknot commented 4 years ago

@lazd can we do this? This submoduling is breaking master now; for example, that brazil file is missing from the master repo's version.

hyperknot commented 4 years ago

I've found an answer that Git can now track the master branch in a submodule, so that might be a promising solution for us: https://stackoverflow.com/a/9189815/518169

hyperknot commented 4 years ago

This SO answer mentions how to make an existing submodule auto-update, so maybe

git submodule set-branch --branch master -- coronavirus-data-sources

would work.

jzohrab commented 4 years ago

With all of what I'm about to say, I'm sure that just moving the files into this repo would be the simplest. Submodules are always a hassle.

Re tracking the branch: as long as the master branch can be guaranteed to be good, this seems like it could work. I can do a demo with a toy project if that would help move this along, e.g. Aparent has Achild as an existing submodule, I update Aparent to track Achild master, then push changes to Achild master and see how that affects the parent. I'd need to test how this works with forks and out-of-date local repos as well. git fetch --recurse-submodules may solve all issues, I just don't know.

If this seems like a good idea to try out, perhaps someone can assign this issue to me.
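For reference, the toy experiment described above could be sketched roughly as follows. This is only a minimal sketch under assumptions: it runs in a throwaway temp directory, Aparent and Achild are the placeholder names from this comment, data.txt is a made-up file, and `git submodule set-branch` needs Git 2.22 or newer. It does not cover forks or out-of-date local clones.

```shell
#!/bin/sh
# Toy check: Aparent has Achild as a submodule and is told to track Achild's master.
set -eu

work=$(mktemp -d)
cd "$work"

# Throwaway identity so commits work on a machine with no git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Recent Git versions refuse local-path submodule URLs unless this is allowed.
allow="-c protocol.file.allow=always"

# Achild stands in for coronavirus-data-sources.
git init -q Achild
git -C Achild symbolic-ref HEAD refs/heads/master
echo "v1" > Achild/data.txt
git -C Achild add data.txt
git -C Achild commit -q -m "child: v1"

# Aparent stands in for the main repo.
git init -q Aparent
git -C Aparent symbolic-ref HEAD refs/heads/master
echo "parent" > Aparent/README
git -C Aparent add README
git -C Aparent commit -q -m "parent: initial commit"
git -C Aparent $allow submodule --quiet add "$work/Achild" Achild
git -C Aparent commit -q -m "parent: add Achild submodule"

# Record in .gitmodules that the submodule should follow master.
git -C Aparent submodule set-branch --branch master -- Achild
git -C Aparent commit -q -am "parent: track Achild master"

# New work lands on Achild's master.
echo "v2" > Achild/data.txt
git -C Achild commit -q -am "child: v2"

# The parent still has v1 checked out until update --remote is run.
before=$(cat Aparent/Achild/data.txt)
git -C Aparent $allow submodule --quiet update --remote
after=$(cat Aparent/Achild/data.txt)

# The new submodule commit still has to be committed in the parent.
git -C Aparent commit -q -am "parent: bump Achild to latest master"
echo "before=$before after=$after"
```

One thing the sketch makes visible: setting the branch does not make the parent follow the child on its own. The parent keeps recording a specific commit, so someone (or CI) still has to run the update and commit the result.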

There is another option which I've heard about, but have never tried: git subtree. Ref https://www.atlassian.com/git/tutorials/git-subtree. Excerpt:

Why you may want to consider git subtree
- Management of a simple workflow is easy.
- Older version of Git are supported (even older than v1.5.2).
- The sub-project’s code is available right after the clone of the super project is done.
- git subtree does not require users of your repository to learn anything new. They can ignore the fact that you are using git subtree to manage dependencies.
- git subtree does not add new metadata files like git submodule does (i.e., .gitmodule).
- Contents of the module can be modified without having a separate repository copy of the dependency somewhere else.

Drawbacks (but in our opinion they're largely acceptable):
- You must learn about a new merge strategy (i.e. git subtree).
- Contributing code back upstream for the sub-projects is slightly more complicated.
- The responsibility of not mixing super and sub-project code in commits lies with you.

As long as the subtree is only managed by a few sharp minds, it might be acceptable ... but it's yet another thing to learn, yet another thing to go wrong. With the current pace of dev and delivery, it probably just adds too much unnecessary risk.
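To make the trade-off a bit more concrete, here is a rough sketch of what the subtree flow might look like on the same kind of toy setup. Assumptions: the git subtree command is installed (it ships as a contrib script and is missing from some Git packages), Aparent, Achild and data.txt are placeholder names, and everything runs in a throwaway temp directory. I have not tried this against the real repos.

```shell
#!/bin/sh
# Toy check: pull Achild into Aparent with git subtree instead of a submodule.
set -eu

work=$(mktemp -d)
cd "$work"

# Throwaway identity so commits work on a machine with no git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# Keep merges from trying to open an editor.
export GIT_MERGE_AUTOEDIT=no

# Achild stands in for coronavirus-data-sources.
git init -q Achild
git -C Achild symbolic-ref HEAD refs/heads/master
echo "v1" > Achild/data.txt
git -C Achild add data.txt
git -C Achild commit -q -m "child: v1"

# Aparent stands in for the main repo; subtree add needs an existing commit.
git init -q Aparent
git -C Aparent symbolic-ref HEAD refs/heads/master
echo "parent" > Aparent/README
git -C Aparent add README
git -C Aparent commit -q -m "parent: initial commit"

# Copy Achild's master into Aparent/Achild as ordinary tracked files.
(cd Aparent && git subtree add --prefix=Achild "$work/Achild" master --squash)

# A plain clone already contains the files; no submodule init step is needed.
git clone -q Aparent clone
cat clone/Achild/data.txt

# Later changes in Achild are brought in with subtree pull.
echo "v2" > Achild/data.txt
git -C Achild commit -q -am "child: v2"
(cd Aparent && git subtree pull --prefix=Achild "$work/Achild" master --squash -m "parent: pull Achild master")
cat Aparent/Achild/data.txt
```

The clone step is the main selling point from the excerpt: contributors get the data files without knowing a subtree is involved. The pull step is the part that only the maintainers would need to learn.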

hyperknot commented 4 years ago

This one was done recently, right?

praging commented 4 years ago

Yes, this has been closed. @lazd