djay / covidthailand

Thailand Covid testing and case data gathered and combined from various sources for others to download or view

💡 Feature Request: Allow dataset collection errors to occur without fatal errors. I.e. change from Fail Early to Fail Partially. #31

Closed · pmdscully closed this issue 3 years ago

pmdscully commented 3 years ago

Problem: Data sources change from time to time, and currently all exceptions propagate up to the main Python process, so any exception causes the run to exit with a fatal error (fail early).

The feature request is: allow individual dataset collection errors to occur without causing a fatal error for the whole run, i.e. change from fail early to fail partially.

Proposed solution:

  1. Decouple data collection functions (or chains of functions), e.g. per output dataset/df or per dataset source/url:
    • i.e. ensure one chain of functions handles one thing/source/dataset.
    • i.e. break apart the main call graph so that each dataset collection is independent of the other dataset collections.
    • Specifically, so that one dataset collection can fail independently without holding up the others.
  2. Execute each chain independently, either by:
    • (i) adding try/excepts in __main__ around each function call (i.e. the entrance into each call-graph chain), as sketched below, or
    • (ii) creating a separate process for each chain of functions and letting main continue to handle exceptions.

Expected outcome: a failure in one dataset collection is logged and reported, while the remaining dataset collections still run to completion.
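A minimal sketch of option 2(i), assuming the run is driven from a single `main()`. The collection functions here (`get_cases`, `get_tests`, `get_vaccinations`) are hypothetical stand-ins for the repo's actual entry points:

```python
import logging

# Hypothetical stand-ins for the repo's real collection entry points.
def get_cases(): ...
def get_tests(): ...
def get_vaccinations(): ...

def main():
    collections = {
        "cases": get_cases,
        "tests": get_tests,
        "vaccinations": get_vaccinations,
    }
    failures = []
    for name, collect in collections.items():
        try:
            collect()
        except Exception:
            # Fail partially: record the failure and keep going.
            logging.exception("Collection of %s failed", name)
            failures.append(name)
    if failures:
        # Still exit non-zero so CI surfaces the breakage to be fixed.
        raise SystemExit(f"Failed collections: {failures}")

if __name__ == "__main__":
    main()
```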

djay commented 3 years ago

@pmdscully You are talking about running it locally, right? Doing this for the plotted graphs would be difficult as many rely on multiple sources.

pmdscully commented 3 years ago

> @pmdscully You are talking about running it locally, right? Doing this for the plotted graphs would be difficult as many rely on multiple sources.

Hey @djay, okay, then that's going to be a challenge... (btw, I was thinking of the GitHub Actions version.)

djay commented 3 years ago

@pmdscully For some of the data sources that are redundant, I've already put in the ability to skip them. Which specific sources are you thinking of?

djay commented 3 years ago

One thing I was considering is your earlier suggestion to make a class to contain all the data frames, to make it easier for the same data to be got from multiple sources. It would also make it more explicit how and when the source data is mixed. That might make it easier to decide when it's ok to have one source and not another? In general though, I'm trying to make it fail so things get fixed rather than have silent breakages.
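For what that might look like, here is a loose sketch of such a container; the class and method names are made up for illustration, not taken from the repo:

```python
import pandas as pd

class CovidData:
    """Hypothetical container for all the dataframes; every merge
    goes through one method, so where sources mix is explicit."""

    def __init__(self):
        self.cases = pd.DataFrame()

    def add_cases(self, df: pd.DataFrame, source: str) -> None:
        # combine_first keeps existing values and only fills gaps,
        # so the order sources are added in encodes their priority.
        print(f"mixing in cases from {source}")
        self.cases = self.cases.combine_first(df)
```

Because every source passes through one method, a later policy like "ok with this source missing but not that one" becomes a check in one place instead of being scattered through the call graph.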


pmdscully commented 3 years ago

> In general though, I'm trying to make it fail so things get fixed rather than have silent breakages.

That makes sense. I think that, more or less, closes the feature request as well.

pmdscully commented 3 years ago

@djay

> One thing I was considering is your earlier suggestion to make a class to contain all the data frames, to make it easier for the same data to be got from multiple sources. It would also make it more explicit how and when the source data is mixed. That might make it easier to decide when it's ok to have one source and not another?

I see a couple of directions here...

Either,

  1. manage dataframes within complicated py-object containers (I am not aware of an existing library for this, but one might exist), or
  2. maintain a database of sources associated with cell data values to ensure data lineage, i.e. so that all data value sources are traceable.

I think you mean the data lineage issue, and if so, this is an ongoing problem. In my case, I still use the data design from the covid19.th-stat.com dataset, yet the data (since the site went down) comes from me entering it. So the "source" of the entered data values has changed over time. In that case, distinguishing the source is quite simple (before/after a date). The sources in this repo, though, are (I guess) getting quite complex.

Practically, this is probably easier to solve with either:

  1. a meta table per dataframe with a source date range (from/to):

    • df_identifier
    • column_name
    • from date
    • to date
    • url of the source
  2. a meta column per dataframe with a column-source id plus a sources table, _i.e. every column has a value and an integer id_to_source, which points into a table of sources_ (see the sketch below):

    • column_name
    • column_source_id (int) -> 'sources_table' -> {'id', 'url of the source'}

In my opinion, it's worth looking into this further, as both approaches (data lineage and py-object df merge management) are going to be quite complex to code.
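As a rough illustration of option 2, the sources-table approach in pandas; all column names and values here are made up for the example:

```python
import pandas as pd

# Shared sources table: integer id -> url of the source.
sources = pd.DataFrame(
    {"url": ["https://covid19.th-stat.com", "manual entry"]},
    index=pd.Index([0, 1], name="id"),
)

# Each data column gets a parallel *_source_id column.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2021-06-30", "2021-07-01"]),
        "cases": [4786, 5533],      # illustrative values only
        "cases_source_id": [0, 1],  # points into `sources`
    }
)

# Lineage lookup: which url supplied each value of `cases`?
df["cases_source"] = df["cases_source_id"].map(sources["url"])
print(df[["date", "cases", "cases_source"]])
```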

djay commented 3 years ago

@pmdscully I put in a basic version of tracing, at least for the briefing data or some of it. There is a "Source Cases" col now in the new export: https://github.com/djay/covidthailand#daily-ccsa-briefings-. If there were a more automated way to trace all sources of all values, that would be good. I use combine_first extensively to combine the data, so to do it automatically I'd need a replacement that keeps track of every single field that gets inserted.
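One possible shape for such a replacement, sketched with a hypothetical helper `tracked_combine_first` (not an existing function in this repo or in pandas):

```python
import pandas as pd

def tracked_combine_first(df, other, df_label, other_label):
    """Like df.combine_first(other), but also return a same-shaped
    frame of labels recording which source supplied each cell."""
    combined = df.combine_first(other)
    lineage = pd.DataFrame(index=combined.index, columns=combined.columns)
    # Label cells filled from `other`, then overwrite with `df_label`
    # wherever `df` had a value, since df takes priority.
    lineage = lineage.mask(other.reindex_like(combined).notna(), other_label)
    lineage = lineage.mask(df.reindex_like(combined).notna(), df_label)
    return combined, lineage

# Example: briefing data takes priority, dashboard fills the gaps.
idx = pd.to_datetime(["2021-07-01", "2021-07-02"])
briefing = pd.DataFrame({"cases": [5533.0, None]}, index=idx)
dashboard = pd.DataFrame({"cases": [5500.0, 5820.0]}, index=idx)
merged, lineage = tracked_combine_first(briefing, dashboard,
                                        "briefing", "dashboard")
print(lineage)  # cases: ["briefing", "dashboard"]
```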

djay commented 3 years ago

Closing this for now. There are probably a few things that make it hang that I can fix, like the curl action, but I'll deal with those one by one. Generally the main reason it stops working is assertions firing when parsing isn't working, and in those cases I'd prefer it to break and get fixed.