Closed: pmdscully closed this 3 years ago.
@pmdscully You are talking about running it locally, right? Doing this for the plotted graphs would be difficult, as many rely on multiple sources.
Hey @djay, okay, then that's going to be a challenge... (btw, I was thinking of the GitHub Action version.)
@pmdscully For some of the redundant data sources I've already put in the ability to skip them. What specific sources are you thinking of?
One thing I was considering is your earlier suggestion to make a class to contain all the data frames, to make it easier for the same data to be got from multiple sources. This would also make it more explicit how and when the source data is mixed. That might make it easier to decide when it's ok to have one source and not another? In general though, I'm trying to make it fail so things get fixed rather than have silent breakages.
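For reference, a minimal sketch of what such a container class could look like, assuming pandas; `CovidData`, its fields and `add_cases` are hypothetical names, not the repo's actual ones:

```python
from dataclasses import dataclass, field

import pandas as pd


@dataclass
class CovidData:
    """Hypothetical container for the per-topic dataframes, so all
    mixing of sources happens in one place."""
    cases: pd.DataFrame = field(default_factory=pd.DataFrame)
    tests: pd.DataFrame = field(default_factory=pd.DataFrame)

    def add_cases(self, new: pd.DataFrame) -> None:
        # Later sources only fill in values missing from earlier ones,
        # so the order of the calls makes the mixing explicit.
        self.cases = self.cases.combine_first(new)
```

Each scraper would then just return its dataframe and the class would decide how it gets merged, which would also give a single place to record where each value came from.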
> In general though, I'm trying to make it fail so things get fixed rather than have silent breakages.
That makes sense. I think that, more or less, closes the feature request as well.
@djay
> One thing I was considering is your earlier suggestion to make a class to contain all the data frames, to make it easier for the same data to be got from multiple sources. This would also make it more explicit how and when the source data is mixed. That might make it easier to decide when it's ok to have one source and not another?
I see a couple of directions here: either data lineage (i.e. all data value sources are traceable) or managing the `df` merges in a Python object. I think you mean the data lineage issue, and if so, this is an ongoing problem. In my case, I still use the data design from the covid19.th-stat.com dataset, yet the data (since the site went down) is from me entering it. So the "source" of the entered data values changed over time. In this case, distinguishing the source is quite simple (before/after a date). Though, the sources in this repo are (I guess) getting quite complex.
Practically, this is probably easier to solve with either:

- a meta table per dataframe with `source-date-to/from`, or
- a meta column per dataframe with a `column-source-id` plus a sources table -- i.e. every column has a value and an integer `id_to_source`, which points to a table of sources: `sources_table -> {'id', 'url to the source'}`
In my opinion, it's worth looking around on this issue further, as both (data lineage and py-object `df` merge management) are going to be quite complex to code.
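A minimal sketch of the second option (meta column plus sources table), assuming pandas; `SOURCES` and `tag_source` are hypothetical names, just to illustrate the idea:

```python
import pandas as pd

# Hypothetical registry of sources: integer id -> where the values came from.
SOURCES = {
    1: "https://covid19.th-stat.com",  # original dataset (site now down)
    2: "manual entry",                 # values typed in after the site went down
}


def tag_source(df: pd.DataFrame, column: str, source_id: int) -> pd.DataFrame:
    """Add a parallel '<column>_source_id' meta column recording the source
    of every value currently in `column`."""
    out = df.copy()
    out[f"{column}_source_id"] = source_id
    return out

# Usage: every value in "Cases" is then traceable back to SOURCES[1].
# df = tag_source(df, "Cases", 1)
```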
@pmdscully I put in a basic version of tracing, at least for the briefing data, or some of it. There is a "Source Cases" col now in the new export - https://github.com/djay/covidthailand#daily-ccsa-briefings-. If there is a more automated way to trace all sources of all values, that would be good. I use combine_first extensively to combine the data. To do it automatically I'd need a replacement that keeps track of every single field that gets inserted.
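A rough sketch of what such a combine_first replacement might look like, assuming pandas; `combine_first_traced` and the trace structure are only illustrative, not existing code in the repo:

```python
import pandas as pd


def combine_first_traced(base: pd.DataFrame, other: pd.DataFrame,
                         source: str, trace: dict) -> pd.DataFrame:
    """Like DataFrame.combine_first, but records which (index, column) cells
    were filled in from `other`, keyed by the source name."""
    combined = base.combine_first(other)
    # A cell came from `other` if it is present in the result
    # but was missing (or absent) in `base`.
    filled = combined.notna() & base.reindex_like(combined).isna()
    trace.setdefault(source, []).extend(
        (idx, col) for col in filled.columns for idx in filled.index[filled[col]]
    )
    return combined

# Usage (replacing the plain combine_first calls):
# trace = {}
# df = combine_first_traced(df, briefing_df, "briefing", trace)
```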
I'll close this for now. There are probably a few things that make it hang that I can fix, like the curl action, but I'll deal with those one by one. Generally the main reason it stops working is assertions when parsing isn't working, and I'd prefer it to break and get fixed in those cases.
Problem: Data sources change sometimes, and currently exceptions are handled by the `main` python process, so any exception will cause runs to exit with a fatal error (fail early).

The feature request is:

Proposed solution: Move the try-excepts (per `df`, or per dataset source/url) into `__main__`, around each function call (i.e. the entrance into each call graph chain), and have `main` continue to handle exceptions.

Expected Outcomes: A failure in one data source no longer aborts the whole run; the remaining datasets can still be produced.
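A minimal sketch of the proposed structure; `get_cases` and `get_tests` are placeholder scrapers, not the repo's real function names:

```python
import logging

# Placeholder scrapers standing in for the real per-dataset functions.
def get_cases():
    raise ValueError("parse failed")  # simulate a broken source


def get_tests():
    return {"tests": 123}


def main():
    scrapers = {"cases": get_cases, "tests": get_tests}
    results = {}
    for name, scrape in scrapers.items():
        try:
            # Each call is the entrance into its own call-graph chain.
            results[name] = scrape()
        except Exception:
            # One broken source no longer aborts the whole run;
            # the failure is logged so it still gets noticed and fixed.
            logging.exception("Scraping %r failed, skipping", name)
    return results


if __name__ == "__main__":
    print(main())
```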