PIP-Technical-Team / pip_ingestion_pipeline

pipeline regression #21

Closed tonyfujs closed 3 years ago

tonyfujs commented 3 years ago

Looks like there has been a regression in the pipeline output: the estimated-means table has missing values for both the predicted_mean_ppp and the ppp columns.

cc @Aeilert @randrescastaneda

tonyfujs commented 3 years ago

There may also be a mismatch between the survey names in the interpolated-means table (cache_id) and the available surveys. I have not double-checked this, though.
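
A name mismatch of this kind could be checked with a simple set difference. The sketch below is only an illustration: the function name and the placeholder identifiers are hypothetical, and the real cache_id values would come from the pipeline tables.

```python
def unmatched_ids(table_ids, available_ids):
    """Return ids that appear in the table but not among the available surveys."""
    return sorted(set(table_ids) - set(available_ids))


# Placeholder identifiers for illustration only; not real cache_id values.
interpolated_cache_ids = ["survey_a", "survey_b", "survey_c"]
available_survey_ids = ["survey_a", "survey_c", "survey_d"]

missing_surveys = unmatched_ids(interpolated_cache_ids, available_survey_ids)
```

An empty result would mean every cache_id in the table has a matching survey.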

Aeilert commented 3 years ago

Hi @tonyfujs,

A brief investigation here:

  1. It looks like all of the missing survey_mean_lcu and survey_mean_ppp values are from grouped-data surveys. My guess is that something goes wrong when the values from gdm.fst / gd_means are passed to pipdm::db_compute_survey_mean().
  2. There are some rows where predicted_mean_ppp is missing while survey_mean_ppp is present, but this is likely because these means are interpolated between micro and grouped-data surveys (and since the latter is NA, the interpolated mean will also be NA).
  3. The missing ppp values only occur for CHN, IND and IDN. This could be a merge problem between the PPP table and the LCU table. Some specific code has been added for these countries in pipdm::db_create_dsm_table(), but I'm not completely sure what it does.
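
The breakdown in the list above amounts to counting missing values per column, grouped by survey type or by country. A minimal sketch of that kind of tally follows. It assumes rows are plain dictionaries with None standing in for NA; the column names country_code and distribution_type and all of the values are made up for illustration, not actual pipeline output.

```python
from collections import Counter


def count_missing_by(rows, column, key):
    """Count rows where `column` is None, grouped by the value of `key`."""
    return dict(Counter(row[key] for row in rows if row.get(column) is None))


# Made-up rows for illustration only; column names other than
# survey_mean_ppp and ppp are assumptions.
rows = [
    {"country_code": "AAA", "distribution_type": "micro", "survey_mean_ppp": 5.2, "ppp": 1.9},
    {"country_code": "BBB", "distribution_type": "group", "survey_mean_ppp": None, "ppp": 3.1},
    {"country_code": "CHN", "distribution_type": "group", "survey_mean_ppp": None, "ppp": None},
    {"country_code": "IND", "distribution_type": "micro", "survey_mean_ppp": 4.0, "ppp": None},
]

missing_mean_by_type = count_missing_by(rows, "survey_mean_ppp", "distribution_type")
missing_ppp_by_country = count_missing_by(rows, "ppp", "country_code")
```

If every missing mean falls under a single survey type, as in point 1, the tally has a single key.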

Aeilert commented 3 years ago

Hi @tonyfujs,

After I re-ran the pipeline yesterday, the problem with missing survey_mean_lcu and survey_mean_ppp values appears to have gone away. I did some debugging to figure out what went wrong, but I haven't been able to pinpoint a specific issue.

The remaining missing ppp and predicted_mean_ppp values are a result of adding national pop_data_level rows for CHN, IND and IDN. ppp is set to NA when these rows are added to the survey_means table, and predicted_mean_ppp is also set to NA in pipdm:::db_finalize_ref_year_table(). (If I remember correctly, we talked to Andres about not providing predicted means for interpolated rows based on aggregations.)
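
One way to confirm that the remaining NAs are all of this expected kind is to filter them out and see whether anything is left. The sketch below is a hypothetical check, not part of pipdm: it assumes rows are dictionaries with None for NA, a country_code column, and "national" as the pop_data_level value for the added rows.

```python
# Countries taken from the comment above; the row layout is an assumption.
EXPECTED_COUNTRIES = {"CHN", "IND", "IDN"}


def is_expected_gap(row):
    """True for the national-level rows that are expected to have NA values."""
    return (
        row["country_code"] in EXPECTED_COUNTRIES
        and row["pop_data_level"] == "national"
    )


def unexpected_missing(rows, column):
    """Return rows where `column` is None for a reason other than the expected one."""
    return [row for row in rows if row[column] is None and not is_expected_gap(row)]


# Made-up rows for illustration only ("urban" and "XYZ" are placeholders).
rows = [
    {"country_code": "CHN", "pop_data_level": "national", "ppp": None},
    {"country_code": "CHN", "pop_data_level": "urban", "ppp": 3.5},
    {"country_code": "XYZ", "pop_data_level": "national", "ppp": None},
]

leftover = unexpected_missing(rows, "ppp")
```

Any rows that remain in leftover would point to a genuine problem rather than the intended NA values.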

tonyfujs commented 3 years ago

Thanks @Aeilert. I will re-run my integration tests using the updated data and see what happens...

randrescastaneda commented 3 years ago

Hey guys,

Did you solve this issue?

Best,

R.Andrés Castañeda

tonyfujs commented 3 years ago

Yes, @Aeilert re-ran the pipeline, and things are fine now.