Mosqlimate-project / Data-platform

Mosqlimate data Platform.
GNU General Public License v3.0

[API - Predictions] Template for forecast models #79

Open eduardocorrearaujo opened 1 year ago

eduardocorrearaujo commented 1 year ago

Even though we will have to deal with different classes of models, since our first product is the comparison of forecast models, I think we should create a template for the forecasts that are sent. My suggestion is that every dataset should have at least the following columns:

  • dates: the date of each forecasted value;
  • preds: the predictions;
  • lower: the lower bound of the CI (NaN if the model didn't provide a CI);
  • upper: the upper bound of the CI (NaN if the model didn't provide a CI);
  • adm_2: in the case of Brazil, the 7-digit IBGE code of the forecasted city;
  • adm_1: in the case of Brazil, the two-letter UF code of the forecasted state;
  • adm_0: the ISO code of the forecasted country (BR for Brazil).

For the adm columns, we would consider that the prediction refers to the highest-numbered adm column (between 0 and 2) that has a value filled in. For example, if the user's model forecasts for all of Brazil (aggregated), they should fill: adm_0 = "BR", adm_1 = NaN, adm_2 = NaN. If the prediction is for a specific state in Brazil (Paraná, for example): adm_0 = "BR", adm_1 = "PR", adm_2 = NaN. If the prediction is for a specific city in Brazil (Fortaleza, for example): adm_0 = "BR", adm_1 = "CE", adm_2 = 2304400.
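The rule above can be sketched in a few lines of Python. This is only an illustration of the proposal, not platform code: the helper names `_is_filled` and `resolve_adm_level` are hypothetical, and a record is assumed to be a plain dict keyed by the adm column names.

```python
import math


def _is_filled(value):
    """Return True unless the value is None or NaN."""
    if value is None:
        return False
    if isinstance(value, float) and math.isnan(value):
        return False
    return True


def resolve_adm_level(row):
    """Return the highest adm level (2, 1 or 0) whose column is filled."""
    for level in (2, 1, 0):
        if _is_filled(row.get(f"adm_{level}")):
            return level
    raise ValueError("at least adm_0 must be filled")


nan = float("nan")
country = {"adm_0": "BR", "adm_1": nan, "adm_2": nan}    # all of Brazil
state = {"adm_0": "BR", "adm_1": "PR", "adm_2": nan}     # Paraná
city = {"adm_0": "BR", "adm_1": "CE", "adm_2": 2304400}  # Fortaleza

levels = [resolve_adm_level(r) for r in (country, state, city)]
print(levels)  # [0, 1, 2]
```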

What do you think @fccoelho?

eduardocorrearaujo commented 1 year ago

Also, I have some concerns about how this data should be sent, since JSON is a really flexible format.

So far, what I have done is send the predict element below:

(screenshot: Screen Shot 2023-09-28 at 19 12 14)

In the case above, I had a dataset with the predictions for all the capitals of the northeastern states and sent it in a single JSON (using the function in the docs). Should I have sent a separate request for each city?
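If separate requests per city turn out to be preferred, the dataset could be split before sending. The sketch below only builds one JSON payload per city and makes no API call. The values are made up for illustration, the column names follow the template proposed in this issue, and the `payloads` dict is a hypothetical structure.

```python
import json

import pandas as pd

# Illustrative values only: two capitals (Fortaleza and Recife), two dates each.
df = pd.DataFrame(
    {
        "dates": ["2023-10-01", "2023-10-01", "2023-10-08", "2023-10-08"],
        "preds": [120.0, 80.0, 130.0, 75.0],
        "lower": [100.0, 60.0, 105.0, 55.0],
        "upper": [140.0, 100.0, 155.0, 95.0],
        "adm_2": [2304400, 2611606, 2304400, 2611606],
    }
)

# One JSON string per city; the geocode becomes the key instead of a column.
payloads = {
    int(geocode): group.drop(columns="adm_2").to_json(orient="records")
    for geocode, group in df.groupby("adm_2")
}

for geocode, payload in payloads.items():
    print(geocode, len(json.loads(payload)), "records")
```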

eduardocorrearaujo commented 1 year ago

If we should send a separate request for every region that we predict (adm_0, adm_1 or adm_2), maybe it would be interesting to add, **when posting the prediction to the database***, the adm value associated with the prediction. For example, if the user's model forecasts for all of Brazil (aggregated), they should fill: adm_0 = "BR", adm_1 = NaN, adm_2 = NaN. If the prediction is for a specific state in Brazil (Paraná, for example): adm_0 = "BR", adm_1 = "PR", adm_2 = NaN. If the prediction is for a specific city in Brazil (Fortaleza, for example): adm_0 = "BR", adm_1 = "CE", adm_2 = 2304400. In this case we could remove the adm columns from the JSON of the predictions.

One advantage of this approach is that, when viewing all the predictions as shown below, we could have a filter to see all the forecasts related to a specific city or state. I think it would make the comparison of models easier.

(screenshot: Screen Shot 2023-09-28 at 19 24 57)

* The parameters associated with sending the predictions are shown here: https://api.mosqlimate.org/docs/registry/GET/predictions/#parameters_table

fccoelho commented 1 year ago

I agree with this idea, @eduardocorrearaujo, but I am not sure we can currently search by ADM level. Can we, @luabida?

Such a template needs to be clearly described in the documentation together with code snippets for Python and R so that people will follow the recommendations.

Later we can create Python and R client libraries that can analyze the JSON structure of the predictions and remind the user to adhere to the recommended template.
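As a rough idea of what such a client-side check might look like, here is a minimal sketch. It assumes the column names from the template proposed earlier in this issue; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names, not part of any existing Mosqlimate library.

```python
import json

# Columns from the template proposed in this issue (an assumption, not a final spec).
REQUIRED_COLUMNS = {"dates", "preds", "lower", "upper", "adm_0", "adm_1", "adm_2"}


def missing_columns(prediction_json):
    """Map each record index to the sorted template columns that record lacks."""
    records = json.loads(prediction_json)
    problems = {}
    for i, record in enumerate(records):
        missing = REQUIRED_COLUMNS - set(record)
        if missing:
            problems[i] = sorted(missing)
    return problems


good = json.dumps(
    [{"dates": "2023-10-01", "preds": 120.0, "lower": None, "upper": None,
      "adm_0": "BR", "adm_1": "CE", "adm_2": 2304400}]
)
bad = json.dumps([{"dates": "2023-10-01", "preds": 120.0}])

print(missing_columns(good))  # {}
print(missing_columns(bad))   # {0: ['adm_0', 'adm_1', 'adm_2', 'lower', 'upper']}
```

A library could print these problems as a reminder instead of rejecting the submission, which matches the "remind the user" intent above.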

eduardocorrearaujo commented 1 year ago

> I agree with this idea, @eduardocorrearaujo, but I am not sure we can currently search by ADM level. Can we, @luabida?
>
> Such a template needs to be clearly described in the documentation together with code snippets for Python and R so that people will follow the recommendations.
>
> Later we can create Python and R client libraries that can analyze the JSON structure of the predictions and remind the user to adhere to the recommended template.

This option to search by adm is not implemented. It would be necessary to add the parameter adm in the prediction object. In this case, I can see two possibilities:

fccoelho commented 1 year ago

I think we should just add a field called ADM_level to the prediction table to indicate which geographical division the prediction refers to.

As to how the polygons should be identified within the JSON, we can address this in the documentation with examples that work with our visualization library.

fccoelho commented 1 year ago

@luabida can you move forward with this solution?

eduardocorrearaujo commented 1 year ago

After issue #116, I propose that every prediction of a forecast model must have at least the following columns:

  • date: the date of each forecasted value;
  • preds: the predictions;
  • lower: the lower bound of the CI (NaN if the model didn't provide a CI);
  • upper: the upper bound of the CI (NaN if the model didn't provide a CI);
  • geocode: if ADM_LEVEL (a field filled in the model registry) is equal to 2, this column contains the 7-digit IBGE code of the forecasted city; if ADM_LEVEL is equal to 1, it contains the UF code (two letters) or the two-digit code of the forecasted state; if ADM_LEVEL is equal to 0, it contains the ISO code of the forecasted country (BR for Brazil).
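To make the geocode rule concrete, here is a small sketch of how a geocode could be checked against ADM_LEVEL. It only checks the shape of the value (number of digits or letters), not whether the code actually exists, and the function name `geocode_is_valid` is hypothetical.

```python
import re


def geocode_is_valid(geocode, adm_level):
    """Check that a geocode has the shape expected for the given ADM_LEVEL."""
    text = str(geocode)
    if adm_level == 2:
        # 7-digit IBGE municipality code
        return re.fullmatch(r"\d{7}", text) is not None
    if adm_level == 1:
        # two-letter UF code or two-digit state code
        return re.fullmatch(r"[A-Z]{2}|\d{2}", text) is not None
    if adm_level == 0:
        # two-letter ISO country code
        return re.fullmatch(r"[A-Z]{2}", text) is not None
    raise ValueError(f"unsupported ADM_LEVEL: {adm_level}")


print(geocode_is_valid(2304400, 2))  # True
print(geocode_is_valid("CE", 1))     # True
print(geocode_is_valid("CE", 2))     # False
```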

Also, if df is the dataframe with the columns above, it can be transformed into JSON format using the code below:

df_in_json_format = df.to_json(orient="records", date_format="iso")

Furthermore, this JSON can be transformed back into a dataframe using the snippet below:

import json
import pandas as pd

json_struct = json.loads(df_in_json_format)  # list of record dicts
df_flat = pd.json_normalize(json_struct)  # replaces the deprecated pd.io.json.json_normalize
df_flat["date"] = pd.to_datetime(df_flat["date"])  # restore the datetime dtype
df_flat.head()

With these changes, I think we can close this issue. What do you think, @fccoelho?

fccoelho commented 1 year ago

I think that is a good template. I would only add a requirement for the geocode to always be numeric, except for ADM_0. In GADM they have an ISO_1 variable that, for Brazil, looks like this:

  • BR-AC for Acre
  • BR-AM for Amazonas, etc.

For municipalities, GADM has the 7-digit geocode in a variable called CC_2.

eduardocorrearaujo commented 1 year ago

> I think that is a good template. I would only add a requirement for the geocode to always be numeric, except for ADM_0. In GADM they have an ISO_1 variable that, for Brazil, looks like this:
>
>   • BR-AC for Acre
>   • BR-AM for Amazonas, etc.
>
> For municipalities, GADM has the 7-digit geocode in a variable called CC_2.

Is there a number equivalent to BR-AC in GADM?

fccoelho commented 1 year ago

> Is there a number equivalent to BR-AC in GADM?

There is a field called CC_1, but it is filled with NA.
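Since CC_1 is empty, one possible source for a numeric state code is IBGE's two-digit UF code, which is also the prefix of the 7-digit municipality geocode (23 for Ceará, as in Fortaleza's 2304400). Whether the platform adopts these codes is not decided in this thread. The sketch below maps a GADM ISO_1 value to that number; the function name `iso1_to_uf_number` is hypothetical.

```python
# IBGE two-digit codes for the 26 states and the Federal District.
IBGE_UF_CODES = {
    "RO": 11, "AC": 12, "AM": 13, "RR": 14, "PA": 15, "AP": 16, "TO": 17,
    "MA": 21, "PI": 22, "CE": 23, "RN": 24, "PB": 25, "PE": 26, "AL": 27,
    "SE": 28, "BA": 29,
    "MG": 31, "ES": 32, "RJ": 33, "SP": 35,
    "PR": 41, "SC": 42, "RS": 43,
    "MS": 50, "MT": 51, "GO": 52, "DF": 53,
}


def iso1_to_uf_number(iso_1):
    """Convert a GADM ISO_1 value such as 'BR-AC' to the IBGE UF number."""
    country, _, uf = iso_1.partition("-")
    if country != "BR" or uf not in IBGE_UF_CODES:
        raise ValueError(f"unknown Brazilian ISO_1 code: {iso_1!r}")
    return IBGE_UF_CODES[uf]


print(iso1_to_uf_number("BR-AC"))  # 12
print(iso1_to_uf_number("BR-AM"))  # 13
```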

eduardocorrearaujo commented 1 year ago

@fccoelho I talked with Leo, and he said that his model generates predictions for macroregions. Should we add macroregion as a new option for the ADM level?

eduardocorrearaujo commented 1 year ago

Also, Leo said his model can generate predictions by macroregion, UF, and BR, and by week or year. In this case, should we move adm_level and periodicity to the prediction registry instead of the model registry, or is this a special case?

fccoelho commented 1 year ago

> Also, Leo said his model can generate predictions by macroregion, UF, and BR, and by week or year. In this case, should we move adm_level and periodicity to the prediction registry instead of the model registry, or is this a special case?

No, in this case, it is best that the author registers separate instances of the "same" model for each target configuration.
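As an illustration of what "separate instances" could mean for a model that predicts at two ADM levels and two periodicities, the sketch below enumerates the combinations. The field names `name`, `adm_level` and `periodicity`, the `example-model` label, and the periodicity values are hypothetical and do not reflect the actual registry schema.

```python
from itertools import product

adm_levels = [0, 1]                  # BR and UF
periodicities = ["weekly", "yearly"]

# One registry entry per (ADM level, periodicity) target configuration.
model_instances = [
    {
        "name": f"example-model-adm{level}-{period}",
        "adm_level": level,
        "periodicity": period,
    }
    for level, period in product(adm_levels, periodicities)
]

for instance in model_instances:
    print(instance["name"])
```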

eduardocorrearaujo commented 1 year ago

> No, in this case, it is best that the author registers separate instances of the "same" model for each target configuration.

Great! And what about the macro-region option for ADM_level?

fccoelho commented 1 year ago

There is no equivalent of macro-regions in GADM.org, so we need to think a little more about how to support them. Maybe if an author wants to support geographical scales other than ADM 0, 1, 2 and 3, those should be left outside of the platform.

fccoelho commented 1 year ago

> Is there a number equivalent to BR-AC in GADM?

No.