2DegreesInvesting / tiltDataPipelines


Testing integration of storage & packages #84

Closed: ysherstyuk closed this issue 1 month ago

ysherstyuk commented 11 months ago

In order to execute the current data flow process, we need to integrate three packages (tiltIndicatorBefore, tiltIndicator and tiltIndicatorAfter) with the storage.

The idea is that we are able to:

All the data required by the packages is already onboarded to the cloud storage. The data flow should be quite helpful for understanding how the three packages connect to the data stored in the cloud.

So basically what we need to do is test the connection between the storage and the packages and make the corresponding adjustments inside the packages.

@maurolepore since you are the one who created the packages, we would really appreciate support from your side :)

maurolepore commented 11 months ago

@ysherstyuk can you please explain what you're asking me to do?

ysherstyuk commented 11 months ago

Hi @maurolepore, let's discuss it during the tech weekly next Tuesday.

maurolepore commented 11 months ago

The tiltIndicator package and the tiltIndicatorAfter package don't read or write anything. They work with data frames. You may read the data from wherever into an R data frame, then pass it to the packages.

And I confirm I can read a .csv file from an Azure storage container into an R data frame. I show that in the latest ds-incubator: https://youtu.be/-HTH2ylnT7Q?si=xR6MQXU2Cvn8pkrc

Have you tried the examples on the packages' websites? That should give you a good grasp of how the data flows into tiltIndicator and tiltIndicatorAfter.

I didn't develop or review tiltIndicatorBefore. Check with Bob or Kalash.

ysherstyuk commented 11 months ago

Thanks for your response, Mauro! That is good to know and imo makes things easier. Can I also read parquet files the same way? And do I understand correctly that, since tiltIndicator and tiltIndicatorAfter do not read or write the files themselves, I can simply write Python code to retrieve the data from the storage and use it as input to the packages? Is this also how you do it, but retrieving files from the local environment?

maurolepore commented 11 months ago

The expert in Azure storage and Databricks is you :-) but what I learned about Databricks while developing the ds-incubator suggests the Databricks Catalog already exposes some parquet files stored in Azure storage. So if you want to read a parquet file already available in the Databricks Catalog -- say the country dataset -- you could do it in R directly from the Databricks environment with something like this:

SparkR::tableToDF("raw.default.country")

This returns a Spark DataFrame, but it should be easy to turn it into a plain R data frame. See https://github.com/2DegreesInvesting/ds.databricks4r#catalog
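
For example, a minimal sketch of that conversion (assuming it runs in a Databricks notebook where SparkR is available, using the same country table as above):

library(SparkR)

# tableToDF() reads a table from the Databricks Catalog as a Spark DataFrame
spark_df <- SparkR::tableToDF("raw.default.country")

# collect() pulls the data to the driver as a plain R data frame
country <- SparkR::collect(spark_df)
head(country)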

The tiltIndicator package is an R package, so it works with R data frames. If you want to use Python code you may pass the data frame from Python to R with the reticulate R package, but doing it in R is so easy that it sounds like an unnecessary complication -- just call SparkR::tableToDF("path.to.dataset.in.the.databricks.catalog"). Also, using parquet seems complicated, since the datasets are fairly small.
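
That said, if part of the pipeline stays in Python, a rough sketch of the reticulate route would look like this (run from R; the script name get_data.py and the object name companies are placeholders, not real files in this repo):

library(reticulate)

# source_python() runs a Python script and exposes the objects it creates to R;
# pandas DataFrames are converted to plain R data frames automatically
source_python("get_data.py")  # placeholder script that builds a pandas DataFrame named `companies`
head(companies)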

ysherstyuk commented 11 months ago

Hi Mauro, thanks for referring us to the video! I have watched the whole video; however, I still did not see the part about passing the data from the storage to the tiltIndicator package. I see the video explains quite well how to use the R environment within Databricks, how to publish repos within Databricks, and how to access data from the Azure storage using R code in Databricks.

What is still not very clear is how we can run the tiltIndicator package in Databricks while passing data from the storage. For example, how do we specify all the files we want to pass to the package, where do we access the outputs of the tiltIndicator package, and are they also stored as R data frames?

maurolepore commented 11 months ago

The ds-incubator is about Databricks, not about the tiltIndicator package. In the video I read the iris.csv dataset:

library(AzureStor)

# usethis::edit_r_environ()
# User delegation key with all 10 permissions
# AZURE_CONTAINER_SAS_TEST_MAURO="paste the SAS token here"
url <- "https://storagetiltdevelop.blob.core.windows.net/test-mauro"
sas <- Sys.getenv("AZURE_CONTAINER_SAS_TEST_MAURO")
container <- blob_container(url, sas = sas)

storage_read_csv(container, "data/iris.csv")

But that is no different than reading the real datasets that the tiltIndicator package needs. For example, the function tiltIndicator::emissions_profile() takes two datasets: companies and products. If you store them in an Azure storage container as companies.csv and products.csv, then you can read them as I show above and then pass them to emissions_profile().
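
Roughly like this sketch (reusing the container from the snippet above; the data/ paths for companies.csv and products.csv are placeholders for wherever you store the files):

library(AzureStor)
library(tiltIndicator)

url <- "https://storagetiltdevelop.blob.core.windows.net/test-mauro"
sas <- Sys.getenv("AZURE_CONTAINER_SAS_TEST_MAURO")
container <- blob_container(url, sas = sas)

# Read each input dataset into a plain R data frame
companies <- storage_read_csv(container, "data/companies.csv")
products <- storage_read_csv(container, "data/products.csv")

# Pass the data frames straight to the package function
result <- emissions_profile(companies, products)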