fabriziomoscon opened this issue 1 year ago
I have updated the README to explain the changes related to this issue, but let me summarise them here.
In this design, Airflow queries Kafka every 15 minutes to get the latest vessel locations and then sends a request to the weather web API using those locations as parameters. Once both datasets have been combined, we store the result in our data warehouse and then add it to the S3 input bucket for our Spark jobs.
For the record, we could skip the data warehouse step and store the result directly in the S3 input bucket, but I thought it was appropriate to keep the data.
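For illustration, here is a minimal sketch of how that 15-minute DAG could be wired up with the Airflow 2 TaskFlow API. The task bodies, function names, and schedule below are placeholders for this discussion, not the actual implementation in the repo:

```python
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule_interval="*/15 * * * *", start_date=datetime(2023, 1, 1), catchup=False)
def vessel_weather_ingestion():
    @task
    def get_vessel_positions():
        """Read the vessel positions published to Kafka since the last run (consumer omitted)."""
        return []

    @task
    def enrich_with_weather(positions):
        """Call the weather web API for each vessel position and merge the response into the record."""
        return positions

    @task
    def store(records):
        """Persist the combined records to the warehouse and copy them to the S3 input bucket for Spark."""
        pass

    store(enrich_with_weather(get_vessel_positions()))


vessel_weather_ingestion()
```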
According to your design, the weather API can be queried at specified intervals to ingest data into Postgres. Please consider that the provider stores weather data for the entire surface of the globe, for each 100 m square, every 15 minutes. Storing the full state of the API would be impossible because of the sheer amount of data to retrieve and store.
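To make the "sheer amount of data" concrete, a rough back-of-envelope estimate (the surface-area figure is an approximation):

```python
# Rough estimate of the full-globe state; Earth's surface is ~510 million km^2.
earth_surface_m2 = 510e6 * 1e6            # surface area in m^2
cells = earth_surface_m2 / (100 * 100)    # one cell per 100 m x 100 m square -> ~5.1e10
snapshots_per_day = 24 * 60 // 15         # a fresh snapshot every 15 minutes -> 96
print(f"{cells * snapshots_per_day:.1e} records per day")  # ~4.9e12
```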
To reduce the amount of weather data to store, please consider that we only need to query the web API for the weather at the vessel locations obtained from Kafka, at the corresponding vessel position timestamps.
Apply the necessary changes to your design to take these constraints into consideration.
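As one possible shape for that constrained lookup, the sketch below only calls the weather API for the positions actually read from Kafka, never for the full global grid. The topic name, consumer config, and `get_weather` function are illustrative placeholders:

```python
import json

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "weather-enrichment",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["vessel-positions"])


def get_weather(lat, lon, ts):
    """Placeholder for the weather web API call at a single point in space and time."""
    raise NotImplementedError


def poll_batch(max_messages=500, timeout_s=1.0):
    """Drain up to max_messages vessel positions from the topic."""
    records = []
    while len(records) < max_messages:
        msg = consumer.poll(timeout_s)
        if msg is None:
            break
        if msg.error():
            continue
        records.append(json.loads(msg.value()))
    return records


# Enrich only the positions we actually received, at their own timestamps.
enriched = [
    {**pos, "weather": get_weather(pos["lat"], pos["lon"], pos["timestamp"])}
    for pos in poll_batch()
]
```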