EsperanzaCuartero commented 4 years ago

Challenge #24 - A Simple Global Air Quality Data Classification

Stream 2 - Machine-Learning and Artificial Intelligence

Goal

Simple clustering and quality control algorithm to scrutinize air quality observations from different networks worldwide.

Mentors and skills

Mentors: @miha-at-ecmwf @JohannesFlemming
Skills required
- Clustering techniques
- Data quality control knowledge
- Statistical analysis skills

Challenge description

Data

PM2.5, NO2 and ozone observations from the openAQ network (or similar).
CAMS operational forecast data for the same species and station locations.

What is the current problem/limitation?*

CAMS lacks credible surface air quality observations in many parts of the world, often in the most polluted areas, such as India or Africa. Some observations are available for these areas from data harvesting efforts such as openAQ, but no quality control is applied to the data, and it is often not well known whether the observations are made in a rural, urban or heavily polluted local environment. This information on the environment is important because very locally influenced measurements are mostly not representative of the horizontal scale (40 km) of the CAMS forecasts and should therefore not be used for the evaluation of the CAMS model.

What could be the solution?*

Use AI clustering techniques to identify classes in the observed AQ data.
Identify outliers in the data set and consult with CAMS experts on whether they are erroneous data.
Check the classification against metadata such as population statistics.
Derive similar clusters from the modelled data and compare them against the classification derived from the observed data.
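
As a rough illustration of the clustering step, here is a minimal k-means sketch in plain Python. It is not the challenge's prescribed method: the choice of k-means, the two features per station (mean concentration and typical daily range) and the numbers in `stations` are all assumptions invented for the example.

```python
def dist2(p, q):
    """Squared Euclidean distance between two feature tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=50):
    """Cluster feature tuples into k groups; returns (labels, centroids)."""
    # Deterministic farthest-point initialisation: start from the first
    # point, then repeatedly add the point farthest from the chosen ones.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        )

    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: dist2(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # converged, assignments no longer change
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Invented (mean, daily range) values for six hypothetical stations.
stations = [(8, 5), (10, 6), (9, 4), (55, 30), (60, 35), (58, 28)]
labels, centroids = kmeans(stations, k=2)
print(labels)
```

In practice the features would probably need to be standardised first so that no single one dominates the distance, and the number of clusters would have to be chosen and checked by the CAMS experts.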

Further directions

Investigate the potential to improve the CAMS forecast for major cities worldwide using the information from those observations.

ayushprd commented 4 years ago

Hi, I am Ayush, a Computer Science undergrad from India. This project seems very fascinating to me, and I wish to work on it. Just to confirm: we need to classify and find outliers in the AQ data available from third-party sources like OpenAQ, not the CAMS data itself, right?

JohannesFlemming commented 4 years ago

Hi Ayush, thank you for your interest in ESoWC 2020. Yes, that is right. The classification is for the observations available from openAQ. (The classification will be a necessary step to utilise those data in CAMS.) A further idea would be to consider whether the gridded CAMS data can help in the classification of the openAQ data, but this is not required as part of the challenge.

Please do not hesitate to ask if you have any further questions.

regards, Johannes

jwagemann commented 4 years ago

Join us for our LIVE @ecmwf Summer of Weather Code Ask Me Anything session on 1 April 2020 at 2 pm (CET) (tomorrow).

Get information first-hand from the #ESoWC2020 organisers, mentors and former #ESoWC participants. ➡️Sign up

EsperanzaCuartero commented 4 years ago

Hi Miha and Johannes, During the webinar one participant asked the following questions related to your challenge:

Q.1 What is the metric to determine success? Reducing the uncertainty of the dataset?
Q.2 Is a one-person team with AQ advisors OK?

Could you respond here? Thanks

AshishOhri commented 4 years ago

Hi, I had a question regarding the possible solutions mentioned above.

One of the possible solutions was to check the classification with metadata such as population statistics. I wanted to know whether checking the classification means checking for a correlation between population and the formation of a cluster. For example, two high-population cities A and B fall in one cluster, and cities C and D with low population fall in another. The high population increases the pollution, hence A and B end up in one cluster and C and D in another. In simpler terms, population and pollution are directly proportional. Please correct me if my understanding of the solution is wrong.

Thanks, Ashish Ohri

iam-Shashank commented 4 years ago

Hey, I am interested in this challenge. This is my understanding of the current problem/limitation and possible solutions:

CAMS lacks credible surface air quality observations for some areas.
For these areas, data from openAQ are available.
There are 2 problems with that data:
- no quality control is applied to the data
- it is not well known whether the observations are made in a rural or urban setting
This rural/urban information is important because it indicates measurements influenced by local factors (outliers).
But these locally influenced measurements (which cover a smaller area) aren't suitable/representative for CAMS forecasts (which cover a far wider scale, ~40 km).
Hence, these OUTLIER measurements shouldn't be used for the evaluation of the CAMS model (used for making CAMS forecasts).
Hence, the aim is to identify classes and cluster them, and to remove the above outliers to prevent bad forecasts.

Based on the above, I had some queries:

Which of the following 4 CAMS datasets are relevant to our problem (link)? Some of them are available only “On request”. I want to clear this up so that I can download sample datasets of both CAMS and openAQ to get an idea of the data involved.
How will the consultation with CAMS experts take place for removing errors?

itsmohitanand commented 4 years ago

@miha-at-ecmwf and @JohannesFlemming

I am drafting my proposal and I have some queries. The goal of the project is "Simple clustering and quality control algorithm to scrutinize air quality observations from different networks worldwide."

In the section titled "What could be the solution?*", the first point is "Use AI clustering techniques to identify classes in the observed AQ data." When we talk about clustering here, is it based on the quality control algorithm? Data like OpenAQ are time series, so do we assign each station to a class that varies with seasonality? I don't fully understand this. Could you explain with an example exactly what clustering/classification means here, along with the temporal resolution?

Any explanation on this would be really helpful and then I can share my proposal for review in a day.

JohannesFlemming commented 4 years ago

Hello everybody,

Thanks a lot for your questions. It is great to see so much interest in this task. The nature of your questions indicates that you are already on a good path to tackle the challenge.

The general objective is to find a practical classification of stations for which observations are available in openAQ, especially in regions outside Europe and North America. The classification should tell 1) whether the observations of a station are regularly affected by obvious errors (for example constant values over longer periods, longer gaps) or unrealistically high outliers (a suggestion for a threshold to remove them is welcome) and 2) to what extent the observations are representative of a larger area. The first point would be a simple split into "useful" and "not useful" stations. The second classification, of the useful stations, could assign them to regimes such as "rural", "urban" or "street".
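
A minimal sketch of how the first screening step could look in plain Python, assuming an hourly series per station with `None` for missing hours. The function names and every threshold (24 h of constant values, 72 h gap, 10 median absolute deviations above the median, 1 % outlier share) are placeholders invented for illustration, not values agreed with CAMS.

```python
from statistics import median


def longest_run(flags):
    """Length of the longest run of consecutive True values."""
    longest = current = 0
    for flag in flags:
        current = current + 1 if flag else 0
        longest = max(longest, current)
    return longest


def screen_station(series, max_flat=24, max_gap=72, mad_factor=10.0,
                   max_outlier_share=0.01):
    """Flag obvious errors in one hourly series (None = missing hour)."""
    present = [v for v in series if v is not None]
    if not present:
        return {"useful": False, "longest_flat": 0,
                "longest_gap": len(series), "n_outliers": 0}

    # Constant values: count how long a reading repeats its predecessor.
    repeats = [cur is not None and cur == prev
               for prev, cur in zip(series, series[1:])]
    longest_flat = longest_run(repeats) + 1

    # Gaps: longest stretch of consecutive missing hours.
    longest_gap = longest_run([v is None for v in series])

    # High outliers: values far above the median, scaled by the MAD.
    med = median(present)
    mad = median([abs(v - med) for v in present])
    n_outliers = sum(1 for v in present
                     if mad > 0 and v > med + mad_factor * mad)

    useful = (longest_flat <= max_flat
              and longest_gap <= max_gap
              and n_outliers / len(present) <= max_outlier_share)
    return {"useful": useful, "longest_flat": longest_flat,
            "longest_gap": longest_gap, "n_outliers": n_outliers}


# Synthetic ten-day series with a simple daily cycle (made-up values).
good = [10 + (hour % 24) for hour in range(240)]
stuck = [15.0] * 100 + good                    # sensor stuck for 100 hours
gappy = good[:50] + [None] * 100 + good[50:]   # 100-hour outage
spiky = good + [5000.0]                        # one unrealistic spike

for name, series in [("good", good), ("stuck", stuck),
                     ("gappy", gappy), ("spiky", spiky)]:
    print(name, screen_station(series))
```

A median-based threshold is used here only because it is not distorted by the outliers it is meant to find; a fixed physical limit per species would be an equally reasonable choice.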

The first task is hopefully reasonably well defined; the second is more tricky. Hence the procedure to find the winner will be a judgment by experts rather than a single numerical metric.

A suggestion for a good way to estimate representativeness is to compare the observations of one station with the observations of stations in the vicinity (if available) or to correlate the observations with additional data such as population statistics (high population density = higher pollution) or CAMS air quality products.
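
The comparison with nearby stations could be sketched as follows. This is only one possible reading of the suggestion: the station dictionaries, the function names, the use of Pearson correlation and the 40 km radius (borrowed from the CAMS forecast scale mentioned in the challenge description) are assumptions, and the coordinates and values are invented.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    p1, p2 = radians(lat1), radians(lat2)
    dlat = p2 - p1
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(p1) * cos(p2) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(min(1.0, sqrt(a)))


def pearson(x, y):
    """Pearson correlation over time steps where both series have data."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 3:
        return None
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    if sxx == 0 or syy == 0:
        return None  # a constant series has no defined correlation
    return sxy / sqrt(sxx * syy)


def neighbour_agreement(target, others, radius_km=40.0):
    """Mean correlation with stations within radius_km, or None if there are none."""
    scores = []
    for other in others:
        distance = haversine_km(target["lat"], target["lon"],
                                other["lat"], other["lon"])
        if distance <= radius_km:
            r = pearson(target["values"], other["values"])
            if r is not None:
                scores.append(r)
    return sum(scores) / len(scores) if scores else None


# Invented stations: one close neighbour, one more than 1000 km away.
target = {"lat": 28.60, "lon": 77.20, "values": [40, 55, 70, 60, 45]}
near = {"lat": 28.65, "lon": 77.25, "values": [80, 110, 140, 120, 90]}
far = {"lat": 19.00, "lon": 72.80, "values": [70, 60, 45, 55, 65]}

score = neighbour_agreement(target, [near, far])
print(score)
```

A station that correlates poorly with all of its neighbours is a candidate for the "street" regime or for the "not useful" group; where no neighbours exist, population statistics or the CAMS fields would have to take their place.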

More specific answers
@AshishOhri Checking with population statistics is not compulsory but could be added as outlined above. So yes, your understanding is correct.

@iam-Shashank The CAMS data you could use are the CAMS re-analysis or the operational CAMS forecast product. Both data sets will be made available to you via the Atmosphere Data Store (ADS). More details will follow.

Please let us know if you are OK to retrieve the openAQ data yourselves or if assistance is required.

@melioristic The overall purpose is to get a classification type for each station. The statistical parameters of the time series (e.g. mean value, median of daily max, median of daily differences between daily maximum and minimum) will be the basis for the classification. The challenge is to find a good way of classifying the stations for the practical application. A welcome outcome could also be a classification method that would allow new stations that were not included in the training set to be classified (using their time series).
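
To make the three example parameters concrete, here is a small sketch that computes them for one station and assigns a new station to the nearest existing class. The nearest-centroid rule is just one simple option, and the centroid numbers and hourly values are invented for illustration, not derived from openAQ data.

```python
from statistics import mean, median


def station_features(days):
    """Summarise a station from a list of days, each a list of hourly values.

    Returns (mean value, median of daily max, median of daily max - min).
    """
    days = [day for day in days if day]  # skip days without data
    all_values = [v for day in days for v in day]
    daily_max = [max(day) for day in days]
    daily_range = [max(day) - min(day) for day in days]
    return (mean(all_values), median(daily_max), median(daily_range))


def nearest_class(features, centroids):
    """Label of the class centroid closest to the feature tuple."""
    def dist2(centre):
        return sum((f - c) ** 2 for f, c in zip(features, centre))
    return min(centroids, key=lambda label: dist2(centroids[label]))


# Invented class centroids in the same feature order as station_features.
centroids = {
    "rural": (10.0, 15.0, 8.0),
    "urban": (25.0, 35.0, 20.0),
    "street": (60.0, 90.0, 50.0),
}

# Three short made-up days of readings for a new station.
new_station = [[10, 20, 30], [20, 40, 60], [5, 10, 15]]
features = station_features(new_station)
print(features, nearest_class(features, centroids))
```

Because the centroids would come from whatever clustering is run on the training stations, a new station can be classified from its time series alone, without repeating the clustering.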

Thanks again and do keep asking, Johannes

PS One-person teams are very welcome to take part.

itsmohitanand commented 4 years ago

@JohannesFlemming and @miha-at-ecmwf

Thank you for such a detailed answer. It is really helpful. I now have some more general questions regarding the proposal. I am on the verge of finishing my MSc and have worked on time series data and machine learning. I foresee a research paper coming out of this proposal. Can the proposal contain baseline methods, where we implement pre-existing algorithms, and in the later part focus on developing our own methodologies? Of course, for the latter part it is difficult to give a really concrete description of what will work for sure, but what will be tried will be listed. Do you think a proposal of this kind might be considered a good proposal? Or should we focus more on just tweaking and using pre-existing models?

jwagemann commented 4 years ago

Only 4 days left to apply to be part of ECMWF Summer of Weather Code 2020. Application deadline: Wednesday, 22 April 2020 at 23:59 (BST). Submit your proposal here.

JohannesFlemming commented 4 years ago

Hi @melioristic, thank you for your question. A robust and practical solution to the problem is the main criterion for defining success from our side. The choice of the method and innovation in ML methods is a further criterion, especially if the innovations are prompted by the specifics of the problem of classifying the AQ observations. And finally, concrete plans to write a paper, focusing either more on the methods or on the application, are always very welcome. I hope this answers your question. Best regards, Johannes

itsmohitanand commented 4 years ago

@JohannesFlemming Thanks for your response. We are just submitting the proposal as a team of two. Will wait for the results now.

itsmohitanand commented 4 years ago

@JohannesFlemming @miha-at-ecmwf @jwagemann Just out of curiosity, are the results already out?
