Closed: sunayana closed this issue 3 years ago.
@cmougan @rohaan2614 : Had a quick read of the README and it looks pretty great! I have opened the rest of the issues related to data preparation, which I will be working on today, coordinating with @GvdDool.
My two cents here: I would also treat this as a regression problem rather than a classification problem. Before machine learning techniques were used in this field, deterministic spatial interpolation was used to approximate the function that best predicts the socio-economic index.
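For readers less familiar with that deterministic baseline, here is a minimal sketch of inverse-distance-weighted (IDW) interpolation, one common spatial interpolation method; the coordinates and index values below are made-up placeholders, not project data.

```python
# Minimal IDW sketch: predict an index at new locations as a distance-weighted
# mean of surveyed locations. Toy data only.
import numpy as np

def idw_predict(known_xy, known_values, query_xy, power=2, eps=1e-12):
    """Inverse-distance-weighted prediction at the query points."""
    # Pairwise distances between every query point and every known point
    d = np.linalg.norm(query_xy[:, None, :] - known_xy[None, :, :], axis=2)
    w = 1.0 / (d ** power + eps)          # closer points get larger weights
    return (w @ known_values) / w.sum(axis=1)

# Toy usage: three surveyed locations, predict the index at two new locations
known_xy = np.array([[77.2, 28.6], [72.9, 19.1], [88.4, 22.6]])   # lon, lat
known_values = np.array([0.62, 0.71, 0.55])                        # made-up index values
query_xy = np.array([[75.8, 26.9], [80.3, 13.1]])
print(idw_predict(known_xy, known_values, query_xy))
```

A regression model would instead learn the mapping from auxiliary features (OSM, NTL) to the index, rather than interpolating from neighbouring survey points alone.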
@cmougan @GvdDool @rohaan2614 : For the data sections I have added information about the Python modules, so you can give your feedback.
Suggestion Night Time Light Data: After conversion from HDR (the native format) to GeoTIFF, the daily NTL intensity tiles are available for processing. The project area (continental India) is covered by 7 (or 8) tiles of 10x10 degrees, or 2400x2400 cells. To match the temporal window of the project (2013-2017, 2 years around the DHS 2015 census for India), the total NTL data repository would be more than 1825 data layers (4 MB per HDR / 10 MB per GeoTIFF image). The difference in disk size between HDR and GeoTIFF comes from the compression and data type: HDR files are optimised for storage and contain, besides the light intensity values, the data quality flags. The spatial resolution of the data is 500 m, and, similar to the techniques used to match the OSM data to DHS clusters, a method will have to be developed to aggregate the NTL to the appropriate DHS cluster. It would be recommended to use the same weighted Voronoi polygons when doing the "Zonal Statistics": a spatial operation designed to retrieve key statistics by area (polygons) from raster images.
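As a rough illustration of that "Zonal Statistics" step, a sketch using geopandas and rasterstats could look like the following; the file names (cluster_voronoi.gpkg, ntl_tile.tif) and the cluster_id column are placeholder assumptions, not files in this repository.

```python
# Sketch only: aggregate NTL intensity per DHS cluster with zonal statistics.
import geopandas as gpd
import pandas as pd
from rasterstats import zonal_stats

polygons = gpd.read_file("cluster_voronoi.gpkg")   # (weighted) Voronoi cells, one per DHS cluster
stats = zonal_stats(
    polygons,                 # vector zones
    "ntl_tile.tif",           # one daily NTL GeoTIFF tile
    stats=["mean", "sum", "count"],
)

# zonal_stats returns one dict per polygon, in the same order as the input,
# so the results can be joined back onto the cluster polygons by position.
ntl = pd.DataFrame(stats).add_prefix("ntl_")
clusters_ntl = polygons.join(ntl)
print(clusters_ntl[["cluster_id", "ntl_mean"]].head())
```

Looping this over the daily tiles and stacking the per-date results would give a per-cluster NTL time series.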
@GvdDool feel free to add it to the Readme!
OSM data: "Clusters not shared by both datasets" - could you please explain? It is not clear why a DHS cluster should be removed; the clusters are based on the DHS data.
@cmougan : This is something perhaps added by you.
@cmougan I would rather not add/edit directly in the README; normally I work in a Google Doc with comments, which makes collaboration easier.
@GvdDool : Thanks, I will incorporate this change. The bit on aggregating NTL to the appropriate DHS cluster I will introduce in the Data Preparation sub-section, since that is where I introduce the Voronoi / weighted Voronoi approach for the first time. Hope this is alright.
I did not add that; it needs to be updated.
Sure, that would be the perfect place to link the sections and methods.
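As a possible starting point for that Data Preparation sub-section, here is a rough sketch of building (unweighted) Voronoi polygons around the DHS cluster points with shapely; dhs_clusters.gpkg and the cluster_id column are placeholder names, and a weighted Voronoi variant would need a different construction that is not shown here.

```python
# Sketch: unweighted Voronoi cells around DHS cluster points (placeholder file names).
import geopandas as gpd
from shapely.geometry import MultiPoint
from shapely.ops import voronoi_diagram

clusters = gpd.read_file("dhs_clusters.gpkg")        # point layer, one row per DHS cluster

cells = voronoi_diagram(
    MultiPoint(list(clusters.geometry)),
    envelope=clusters.unary_union.envelope,          # extend the diagram over the study area
)

# voronoi_diagram does not preserve input order, so match each cell back to
# the cluster point that falls inside it.
cell_gdf = gpd.GeoDataFrame(geometry=list(cells.geoms), crs=clusters.crs)
voronoi = gpd.sjoin(cell_gdf, clusters[["cluster_id", "geometry"]], predicate="contains")
voronoi.to_file("cluster_voronoi.gpkg", driver="GPKG")
```

The outer cells extend to the envelope, so clipping them to a country boundary before running the zonal statistics would be sensible.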
I added that bit about removing the clusters not shared by both datasets. It was a misunderstanding. Feel free to edit.
- [x] @rohaan2614 : In the section The Data, subsection Demographic Health Surveys, it would be good to mention the specific datasets (which CSV files) that were used to create the bar graphs or the box plots; this helps with reproducibility when someone else is reading the doc/code.
- [ ] @cmougan : In the section The Data, subsection Night Time Light Data, I have so far not used GEE at all but implemented the data download with the MODAPS client. Just to clarify: was GEE indeed used for some analysis?
It's just one file, and that's DHS_CLEAN.csv. I have mentioned it and copied the notebook into the repo, so I hope there won't be reproducibility issues.
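One low-effort way to head off the reproducibility question is to have the plotting code name its input file explicitly, so the README can simply point at it. A sketch is below; only the file name DHS_CLEAN.csv comes from this discussion, while the column names (wealth_index, state) are hypothetical.

```python
# Sketch: make the source CSV of a figure explicit (hypothetical column names).
import pandas as pd
import matplotlib.pyplot as plt

SOURCE_FILE = "DHS_CLEAN.csv"                # the single input behind the bar/box plots
dhs = pd.read_csv(SOURCE_FILE)

dhs.boxplot(column="wealth_index", by="state", rot=90)
plt.suptitle(f"Wealth index by state (source: {SOURCE_FILE})")
plt.tight_layout()
plt.savefig("wealth_index_boxplot.png", dpi=150)
```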
But this DHS_CLEAN.csv is not present in this repository, and neither is the procedure for how it was obtained. https://drive.google.com/file/d/1IL47jJQeILo_1AElKHOS-ULfRB-7-Y5k/view
I went back to the Slack discussions and found the following thread, from 15 January, in #discussion:
Rong Fang 1:47 AM Hi, I went through the notebooks of the coding pipeline and have some questions, most of which are about data preparation and aggregation. I would appreciate it if someone in charge could answer them.
Rong Fang 2:10 AM
Data sources:
- What are the Census (scraped_india_census2011_housing.csv) and DHS (DHS-PROCESSED-CLEAN.csv) datasets? How are they connected? Are they from different sources? What are their respective spatial scales?
- I saw that the original file (IAGC72FL.csv) is at household level (2,869,043 houses); where was this file acquired? How is this IAGC72FL.csv file connected with the census file scraped_india_census2011_housing.csv?

Data label and aggregation:
- I didn't see the census dataset (processed_census_2011.csv) used in the deep learning modeling process, so what is the purpose of labeling it using k-means clustering? Or was it used in another modeling process?
- As the size, population, and number of households vary between districts, should some features (e.g. materials of the roof, cooking facilities in processed_census_2011.csv) used in the k-means clustering be averaged by district size, number of households, or whatever units the features represent?
- How was the data (IAGC72FL.csv) aggregated from the household level (2,869,043 houses) to the household-cluster level (28,524 clusters)? By the variable dhs_house['hv001']? How was the variable 'hv001' coded?
- Does the geo-coordinate in the file (DHS-PROCESSED-CLEAN.csv) mark the center of each household cluster? Do all the household clusters have a similar area? If so, how big is that area?
- Is each household cluster (DHS-PROCESSED-CLEAN.csv) connected with only one satellite image by its geo-coordinate? If the image resolution is 10m x 10m, does each image cover the entire area of the household cluster?
- Within each household cluster (a row in DHS-PROCESSED-CLEAN.csv), do all the households have the same Toilet Facility, Roof Material, Electricity, Cooking Fuel, and Drinking Water as they were labeled?

Please let me know if the questions are clear enough; they are important for us to understand the whole process and to justify the methods. Thank you. :smiley:
Rehab Emam 5:44 AM Please, Rong, take some time to understand the project or read the reports, because all the answers are already written there and are also in the code.
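The aggregation question above (household level to cluster level via 'hv001') is the kind of step a short pandas sketch can make concrete. This is an illustration only, assuming 'hv001' is the cluster identifier as the question suggests and using a hypothetical has_electricity column; it is not the pipeline's actual code.

```python
# Illustration of household-to-cluster aggregation (assumed/hypothetical columns).
import pandas as pd

households = pd.read_csv("IAGC72FL.csv")     # household-level records

clusters = (
    households
    .groupby("hv001")                        # one group per DHS cluster (assumed)
    .agg(n_households=("hv001", "size"),
         share_electricity=("has_electricity", "mean"))
    .reset_index()
)
print(clusters.head())
```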
@GvdDool : Thank you, in the meantime I also looked into Omdena's code and found the Python notebook which does the cleaning. In this context I have a few questions and concerns for the whole team:
- We cannot put the DHS_CLEANED.csv in the git repository due to the data-sharing restriction we had to agree to when downloading this CSV file.
- Moreover, if we use this file in this repository, we should ask Omdena's permission. I will also check on my side, through my commits, whether I am using this file somewhere.
Sharing restriction: I am almost 99% sure you are correct; access to the original DHS data was protected by a non-distribution agreement. Regarding your second point, Omdena is working for WRI (which holds the IP), so we should ask WRI if we can use the file. My recommendation would be to add a link to the DHS site and not mention the cleaning (especially because we were not sure it was done correctly).
Just putting this on the table in case this helps:
I got direct access to the raw DHS data from DHS themselves back in February, but it would have needed a lot of work to reach the clean-data stage, so we took this shortcut.
WRI does not have direct access to the raw data. Omdena acquired it on their behalf, and the non-distribution agreement exists between Omdena and DHS.
So this is what I will do then. I will scan through the repository a little later today, to fix this issue.
OK, then we just refer to that.
- [ ] @rohaan2614 : In the section The Data, subsection Demographic Health Surveys, it would be good to mention the specific datasets (which CSV files) that were used to create the bar graphs or the box plots; this helps with reproducibility when someone else is reading the doc/code.
- [x] @cmougan : In the section The Data, subsection Night Time Light Data, I have so far not used GEE at all but implemented the data download with the MODAPS client. Just to clarify: was GEE indeed used for some analysis?
We only have what you did here
Thought about this again: anyone who is reproducing our work will need the DHS_CLEAN.csv, so what do we do about that? A suggestion comes to mind: we could clone their notebook and rewrite the code, but it doesn't seem very 'neat'.
The reproduction will be difficult, even for us, when trying to recreate the DHS_CLEAN file, because rewriting the code doesn't guarantee identical results. @cmougan, how important is this in the report? Would a simple flow diagram be enough (if there is still time to produce one)?