Closed: sunayana closed this issue 3 years ago.
@cmougan @rohaan2614 : Had a quick read of the README and it looks pretty great! I have opened the rest of the issues related to data preparation, which I will be working on today, coordinating with @GvdDool.
My two cents here: I would also treat this as a regression problem rather than a classification problem. Before machine learning techniques were used in this field, deterministic spatial interpolation was used to approximate the function that best predicts the socio-economic index.
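For readers less familiar with that deterministic baseline, here is a minimal sketch of inverse-distance-weighted (IDW) interpolation, one common spatial interpolation method; the coordinates and index values below are made-up placeholders, not project data.

```python
# Minimal IDW sketch: predict an index at new locations as a distance-weighted
# mean of surveyed locations. Toy data only.
import numpy as np

def idw_predict(known_xy, known_values, query_xy, power=2, eps=1e-12):
    """Inverse-distance-weighted prediction at the query points."""
    # Pairwise distances between every query point and every known point
    d = np.linalg.norm(query_xy[:, None, :] - known_xy[None, :, :], axis=2)
    w = 1.0 / (d ** power + eps)          # closer points get larger weights
    return (w @ known_values) / w.sum(axis=1)

# Toy usage: three surveyed locations, predict the index at two new locations
known_xy = np.array([[77.2, 28.6], [72.9, 19.1], [88.4, 22.6]])   # lon, lat
known_values = np.array([0.62, 0.71, 0.55])                        # made-up index values
query_xy = np.array([[75.8, 26.9], [80.3, 13.1]])
print(idw_predict(known_xy, known_values, query_xy))
```

A regression model would instead learn the mapping from auxiliary features (OSM, NTL) to the index, rather than interpolating from neighbouring survey points alone.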
@cmougan @GvdDool @rohaan2614 : For the data sections I have added information about the Python modules, so you can give your feedback.
Suggestion Night Time Light Data: After conversion from HDR (the native format) to GeoTIFF, the daily NTL intensity tiles are available for processing. The project area (continental India) is covered by 7 (or 8) tiles of 10x10 degrees, or 2400x2400 cells. To match the temporal window of the project (2013-2017, 2 years around the DHS 2015 census for India), the total NTL data repository would be more than 1825 data layers (4 MB per HDR / 10 MB per GeoTIFF image). The difference in disk size between HDR and GeoTIFF comes from the compression and data type: HDR files are optimised for storage and contain, besides the light intensity values, the data quality flags. The spatial resolution of the data is 500 m, and, similar to the techniques used to match the OSM data to DHS clusters, a method will have to be developed to aggregate the NTL to the appropriate DHS cluster. It would be recommended to use the same weighted Voronoi polygons when doing the "Zonal Statistics": a spatial operation designed to retrieve key statistics by area (polygons) from raster images.
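As a rough illustration of that "Zonal Statistics" step, a sketch using geopandas and rasterstats could look like the following; the file names (cluster_voronoi.gpkg, ntl_tile.tif) and the cluster_id column are placeholder assumptions, not files in this repository.

```python
# Sketch only: aggregate NTL intensity per DHS cluster with zonal statistics.
import geopandas as gpd
import pandas as pd
from rasterstats import zonal_stats

polygons = gpd.read_file("cluster_voronoi.gpkg")   # (weighted) Voronoi cells, one per DHS cluster
stats = zonal_stats(
    polygons,                 # vector zones
    "ntl_tile.tif",           # one daily NTL GeoTIFF tile
    stats=["mean", "sum", "count"],
)

# zonal_stats returns one dict per polygon, in the same order as the input,
# so the results can be joined back onto the cluster polygons by position.
ntl = pd.DataFrame(stats).add_prefix("ntl_")
clusters_ntl = polygons.join(ntl)
print(clusters_ntl[["cluster_id", "ntl_mean"]].head())
```

Looping this over the daily tiles and stacking the per-date results would give a per-cluster NTL time series.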
@GvdDool feel free to add it to the Readme!
OSM data: "Clusters not shared by both datasets" - could you please explain? It is not clear why a DHS cluster should be removed; the clusters are based on the DHS data.
@cmougan : This is something perhaps added by you.
@cmougan I would rather not add/edit directly in the README; normally I work in a Google Doc with comments, which makes collaboration easier.
@GvdDool : Thanks, I will incorporate this change. The bit on aggregating NTL to the appropriate DHS cluster I will introduce in the Data Preparation sub-section, since that is where I introduce the Voronoi / weighted Voronoi approach for the first time. Hope this is alright.
I did not add that; it needs to be updated.
Sure, that would be the perfect place to link the sections and methods.
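As a possible starting point for that Data Preparation sub-section, here is a rough sketch of building (unweighted) Voronoi polygons around the DHS cluster points with shapely; dhs_clusters.gpkg and the cluster_id column are placeholder names, and a weighted Voronoi variant would need a different construction that is not shown here.

```python
# Sketch: unweighted Voronoi cells around DHS cluster points (placeholder file names).
import geopandas as gpd
from shapely.geometry import MultiPoint
from shapely.ops import voronoi_diagram

clusters = gpd.read_file("dhs_clusters.gpkg")        # point layer, one row per DHS cluster

cells = voronoi_diagram(
    MultiPoint(list(clusters.geometry)),
    envelope=clusters.unary_union.envelope,          # extend the diagram over the study area
)

# voronoi_diagram does not preserve input order, so match each cell back to
# the cluster point that falls inside it.
cell_gdf = gpd.GeoDataFrame(geometry=list(cells.geoms), crs=clusters.crs)
voronoi = gpd.sjoin(cell_gdf, clusters[["cluster_id", "geometry"]], predicate="contains")
voronoi.to_file("cluster_voronoi.gpkg", driver="GPKG")
```

The outer cells extend to the envelope, so clipping them to a country boundary before running the zonal statistics would be sensible.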
I added that bit about removing the clusters not shared by both datasets. It was a misunderstanding. Feel free to edit.
- [x] @rohaan2614 : In the section The Data, subsection Demographic Health Surveys, it would be good to mention the specific datasets (which CSV files) that were used to create the bar graphs or the box plots; this helps with reproducibility when someone else is reading the doc/code.
- [ ] @cmougan : In the section The Data, subsection Night Time Light Data, I have so far not used GEE at all but implemented the data download with the MODAPS client. Just to clarify: was GEE indeed used for some analysis?
It's just one file, and that's DHS_CLEAN.csv. I have mentioned it and copied the notebook into the repo, so I hope there won't be reproducibility issues.
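One low-effort way to head off the reproducibility question is to have the plotting code name its input file explicitly, so the README can simply point at it. A sketch is below; only the file name DHS_CLEAN.csv comes from this discussion, while the column names (wealth_index, state) are hypothetical.

```python
# Sketch: make the source CSV of a figure explicit (hypothetical column names).
import pandas as pd
import matplotlib.pyplot as plt

SOURCE_FILE = "DHS_CLEAN.csv"                # the single input behind the bar/box plots
dhs = pd.read_csv(SOURCE_FILE)

dhs.boxplot(column="wealth_index", by="state", rot=90)
plt.suptitle(f"Wealth index by state (source: {SOURCE_FILE})")
plt.tight_layout()
plt.savefig("wealth_index_boxplot.png", dpi=150)
```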
But this DHS_CLEAN.csv is not present in this repository, and neither is the procedure for how it was obtained. https://drive.google.com/file/d/1IL47jJQeILo_1AElKHOS-ULfRB-7-Y5k/view
I went back to the Slack discussions and found the following thread, from 15 January, in #discussion:
Rong Fang 1:47 AM Hi, I went through the notebooks of the coding pipeline and have some questions, most of which are about data preparation and aggregation. I would appreciate it if someone in charge could answer them.
Rong Fang 2:10 AM
Data sources:
- What are the Census (scraped_india_census2011_housing.csv) and DHS (DHS-PROCESSED-CLEAN.csv) datasets? How are they connected? Are they from different sources? What are their respective spatial scales?
- I saw that the original file (IAGC72FL.csv) is at household level (2,869,043 houses); where was this file acquired? How is this IAGC72FL.csv file connected with the census file scraped_india_census2011_housing.csv?

Data label and aggregation:
- I didn't see the census dataset (processed_census_2011.csv) used in the deep learning modeling process, so what is the purpose of labeling it using k-means clustering? Or was it used in another modeling process?
- As the size, population, and number of households vary between districts, should some features (e.g. materials of the roof, cooking facilities in processed_census_2011.csv) used in the k-means clustering be averaged by district size, number of households, or whatever units the features represent?
- How was the data (IAGC72FL.csv) aggregated from the household level (2,869,043 houses) to the household-cluster level (28,524 clusters)? By the variable dhs_house['hv001']? How was the variable 'hv001' coded?
- Does the geo-coordinate in the file (DHS-PROCESSED-CLEAN.csv) mark the center of each household cluster? Do all the household clusters have a similar area? If so, how big is that area?
- Is each household cluster (DHS-PROCESSED-CLEAN.csv) connected with only one satellite image by its geo-coordinate? If the image resolution is 10m x 10m, does each image cover the entire area of the household cluster?
- Within each household cluster (a row in DHS-PROCESSED-CLEAN.csv), do all the households have the same Toilet Facility, Roof Material, Electricity, Cooking Fuel, and Drinking Water as they were labeled?

Please let me know if the questions are clear enough; they are important for us to understand the whole process and to justify the methods. Thank you. :smiley:
Rehab Emam 5:44 AM Please, Rong, take some time to understand the project or read the reports, because all the answers are already written there and are also in the code.
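The aggregation question above (household level to cluster level via 'hv001') is the kind of step a short pandas sketch can make concrete. This is an illustration only, assuming 'hv001' is the cluster identifier as the question suggests and using a hypothetical has_electricity column; it is not the pipeline's actual code.

```python
# Illustration of household-to-cluster aggregation (assumed/hypothetical columns).
import pandas as pd

households = pd.read_csv("IAGC72FL.csv")     # household-level records

clusters = (
    households
    .groupby("hv001")                        # one group per DHS cluster (assumed)
    .agg(n_households=("hv001", "size"),
         share_electricity=("has_electricity", "mean"))
    .reset_index()
)
print(clusters.head())
```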
@GvdDool : Thank you, in the meantime I also looked into Omdena's code and found the Python notebook which does the cleaning. In this context I have a few questions and concerns for the whole team:
- We cannot put the DHS_CLEANED.csv in the git repository due to the data-sharing restriction we had to agree to when downloading this CSV file.
- Moreover, if we use this file in this repository, we should ask Omdena's permission. I will also check on my side, through my commits, whether I am using this file somewhere.
Sharing restriction: I am almost 99% sure you are correct; access to the original DHS data was protected by a non-distribution agreement. Regarding your second point, Omdena is working for WRI (which holds the IP), so we should ask WRI if we can use the file. My recommendation would be to add a link to the DHS site and not mention the cleaning (especially because we were not sure it was done correctly).
Just putting this on the table in case this helps:
I got direct access to the raw DHS data from DHS themselves back in February, but it would have needed a lot of work to reach the clean-data stage, so we took this shortcut.
WRI does not have direct access to the raw data. Omdena acquired it on their behalf, and the non-distribution agreement exists between Omdena and DHS.
So this is what I will do then. I will scan through the repository a little later today, to fix this issue.
OK, then we just refer to that.
- [ ] @rohaan2614 : In the section The Data, subsection Demographic Health Surveys, it would be good to mention the specific datasets (which CSV files) that were used to create the bar graphs or the box plots; this helps with reproducibility when someone else is reading the doc/code.
- [x] @cmougan : In the section The Data, subsection Night Time Light Data, I have so far not used GEE at all but implemented the data download with the MODAPS client. Just to clarify: was GEE indeed used for some analysis?
We only have what you did here
Thought about this again: anyone who is reproducing our work will need the DHS_CLEAN.csv, so what do we do about that? A suggestion comes to mind: we could clone their notebook and rewrite the code, but it doesn't seem very 'neat'.
The reproduction will be difficult, even for us, when trying to recreate the DHS_CLEAN file, because rewriting the code doesn't guarantee identical results. @cmougan, how important is this in the report? Would a simple flow diagram be enough (if there is still time to produce one)?