capstone-coal / pycoal

Python toolkit for characterizing Coal and Open-pit surface mining impacts on American Lands
http://capstone-coal.github.io/
GNU General Public License v2.0

research paper #27

Closed ghost closed 6 years ago

ghost commented 7 years ago

this issue is to coordinate collaboration on our research paper. this is distinct from the papers and presentation we will be giving for capstone, though there may be much we can reuse.

the conferences we have discussed include AGU, ESIP, and SPIE. i'm not really familiar with what these conferences are looking for. we are building off of a lot of existing literature and software, so we should probably focus on what's unique about our project.

resources to consult for this include the mailing list, source code and issues on GitHub, meeting notes, and Google Drive.

an edit link to a template on overleaf has been posted. since everything is public, be aware that we may need to depend on the revisions feature to revert any malicious or otherwise unwanted changes. (wikipedia anyone?)

i suggest we come up with some kind of outline and maybe a list of references, even if it's just plain text, and work on typesetting later on.

we may prefer to discuss this over the mailing list. can't get much done if GitHub gets DDoS'd again.

lewismc commented 7 years ago

Hi @xiaomei7 I would like to work with you on this issue.

lewismc commented 7 years ago

Here is the overleaf. https://www.overleaf.com/8121520yzjysywvdkhb#/28676449/

lewismc commented 7 years ago

Hi @xiaomei7 the paper I was talking about at today's meeting can be found here. We can use its structure for our paper. See here for the PDF version. Thanks

lewismc commented 7 years ago

Hi @xiaomei7 are you comfortable making a start on the paper? It would be good if we were in a position to discuss structure and content this coming Friday. Please let me know. Thank you

xiaomei7 commented 7 years ago

Hi @lewismc, I'm ready to start on the paper.

lewismc commented 7 years ago

Ok great, can you please start on the structure? Take it directly from the other paper I provided a link to. You can also start on the introduction, basically providing context for the project and the structure of the paper.

ghost commented 7 years ago

one thing we should mention is why we ended up using spectral angle mapper classification rather than neural networks, as we originally planned based on the Benediktsson et al. paper.

the responses to heidi's question provide some good details. in addition, the maintainer recommended against using the spectral python implementation in favor of a more mature machine learning library. the main issue is that we only have a single sample per class, whereas NNs are better with lots of samples. one way of getting around this would be to group spectral library classes together, for example to merge all the coal classifications into one. another way described by our TA and the spectral python maintainer is to perturb the data.
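The two workarounds mentioned above (merging library classes and perturbing the data) can be sketched roughly as follows. This is only an illustration in plain Python: the function names, the class names, and the noise level `sigma` are all assumptions made up for the example, not anything from pycoal or Spectral Python.

```python
import random


def perturb_spectrum(spectrum, n_copies, sigma=0.01, seed=0):
    """Return n_copies noisy variants of a single library spectrum.

    sigma is the standard deviation of additive Gaussian noise; the default
    is a hypothetical value chosen only for illustration.
    """
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    return [
        # Clamp at zero because reflectance cannot be negative.
        [max(0.0, value + rng.gauss(0.0, sigma)) for value in spectrum]
        for _ in range(n_copies)
    ]


def merge_classes(labels, groups):
    """Map fine-grained class names onto coarser group names."""
    lookup = {name: group for group, names in groups.items() for name in names}
    # Labels that belong to no group are passed through unchanged.
    return [lookup.get(label, label) for label in labels]


# One reflectance sample for a made-up class, expanded into a small training set.
coal_sample = [0.05, 0.06, 0.08, 0.07]
training_set = perturb_spectrum(coal_sample, n_copies=5)

# Hypothetical class names merged into a single "coal" label.
merged = merge_classes(
    ["coal_a", "coal_b", "quartz"],
    {"coal": ["coal_a", "coal_b"]},
)
```

Whether Gaussian noise is a realistic model of spectral variability is an open question here; it is used only because it is the simplest perturbation to write down.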

there's also the lecture we talked about which described active learning classification approaches.

i think the main problem with machine learning in the context of our problem is that we don't know what the right answers are with which to train. we are also hindered by the computational intensity of machine learning when applied to large datasets on small machines.

the spectral angle approach does produce a valuable novel data set that can be mined for mines and many other things, although the data is much less granular than i had hoped. one way of getting around that would be to write out not only the first classification per pixel, but several.
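A rough sketch of the "several classifications per pixel" idea, assuming each pixel and each library entry is a plain list of reflectance values. The names `spectral_angle`, `top_k_classes`, and the three toy classes are hypothetical and are not part of pycoal.

```python
import math


def spectral_angle(a, b):
    """Angle in radians between two spectra treated as vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return math.pi / 2  # treat an all-zero spectrum as orthogonal
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def top_k_classes(pixel, library, k=3):
    """Return the k class names with the smallest spectral angle to pixel."""
    ranked = sorted(library, key=lambda name: spectral_angle(pixel, library[name]))
    return ranked[:k]


# Toy three-band library with made-up class names.
library = {
    "class_a": [1.0, 0.0, 0.0],
    "class_b": [1.0, 1.0, 0.0],
    "class_c": [0.0, 0.0, 1.0],
}
pixel = [0.9, 0.1, 0.0]
ranking = top_k_classes(pixel, library, k=2)
print(ranking)  # ['class_a', 'class_b']
```

Writing the ranked list to extra raster bands would keep the output format unchanged for the first band while exposing the runner-up classes.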

ghost commented 7 years ago

the performance issue details our server specifications and some initial timing results.

xiaomei7 commented 7 years ago

@browtayl Thanks for the updates.

ghost commented 7 years ago

no prob, let me know if i can do anything to help with this over the next few weeks.

ghost commented 7 years ago

here are some resources on citing usgs, aster, and xsede.

lewismc commented 7 years ago

@xiaomei7 please scrap the thought of us submitting this manuscript to the geosci-model-dev open access journal (publication fees are ~$75 per page)... Instead we will submit this to Elsevier's Computers and Geosciences. Here is an author information pack. They have also provided this GitHub repository to give us more information on the publication itself: https://github.com/CAGEO. We can still use the Overleaf I've referenced above.

xiaomei7 commented 7 years ago

@lewismc Thanks. BTW, do you want me to just write on the Overleaf, replacing the original content?

lewismc commented 7 years ago

Yes please. Thank you.

ghost commented 7 years ago

here's info on citing the national hydrography dataset and the national elevation dataset.

ghost commented 7 years ago

...and the national transportation dataset.

ghost commented 7 years ago

What's the timeframe for this paper? Currently the milestone is set at 1.0, when our development for the course ceases on 2017-05-12. However, I for one would be willing to contribute during the month remaining from Expo to the end of the term on 2017-06-17. This might take some of the pressure off. We did indicate in our Requirements Document that the conference paper deadline was after the release date. If so, we could bump this up to post-1.0.

lewismc commented 7 years ago

@browtayl we need to start working on it... 2 weeks ago. These things take time and as I am very busy it will take longer than usual. I am not sure how much experience you all have writing papers, but it is far from easy. It is very time consuming and we need to get cracking on it. @xiaomei7 anything happening here? If you need some assistance then please let me know here or else let me know offline and I can guide you. As far as I know this is your only task right now so we need to make sure that progress is being made. Thank you @xiaomei7

xiaomei7 commented 7 years ago

@lewismc Hi, I'm trying to refine some sentences that I wrote or copied from the introduction paragraph on our pycoal main page. But I'm a bit confused about the difference between the abstract and the introduction.

lewismc commented 7 years ago

If you read the last paper I referenced, then you will see what the structure is and what content goes where. Please never mind the Latex template, just start a google doc and I will contribute the material alongside you. Thank you @xiaomei7

xiaomei7 commented 7 years ago

@lewismc Thanks, I will post all of my progress no later than tomorrow!

ghost commented 7 years ago

@lewismc understood, the heat is still on. my specific question is what is the desired target date for a final draft of the paper? expo is on 2017-05-19 and the end of the term is 2017-06-17.

@xiaomei7, @heidiaclayton, and i will make sure this gets done, and done well. our TA reminded us today that this was one of our "stretch goals". although it will be a stretch to pull it together, having a really solid publication will make it all worth it.

@lewismc do you have a specific thesis in mind, an angle you want to approach the paper from? this project covered a lot of ground:

1) a review of hyperspectral image classification using spectral libraries by means of machine-learning (e.g., multi-layer perceptron classification and active feedback learning) and numerical/statistical methods (e.g., [modified] spectral angle mapper classification, our present algorithm);
2) software development with existing libraries (e.g., numpy, spectral, gdal) for converting orthocorrected, scaled-reflectance imagery to a variety of geographic raster representations for environmental data science;
3) eventual bulk processing of AVIRIS imagery to precompute classified data products on a large scale for rapid arbitrary research utilizing land surface data; and
4) our actual environmental study, which has produced some really interesting results that
   a) confirm the hypothesis that spectral classification can identify signatures of coal mining in imagery of active mines,
   b) show that imagery of a particular mine accused of environmental impacts tests positive for these signatures, and
   c) confirm that correlating mine classifications, hydrography data, and (hopefully) water quality data can improve the environmental understanding.

in terms of publishers, we might also consider the Public Library of Science which is a really nice open-access journal, though i don't know if it's the best fit for our project.

xiaomei7 commented 7 years ago

@lewismc Hi, I already posted some structure on the overleaf based on my understanding about the paper; but we need more information about what @browtayl has mentioned above to make more progress.

Thanks!

lewismc commented 7 years ago

Hi @xiaomei7 thank you for making a start on the Overleaf... for the time being, can you just transition this over to a Google Doc and we can contribute there? I think overleaf is overkill at this stage. Thank you

ghost commented 7 years ago

@lewismc @xiaomei7 created a google doc in the capstone-coal google drive directory which I renamed to "Research Paper Draft".

ghost commented 7 years ago

see this note from the AVIRIS site regarding terminology:

Please note that when referring to/requesting AVIRIS data, we are working to use the terms "imaging spectroscopy" and "imaging spectrometer data" rather than "hyperspectral." This allows us to communicate more clearly with our physics, chemistry, and biology science colleagues.

ghost commented 7 years ago

Once I'm out of Code Freeze Hell I will propose an abstract and outline. In the meantime, any resources on research paper writing or feedback on my earlier comment (https://github.com/capstone-coal/pycoal/issues/27#issuecomment-295614786) would be beneficial.

xiaomei7 commented 7 years ago

I have been adding content every day for the past week or so, but I'm not sure whether I'm on the right track. Feedback is much appreciated.

lewismc commented 7 years ago

I've added a lot of structure and guidance to the paper outline. Please review and start hacking away if and where you feel you want to. Once I get a good opportunity to do so, I will hammer my way through it as well. Thank you @xiaomei7 for bootstrapping the paper and also thank you @browtayl for adding content as well.

ghost commented 7 years ago

Dropping a reference here to the screenshots directory on Google Drive which contains all of the images I've discussed throughout multiple email and GitHub threads. Let me know if it is desired to generate a print-quality version of any of these.

I recently uploaded some interesting visible light + mining classified screenshots that further build confidence in this algorithm: the sludge classifications in the tailing ponds, a bright-green pool hidden in the hillside, and several more tailing ponds with evident leachate.

ghost commented 7 years ago

And here's a view of the power plant which we have seen before.

ghost commented 7 years ago

The USGS Digital Spectral Library 07 was just released on 2017-04-10: https://speclab.cr.usgs.gov/spectral-lib.html#current https://pubs.er.usgs.gov/publication/ds1035

We have been using the 06 version. It will be interesting to see what they added.

ghost commented 7 years ago

@lewismc No meaningful progress is being made here. If there are no substantive changes by 2017-05-19 I will unassign @xiaomei7 and take this over.

ghost commented 7 years ago

Here are some interesting resources regarding AVIRIS operations after the Deepwater Horizon oil spill: https://aviris.jpl.nasa.gov/html/gulfoilspill.html https://pubs.usgs.gov/of/2010/1101/ https://pubs.er.usgs.gov/publication/70036149 https://www.researchgate.net/publication/261076420_Oil_Slope_Index_An_algorithm_for_crude_oil_spill_detection_with_imaging_spectroscopy

It might be interesting to try to apply COAL to some of the oil slick imagery to demonstrate an unrelated application. Those papers might also have some good general information or inspiration for our own.

xiaomei7 commented 7 years ago

Question: is our development style agile?

lewismc commented 7 years ago

yes I would say it has been pretty agile. I mean we have not followed the methodology to the T, however generally speaking development has been pretty agile in nature.

ghost commented 7 years ago

I'm going to try to systematically review our correspondence to summarize relevant details.

ghost commented 7 years ago

In our final meeting on 2017-05-26 it was agreed that the documentation and data products the team delivered satisfied our research commitments. I can be available on an ongoing basis to answer any questions pertaining to our methodology and results. Everything of consequence has been documented on our mailing list, GitHub issues and pull requests, and Google Drive folder. Our conversations have been liberally cross-referenced; however, I will attempt to summarize our most important findings here in roughly chronological order.

Fall term was primarily devoted to the ongoing ~(and as yet incomplete)~ task of writing course documents. The LaTeX source of all of our documents can be found in the docs/course/ directory of the pycoal repository. Typeset versions of ~most of~ these documents can be found there and in the deliverables folder on Drive. ~The official report submissions including our documents and presentations are in the cs461, cs462, and cs463 directories of my OSU Engineering web space.~ Our final report assignment which updates and bundles up everything of note in our project is due 2017-06-13. These documents provide a reasonable overview and timeline of our project and are worth reviewing to provide context for the research paper. They introduced the goals and the framework of the project, but some details are lacking because most were written before our project evolved during major research and development in Winter and Spring terms.

Our initial research and development efforts began on the mailing list and continued on GitHub. The literature folder on Drive contains much of our background reading. The beginning of Winter term marked the start of active development. Initial comprehension of the AVIRIS data format and Spectral Python library, sketches, and outline of pseudocode were discussed in our [cs462][COAL] Bits and Pieces thread in January.

Background on neural network methods was discussed in our [cs462][COAL] Neural Networks thread. We didn't end up using machine learning due to time and technical constraints, but some useful insight was provided by Spectral Python developers and the "state of the art" active learning presentation from JPL. In my opinion, machine learning methods are a worthwhile direction for future research and development. Integration into COAL would require some fundamental changes to the current algorithm; however, current data products may have applications as training data. Machine learning and data mining are advanced topics (the subject of several upper-division and graduate courses at OSU), so I would recommend a systematic review of the field with applications to spectroscopy classification before deciding on a particular approach or algorithm. Being able to generate, distribute, and use application-specific classifiers would be a powerful way to extend COAL's functionality. One possible method we discussed is to use image segmentation to detect spatial relationships between pixels such as weathering gradients; however, our classified data didn't end up giving us this kind of detail.

Our case study began with imagery from the Fwd: FW: Location of coal mine thread. After searching for mine locations and imagery from other states (in particular, New Mexico, Colorado, Utah, and West Virginia), we requested high-resolution flightlines from New Mexico from JPL in the Fwd: [cs462][PyCOAL] Re: FW: AVIRIS-C/NG Mining Flight Lines thread. The data was not obtained until the beginning of Spring term and described in the Flight Lines Copying thread. We only used L2 (orthocorrected, scaled reflectance) image files and deleted the rest. The AVIRIS data is surprisingly sparse and of inconsistent quality. A future direction for research would be to integrate support for satellite-mounted sensors such as ASTER which also uses ENVI format.

We eventually decided to focus on effects of coal mining on water resources as discussed in the choose environment case study issue. The [cs462][COAL] Water Quality Data thread discussed what turned out to be extremely sparse point water quality measurements that did not end up being useful for environmental correlation. We obtained the GRaND database to this end, but as discussed in our [cs462][COAL] Meeting and Notes this was not applicable. I ended up using stream flow lines from the National Hydrography Dataset to correlate mining classifications with the locations of streams and water bodies. It was fortuitous that the USGS Digital Spectral Library 06 contained acid mine drainage and coal sludge classifications (Schwertmannite BZ93-1 s06av95a=b, Renyolds_TnlSldgWet SM93-15w s06av95a=a, and Renyolds_Tnl_Sludge SM93-15 s06av95a=a) so we could make the environmental correlation without actually having water quality data. One thing we learned from this (something industry reps at Expo were keen on) is that imaging spectrometry enables remote sensing of pollutants and other substances that have not been measured on the ground. After acquiring imagery of sufficient quality, the top priority is to obtain relevant spectral samples for domain specific applications. See the AVIRIS page on the Deepwater Horizon oil spill for related techniques to compute Oil Slope Index using minimal data points. The USGS Spectral Library Version 7 was recently released, so it should be a priority to support this data to improve COAL.

Our mineral classification efforts evolved into a Spectral Angle Mapper classifier implementation that chooses the "most likely" class out of a library of samples. Improvements were made for RGB images, performance, subset/threshold functionality, and experimental ASTER support as described by the Mineral Classification API documentation and website example. The mineral classification was the most computationally intensive part of this project, as described in the process AVIRIS imagery issue, which includes the processing scripts and basic timing data.
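For readers unfamiliar with the technique, a Spectral Angle Mapper classifier with an optional threshold can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the pycoal implementation (which builds on NumPy and Spectral Python); the function names, the two toy classes, and the threshold of 0.5 radians are assumptions.

```python
import math


def spectral_angle(a, b):
    """Angle in radians between two spectra treated as vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return math.pi / 2  # treat an all-zero spectrum as orthogonal
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def classify_pixel(pixel, library, threshold=None):
    """Return the library class with the smallest angle to pixel.

    If threshold (radians) is given and even the best match exceeds it,
    the pixel is left unclassified (None).
    """
    best_name, best_angle = None, math.inf
    for name, reference in library.items():
        angle = spectral_angle(pixel, reference)
        if angle < best_angle:
            best_name, best_angle = name, angle
    if threshold is not None and best_angle > threshold:
        return None
    return best_name


def classify_image(image, library, threshold=None):
    """Classify every pixel of a rows x columns x bands nested list."""
    return [[classify_pixel(p, library, threshold) for p in row] for row in image]


# Toy three-band library with made-up class names.
library = {
    "coal_like": [0.1, 0.1, 0.1],
    "vegetation_like": [0.05, 0.4, 0.1],
}
image = [
    [[0.2, 0.2, 0.2], [0.1, 0.8, 0.2]],
    [[0.0, 0.0, 1.0], [0.3, 0.3, 0.3]],
]
classified = classify_image(image, library, threshold=0.5)
```

Because the angle ignores vector length, a pixel that is a brighter or darker copy of a library spectrum still matches it, which is the main appeal of the method when only one sample per class is available.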

The mining classification issue was comparatively straightforward as it simply selected the coal mine waste classifications from the mineral-classified images. The initial working implementation and examples of usage were described in the corresponding pull request. The Mining Identification API documentation and example on the website describe usage. The pycoal/tests also provide subimages and potentially useful code examples.
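The filtering step can be illustrated with a minimal sketch, on the assumption that it keeps only the coal mine waste classes and blanks every other pixel. The class names and the `filter_mining` helper are hypothetical stand-ins, not the identifiers used in pycoal or the spectral library.

```python
# Hypothetical class names standing in for the coal mine waste entries
# of the spectral library.
MINING_CLASSES = frozenset({"acid_mine_drainage", "coal_sludge"})


def filter_mining(classified, mining_classes=MINING_CLASSES):
    """Keep mining-related labels and replace everything else with None."""
    return [
        [label if label in mining_classes else None for label in row]
        for row in classified
    ]


# A 2 x 2 mineral-classified image; None marks an unclassified pixel.
classified = [
    ["quartz", "coal_sludge"],
    ["acid_mine_drainage", None],
]
mining = filter_mining(classified)
print(mining)  # [[None, 'coal_sludge'], ['acid_mine_drainage', None]]
```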

The environmental correlation task intersected coal mine classifications within a given distance of hydrography data. The previous stages required GIS software for visualization, but environmental correlation required it for processing. First the vector data is reprojected and rasterized, then a proximity map is generated, and finally the mining pixels are intersected with the proximity map. The Environmental Correlation API documentation, website example, and unit test document usage. The source code is of course the ultimate reference of the implementation, happily licensed under ~Apache v2~ GPL v2.
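The last two steps (proximity map, then intersection) can be sketched as below. The sketch assumes the hydrography has already been reprojected and rasterized onto the same grid as the mining pixels, and it uses a brute-force distance search that is only practical for tiny grids; the project itself relied on GIS tooling for this. All names, the 10 m cell size, and the 15 m distance are illustrative assumptions.

```python
import math


def proximity_map(water_mask, cell_size=1.0):
    """Distance from every cell to the nearest water cell, in map units."""
    water_cells = [
        (r, c)
        for r, row in enumerate(water_mask)
        for c, is_water in enumerate(row)
        if is_water
    ]
    return [
        [
            # default=inf covers the case of a grid with no water at all.
            min(
                (math.hypot(r - wr, c - wc) * cell_size for wr, wc in water_cells),
                default=math.inf,
            )
            for c in range(len(row))
        ]
        for r, row in enumerate(water_mask)
    ]


def correlate(mining_mask, proximity, max_distance):
    """Mark mining cells that lie within max_distance of water."""
    return [
        [bool(m) and d <= max_distance for m, d in zip(m_row, d_row)]
        for m_row, d_row in zip(mining_mask, proximity)
    ]


# Toy 2 x 4 grids: one water cell in the corner, three mining cells.
water = [[1, 0, 0, 0], [0, 0, 0, 0]]
mining = [[0, 1, 0, 1], [0, 0, 1, 0]]
near_water = correlate(mining, proximity_map(water, cell_size=10.0), max_distance=15.0)
```

Only the mining cell directly next to the water cell (10 m away) survives; the other two are roughly 22 m and 30 m away and are dropped.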

The GIS Bundle contains all the relevant data files for the New Mexico case study and is ready to be loaded into a suitable build of QGIS. Intermediate data products from our AWS instance have been backed up in AWS Glacier. The data/Screenshots Drive directory contains particularly interesting regions that we discussed throughout the project. The website, poster, and course documents also contain pertinent images and summaries of our data. The case study data is currently not documented on the website.

This summarizes issues directly relating to the COAL implementation. However, it would not have been possible without supporting software engineering practices such as comments and API documentation, linting, unit testing, version control, project organization, decentralized development, FOSS contributions, and SDS implementation, among other things. In general, my feeling is that although our implementation was relatively simple in terms of lines of code, the background research and development required to make any progress at all was significant. At the end of the day we explored a productive method for analyzing spectral and geospatial data that has applications beyond our specific project. The techniques learned here could be used to inform other specialized applications as well as to improve the general applicability of the COAL library. It was also a good example of bringing together existing data sets (AVIRIS, The National Map, the spectral library) and software (NumPy, Spectral Python, GDAL, and QGIS) to derive novel conclusions.

@lewismc Hopefully this summary provides an improved basis for drafting the research paper. I will now unassign myself, @heidiaclayton, and @xiaomei7. Once again feel free to contact me for clarification on any of the points we discussed during research and development.

ghost commented 7 years ago

Here is a link I previously shared regarding making publication-quality maps with QGIS: http://www.qgistutorials.com/en/docs/making_a_map.html

ghost commented 7 years ago

The final report document and all other course submissions have been uploaded to the deliverables folder on Drive with the sources in the docs/course directory in git. The duplicate files in my OSU Engineering web space may be deleted at any time.

ghost commented 7 years ago

How to cite NHD: https://github.com/capstone-coal/pycoal/issues/111#issuecomment-313361324

ghost commented 7 years ago

This is tangentially related, but I recently spoke with a cartographer to ask what software they used for mapmaking. They said that the maps were drawn with commercial photo-editing software using GIS work done by others.

ghost commented 7 years ago

Potentially relevant: https://minerals.usgs.gov/science/hyperspectral-AK-mineral-deposits/index.html

ghost commented 6 years ago

Thanks to @lewismc for putting together our first publication for BiDS. We can either close this issue or leave it open to facilitate future publications.

lewismc commented 6 years ago

Thanks to everyone for participating