Add Generalized Dissimilarity Modelling to SP

Original Issue  - 
https://code.google.com/p/alageospatialportal/issues/detail?id=304

Project Member Reported by leebel...@gmail.com, Nov 22, 2010 
Code and documentation received from Glenn Manion. Slot in integration with the 
Spatial Portal
 Mar 17, 2011 Project Member #1 leebel...@gmail.com 
Upping to HIGH as we need something basic running for meeting with Simon, 
Glenn, Kristen probably during the week of April 4.
 Labels: -Priority-Medium Priority-Critical
 May 12, 2011 Project Member #2 leebel...@gmail.com
Lowering priority till UI done.
 Cc: leebel...@gmail.com Labels: -Priority-Critical Priority-High
 May 24, 2011 Project Member #3 leebel...@gmail.com
Started
 Status: Started
 Jan 2, 2012 Project Member #4 leebel...@gmail.com
Now to critical due to NPEI project that must be completed by June 30.
 Labels: -Priority-High Priority-Critical
 Feb 2, 2012 Project Member #5 leebel...@gmail.com
Ran all of Acacia on the Tasmanian extent with equal site weight and the first 5 test
layers, and as far as I can tell got no GDM output other than the input:

http://spatial-dev.ala.org.au/output/gdm/1328233963170/

[DIR] Parent Directory                             -   
[   ] domain.grd              03-Feb-2012 12:52  311   
[   ] domain.gri              03-Feb-2012 12:52    0   
[TXT] gdm_params.txt          03-Feb-2012 12:52  673   
[TXT] species_points.csv      03-Feb-2012 12:52   33K  

 Feb 6, 2012 #6 ajay.ranipeta 
There was some code reverted. Process should now generate the species file with 
Longitude, Latitude, Species_name. 

Currently working on generating graphs and including them in the metadata file. 
 Feb 8, 2012 Project Member #7 leebel...@gmail.com 
1. Need to swap over to import/paste assemblage code (which it seems to tap
anyway)

2. Keeps cycling back to step 2 after mapping assemblage. Doesn't get to ask 
about environmental layers

3. Restrict mapped species to an area. Seems redundant as the area is defined
in step 1? (So the answer is always "yes".)

 Mar 1, 2012 #8 ajay.ranipeta 
latest code from Glenn now generates a segmentation fault.

metadata and charts should however be done (hopefully) now.
 Mar 13, 2012 #9 ajay.ranipeta 
In test.
 Status: InTest
 Mar 14, 2012 Project Member #10 leebel...@gmail.com
Looking good Ajay! A few comments

1. After designating layers should "working..." be replaced by a progress bar?

2. GDM options
Generate quantile from: Data                 [this is a statement, not an 
option?]
Use geographic distance as additional predictor: yes no
Use all site pairs  [this is also a statement as there is no option?]

3a. Naming prediction layers should default to something like "My GDM 
prediction"

3b. List of environmental layers is blank even though I selected the best 5. I 
can't get past Step 4

 Status: Started
 Mar 14, 2012 #11 ajay.ranipeta
1. Leaving it as "Processing... ", I think. There is really only one step, which
generates the domain grid and figures out the site pairs and should give you
the option. I could arbitrarily set it to, say, a minute, but it should take less
than a minute to process

2.
a) it was meant to be an option, but a default is now set as recommended by
Simon/Kristen. I've left it there more as information for the end
users

b) this is fine.

c) No, not a statement. You should be able to uncheck the box which gives you 
more parameters to play with.

3a/b. Yeah, step 1 didn't really work, so the whole process may not have
finished. The default layer name should come up as "My GDM".

testing and fixing the GDM issue now.
 Mar 14, 2012 #12 ajay.ranipeta 
somehow whoever updated the GDM code on dev didn't grab the latest from SVN. 
Have done that now and should all be fine. 

test again Lee.
 Status: InTest
 Mar 14, 2012 Project Member #13 leebel...@gmail.com
Thanks Ajay: A lot better.

1. The html looks good, but suggest the file be called gdm.html

2. ala.properties should be gdmparameters.txt.  Not sure how you want to 
differentiate this from gdm_params.txt

3. We need the transform grids as layers for further analysis (hover, sampling, 
scatterplot, classification, prediction). These I guess will just be scaled 0-1 
or 0-100. At the moment, there is nothing mapped from the run.

4. We need a Readme.txt file to describe all the files in the zip, as ever.

5. Would be good to substitute scientific name for species code in output 
(e.g., species frequency table)

6. Name for 'prediction layer' requested but not used?

 Status: Started
 Mar 14, 2012 #14 ajay.ranipeta
1. ok

2. no, i generate the ala.properties so i can keep track of something to help 
generate the html page.

3. umm.. 

4. waiting for a final confirmation from Kristen/Simon/Glenn to get back to us 
about the current implementation and if there are any changes and any final 
file generations.

5. I'll need to generate a csv file that provides an index for species code to 
scientific name/lsid

6. huh? the file prompted for the download has the layer name set. 
 Mar 27, 2012 #15 ajay.ranipeta 
updated dev to include the output transformed grids as layers on SP. This will
allow them to be available for:

- hover tool
- sampling
- other analysis tools

 Mar 27, 2012 #16 ajay.ranipeta 
(No comment was entered for this change.)
 Status: InTest
 Apr 1, 2012 Project Member #17 leebel...@gmail.com
No output layers mapped at the moment.
 Status: Started
 Apr 3, 2012 #18 ajay.ranipeta
(No comment was entered for this change.)
 Status: InTest
 Apr 3, 2012 Project Member #19 leebel...@gmail.com
If Acacia + Eucalyptus are used for Tasmanian extent, GDM step 1 reports 0 
records per cell. It used to report a more realistic range.
 Status: Started
 Apr 12, 2012 Project Member #20 leebel...@gmail.com
Thanks Ajay. A lot better. The output transformed layers are all however called 
"Transformed null"

 Apr 12, 2012 #21 ajay.ranipeta 
(No comment was entered for this change.)
 Status: InTest
 Apr 12, 2012 Project Member #22 leebel...@gmail.com
Looks great! I'll get Kristen to have a play now.

 Apr 19, 2012 Project Member #23 leebel...@gmail.com 
Kristen (April 19): Summary – issues with running more than default number of 
site pairs, classification of transformed grids didn’t work, additional 
outputs needed in zip file, HTML file needs a bit more work. 

1.  “records per cell” – this may be explained when we have the help files 
(when we write them) but intuitively “records” to me means the number of 
species by locations within a grid cell (I’m assuming this is a 1km grid 
cell). I think it would be more transparent to label this “taxa per cell” 
(do we mean species? are these matched-species, is there an option to choose 
matched species? it is important to know what the taxonomic unit is for GDM). 
I’m assuming the table represents the number of taxa per cell, rather than 
the number of records per cell. We had this conversation before and checked 
with Glenn. Even if Glenn uses the label “records per cell” and assuming 
the data is actually “taxa per cell” we should present the label as “taxa 
per cell”. 
2.   “Select a threshold to help generate the site pairs” this should be 
changed to something like: “Select the minimum number of taxa in a single 
grid-cell representing an assemblage to include”. The choice of threshold is 
designed to improve the quality of the data toward a “presence-absence” 
sample by removing grid cell (sites) from consideration, not so much to help 
with generating the site-pairs, but does reduce the number of sites considered 
in generating site-pairs. I used a threshold of ‘8’. 
3.  The bar for using all site pairs should also show the % site pairs (if easy 
to calculate on the fly)
4.  The button for “use all site pairs” – should say “choose the number 
of site pairs to use” or “use default number of site pairs”. I entered a 
number but it is not clear what would happen if I switched the button on or 
left it off. Are the button and bar interchangeable – one or the other, does 
one supersede the other? The default appears to be 1% (what was the rationale 
for the default?). The default should probably be set around 1 million site 
pairs or the number of available site pairs whichever is the lowest. For my 
random example the total site pairs is 4096000. The default is 40960. This 
wouldn’t be a big enough sample to model, so I chose 1045070. My analysis is 
for Corymbia with the 5 standard predictors. I choose weight by number of 
species. 
5.  Processing: after a short time a message comes back saying the server is 
temporarily out of services, and stops the analysis although one has to 
physically close the window. Seems to say something about a bad gateway…(is 
this a problem at my end related to my network and internet settings or 
software – I’m using Google Chrome - or at the ALA server end?) 
6.  I then had to start again. Perhaps the prelim analysis could be kept and 
become a set of assemblage points, and then I could start again where I left off 
and try a different number of site pairs? This time I try 539391 site-pairs, 
and note that the bar is not present until I switch off the “default”… 
again the server out of service message came up… and the result below appeared
7.  GDM processing can take a while, it might be better if it went into the 
background on the server and returned the user to the ALA interface and then 
produced an email or pop-up when the processing is complete? 
8.  The user may undertake several iterations with GDM in order to find the best 
set of variables to include in the model. Typically, I iterate in using the 
fitting function of the model and when I’m satisfied I produce the 
transformed grids. 
9.  I try again, this time using the default number of site pairs….this time 
it worked…and I have a look through the outputs, the transformed grids are 
also available for further analysis (or modelling)
10. I now create a classification based on my transformed predictors using 20 
groups…but I received a failed message – in working through the steps of 
the classification, on the last page, the layer set is not listed? Is this an 
indicator of why the classification is not working with the transformed grids? 
I tried twice and the classification step failed each time…
HTTP Status 404 - /webportal//error/HTTP_NOT_FOUND.html.var
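
A minimal sketch of the default proposed in point 4 above, assuming the total number of available site-pairs is already known from step 1. The function names and the `cap` parameter are illustrative only and are not taken from the Spatial Portal or GDM code.

```python
def default_site_pairs(total_pairs: int, cap: int = 1_000_000) -> int:
    """Suggested default: about 1 million site-pairs, or all of them if fewer exist."""
    if total_pairs < 0:
        raise ValueError("total_pairs must be non-negative")
    return min(total_pairs, cap)


def percent_of_total(used_pairs: int, total_pairs: int) -> float:
    """Percentage of available site-pairs used, for display next to the bar (point 3)."""
    if total_pairs <= 0:
        raise ValueError("total_pairs must be positive")
    return 100.0 * used_pairs / total_pairs


# Figures quoted in point 4 (hypothetical variable names).
total = 4_096_000
current_default = 40_960

print(percent_of_total(current_default, total))  # current default as a percentage
print(default_site_pairs(total))                 # proposed default for this run
print(default_site_pairs(250_000))               # small runs would use every site-pair
```

With the figures quoted above, the current default of 40960 is 1% of 4096000 site-pairs, and the proposed default would be 1000000.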

 Apr 19, 2012 #24 ajay.ranipeta 
Kristen (April 19): Metadata info

Comments on HTML report:

Under “your options” at the top of the HTML file need to include additional 
information about the parameters used in the analysis, please include:
-          Assemblage: (e.g. Corymbia) (plus include the records for the 
assemblage in the download; add the assemblage to the map in the spatial 
portal) I was able to create my own download but have no idea what the 
taxonomic unit in the GDM analysis is. Nor do I know what the list of species 
aggregated by grid cell is?

-          Number of unique taxa: # (include a list of taxa that can be related 
to the “code” in the species_points.csv file – we’ve talked through 
this before, the need to be able to identify the taxa used in the analysis – 
exactly what were these.)

-          Taxa resolution: subtaxa (i.e., is the cut-point of the taxa set at 
matched species or are all levels of taxa included?)

-          Grid-cell resolution: 0.01

-          Minimum number of taxa per grid cell: # (e.g. 8) (This is described 
as the cut point in GDM_parameters.txt)

-          Number of grid cells with taxa included in the model: #

-          Total number of site-pairs: # (e.g., 4096000)

-          Number of site-pairs used in the analysis: # (e.g. 40960)

-          Number of predictors used in the analysis: # (e.g. 5)

-          Number of I-splines per predictor: 3 (this is a default that is hard 
wired into this version of GDM)

Create a new section: “Model Summary” (this can be drawn from the file 
gdm_parameters.txt)
-          Intercept=0.612189

-          Null Deviance=70125.728738

-          GDM Deviance=54905.896804

-          Deviance Explained=21.703635

-          All Coefficients Summed=12.193876
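
As a cross-check on the figures above, Deviance Explained appears to be the percentage reduction from the null deviance to the GDM deviance. The sketch below assumes that formula, and assumes the summary is available as simple key=value lines like those listed here; the actual layout of gdm_parameters.txt is not confirmed, and the function names are hypothetical.

```python
def parse_summary(text: str) -> dict:
    """Parse 'Key=value' lines into a dict of floats (assumed layout)."""
    values = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without '='
            values[key.strip()] = float(value)
    return values


def deviance_explained(null_deviance: float, gdm_deviance: float) -> float:
    """Percentage of the null deviance accounted for by the fitted model."""
    return 100.0 * (null_deviance - gdm_deviance) / null_deviance


summary = parse_summary(
    "Intercept=0.612189\n"
    "Null Deviance=70125.728738\n"
    "GDM Deviance=54905.896804\n"
    "Deviance Explained=21.703635\n"
    "All Coefficients Summed=12.193876\n"
)

computed = deviance_explained(summary["Null Deviance"], summary["GDM Deviance"])
# The computed value agrees with the reported one to within 1e-5.
print(computed, summary["Deviance Explained"])
```

If the HTML report shows this section, the same calculation could serve as a sanity check that the values were read from the right run.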

Charts and text:
-          The cut-point.csv table could be presented first along with 
explanatory text so that users understand how to apply this parameter.

-          Observed versus predicted compositional dissimilarity (raw data 
plot): x-axis should not be labeled after a value of 1.0 (predicted values do 
not extend this far) and red line should end at a value of 1.0

-          “site pairs” or “site-pairs” inconsistent – I would prefer 
we used hyphenated “site-pairs”

-          Observed compositional dissimilarity vs predicted ecological 
distance (link function applied to the raw data plot): “The line represents 
the perfect 1:1 fit.” Not red curve and change to “The red curve represents 
the perfect 1:1 fit.”

-          Instead of “The scatter of points signifies noise in the 
relationship between the response and predictor variables.” Say “The 
scatter of points signifies residual variation and noise in the relationship 
between the response and predictor variables.” Not all of the scatter will be 
noise, some may be systematically correlated with variables not included in the 
model.

-          The plots of each predictor variable are not correctly linked into 
the HTML file…and links to the full plots as well as thumbnails would be 
handy – same as for maxent, from memory I think you can open the full plots 
by clicking on the thumb nails

-          Data list – need to include a list at the end of the HTML file 
describing each of the datasets provided in the zip file and what they mean. If 
Ajay could start with a list, I could draft the commentary and Glenn could 
check this is right (see attached spreadsheet for starters). I guess this is 
the objective of the “readme.html” which is presently blank.

-          Need lookup table relating “species code” to the actual taxon 
name used in the GDM analysis that would enable these to be matched to an ALA 
identifier in the data downloaded for an assemblage

-          Need a lookup table relating “EnvGrid#” to the layer name.  This 
can be inferred from the gdm_params.txt
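
For context on the two dissimilarity plots discussed above: in the published GDM formulation, predicted compositional dissimilarity is obtained from predicted ecological distance through the link function 1 - exp(-distance), which is why predicted values never exceed 1.0. The sketch below only illustrates that relationship; it is not taken from Glenn's code, and the function name is hypothetical.

```python
import math


def predicted_dissimilarity(ecological_distance: float) -> float:
    """GDM link function: map a non-negative ecological distance to the 0-1 range."""
    if ecological_distance < 0:
        raise ValueError("ecological distance must be non-negative")
    return 1.0 - math.exp(-ecological_distance)


# Predicted dissimilarity rises with ecological distance but levels off below 1.0,
# so a plot axis for predicted values has no data beyond 1.0.
for distance in (0.0, 0.5, 1.0, 2.0, 5.0):
    print(distance, round(predicted_dissimilarity(distance), 4))
```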

 May 27, 2012 Project Member #25 leebel...@gmail.com 
Kristen: I think it would be worth making step 4 GDM Options that produces the 
summary of species available with an Analysis ID. If the analysis fails (which 
it almost always does for me because I pick too large an extent or too many 
site pairs), one has to go back to scratch. If the entries can be saved at Step 
4, then one can make a small modification to the # site pairs and try again? 
Better still, run the GDM as a background job…so that most analyses produce a 
result. 

Lee: Ajay is currently working on background processing (#722).
 Status: Started
 Aug 21, 2012 Project Member #26 moyesyside
Lee - to my knowledge this is all done and in production. Is this correct?
 Owner: leebel...@gmail.com Cc: -adam_col...@tpg.com.au -leebel...@gmail.com
 Aug 21, 2012 Project Member #27 leebel...@gmail.com
No, GDM isn't complete. Kristen and I still have to discuss what may be 
required to at least tidy GDM up. Kristen has her head down on NPEI and 
probably a heap of other work so I'm leaving it till she pops her head up. 
Needless to say, I'm busier than I'd like to be at this stage as well.
 Cc: Kristen....@csiro.au moyesyside Labels: -Priority-Critical Priority-High
 Jul 8, 2013 Project Member #28 leebel...@gmail.com
Lowering (Medium) until we can get some of Kristen's time. Issues:

1. GDM help review/edit
2. GDM case study

Updates to GDM will need to wait for higher priority issues to be addressed.
 Owner: Kristen....@csiro.au Cc: -Kristen....@csiro.au leebel...@gmail.com Labels: -Type-Enhancement -Priority-High Type-Task Priority-Medium
Original issue reported on code.google.com by moyesyside on 8 Aug 2013 at 12:07