PecanProject / pecan

The Predictive Ecosystem Analyzer (PEcAn) is an integrated ecological bioinformatics toolbox.
www.pecanproject.org

Scrape Ameriflux sites -> new BETY sites #271

Closed mdietze closed 8 years ago

mdietze commented 9 years ago

We need to be able to grab site info (names, codes, lat/lon, etc) off the Ameriflux and FLUXNET pages and use that to create new BETY sites. This script will need to check the existing sites to make sure we don't create duplicates. It would be something that would be good to run periodically to catch new sites as they join the Ameriflux network.

Could start with something like...

```r
library("RCurl")
library("XML")
FLUXNET_html  <- getURL("http://fluxnet.ornl.gov/site_status")  ## grab raw HTML
FLUXNET_table <- readHTMLTable(FLUXNET_html)[[1]]               ## grab first table on the page
```

It is safe to assume any site whose code starts with US- is part of Ameriflux (though Ameriflux also includes a few non-US sites). From the FLUXNET page we would have to recurse into the individual site pages to get the lat/lon. The Ameriflux sites are listed at:

http://ameriflux.lbl.gov/sites/site-list-and-pages/

and have all the relevant site info right there, but they're formatted in a different way that's beyond my rudimentary web-scraping skills.
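The US- filter and the duplicate check against existing BETY sites could be sketched roughly like this. The data frames here are stand-ins (the real `FLUXNET_table` would come from `readHTMLTable()` above, and `bety_sites` from a BETY query); the column names are hypothetical:

```r
## Stand-in for the scraped FLUXNET table; in practice this comes from readHTMLTable()
FLUXNET_table <- data.frame(Site.Code = c("US-WCr", "US-Syv", "BR-Sa3", "IT-Ro1"),
                            stringsAsFactors = FALSE)
## Stand-in for the existing BETY sites table
bety_sites <- data.frame(sitename = c("Willow Creek (US-WCr)"),
                         stringsAsFactors = FALSE)

## assume codes starting with "US-" are Ameriflux
ameriflux <- FLUXNET_table[grepl("^US-", FLUXNET_table$Site.Code), , drop = FALSE]

## skip any code already embedded in an existing BETY sitename
is_new <- !sapply(ameriflux$Site.Code,
                  function(code) any(grepl(code, bety_sites$sitename, fixed = TRUE)))
new_sites <- ameriflux$Site.Code[is_new]
```

With the stand-in data this leaves only the codes not yet present in BETY, which is the set the periodic run would insert.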

ankurdesai commented 9 years ago

It might be easier (and more up to date) to work with: http://ameriflux.lbl.gov/sites/site-list-and-pages/ for Ameriflux.


robkooper commented 9 years ago

I spoke too soon; the page is created dynamically. So we might have to download it to local disk with a browser first and then scrape it.

Using the table pointed to, it is not that hard: get the page, find the element named siteTable, get its tbody, and then each tr is a site with the information below; we just need to place it in the right locations. We need to figure out a way to store both the short name (BR-Sa3 in this case) and the long name (Santarem-Km83-Logged Forest). We could even create the entries in the inputs table already, since we know the exact start/end times.

<tr>
<td><a href="http://ameriflux-data.lbl.gov:8080/SitePages/siteInfo.aspx?BR-Sa3">BR-Sa3</a></td>
<td><a href="http://ameriflux-data.lbl.gov:8080/SitePages/siteInfo.aspx?BR-Sa3">Santarem-Km83-Logged Forest</a></td>
<td>2000</td>
<td>2003</td>
<td>-3.0180</td>
<td>-54.9714</td>
<td>100.00</td>
<td>EBF</td>
<td>Am</td>
<td>26.12</td>
<td>2044</td>
</tr>
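The recipe above can be sketched in R with the XML package, using the sample row as input; the field names in the `site` list are hypothetical, and the column order is taken directly from the sample:

```r
library(XML)

## the sample <tr> from above, wrapped in a minimal table so htmlParse accepts it
row_html <- '<table><tr>
<td><a href="http://ameriflux-data.lbl.gov:8080/SitePages/siteInfo.aspx?BR-Sa3">BR-Sa3</a></td>
<td><a href="http://ameriflux-data.lbl.gov:8080/SitePages/siteInfo.aspx?BR-Sa3">Santarem-Km83-Logged Forest</a></td>
<td>2000</td><td>2003</td><td>-3.0180</td><td>-54.9714</td>
<td>100.00</td><td>EBF</td><td>Am</td><td>26.12</td><td>2044</td>
</tr></table>'

doc   <- htmlParse(row_html, asText = TRUE)
cells <- xpathSApply(doc, "//tr/td", xmlValue)  ## one string per <td>

site <- list(code       = cells[1],            ## short name, e.g. BR-Sa3
             long_name  = cells[2],            ## long name
             start_year = as.integer(cells[3]),
             end_year   = as.integer(cells[4]),
             lat        = as.numeric(cells[5]),
             lon        = as.numeric(cells[6]))
```

On the live page the same XPath would run against the tbody of siteTable rather than a literal string, looping over `//table[@id="siteTable"]//tr` (id is an assumption about the real markup).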
mdietze commented 9 years ago

It looks like all but 1 of the Ameriflux sites already in the BETY sites table are stored with a site name of the form:

long_name (code)

As discussed during a recent telecon, I think we'll eventually want to create a new table that links all sites within a network (e.g. a table called "sitenetworks") that would allow us to know that these are Ameriflux (or FLUXNET, NEON, LTER, etc.) sites. I'm personally fine with parsing the site code out of the site name. We could also create a tag in the notes column (e.g. AMERIFLUX_ID="US-WCr") that would allow us to unambiguously parse out site codes from different networks. Prior to creating a sitenetworks table, this could also be used to (inefficiently) look up Ameriflux sites.
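Both conventions are mechanical to parse; a minimal base-R sketch (the `sitename` and `notes` values are hypothetical examples following the formats described above):

```r
## sitename convention: long_name (code)
sitename  <- "Willow Creek (US-WCr)"                 ## hypothetical example
long_name <- trimws(sub("\\s*\\([^)]*\\)\\s*$", "", sitename))
code      <- sub("^.*\\(([^)]*)\\)\\s*$", "\\1", sitename)

## notes-column tag convention: NETWORK_ID="code"
notes <- 'planted 1990. AMERIFLUX_ID="US-WCr"'       ## hypothetical example
tag   <- regmatches(notes, regexpr('AMERIFLUX_ID="[^"]*"', notes))
ameriflux_id <- sub('AMERIFLUX_ID="([^"]*)"', "\\1", tag)
```

The notes-column tag is the unambiguous one, since a trailing parenthetical in a site name is not guaranteed to be a network code.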

robkooper commented 9 years ago

Still need the table.

dlebauer commented 9 years ago

What would the sitenetworks table look like? Could it be handled by the sitename field in the sites table if network sites were consistently named, such as "long_name: network code", which would rename "Sylvania Wilderness (US-Syv)" to "Sylvania Wilderness: Ameriflux US-Syv"?

robkooper commented 9 years ago

No, for two reasons: we should not overload the sitename with metadata, and we should also be able to associate multiple site networks with the same site.

ankurdesai commented 9 years ago

For example, Willow Creek is part of ChEAS, Ameriflux, and FLUXNET, not to mention the U.S. Forest Service climate tower network, the NEON Great Lakes domain, and so on. It would be highly useful to subset sites by network/domain for various types of runs and science questions (e.g. observing system simulation experiments, regions, runs based on data availability). This would also be one way to make it easier to tie network data policies to sites.
