Fireandplants / plant_gbif

This repository is for data and scripts related to plant species distribution across the globe using the Global Biodiversity Information Facility (GBIF) dataset.

GBIF data cleaning protocol #3

Open dschwilk opened 10 years ago

dschwilk commented 10 years ago

There has been ongoing discussion among everyone planning to use the location data about how best to clean it, deal with human impact, plantations, invasives, etc. See the comments in this issue thread: https://github.com/Fireandplants/bigphylo/issues/1, and the email thread involving people from all the phylogeny groups. The steps below are based on suggestions made by Sally, Caroline, Beth, Amy, Michelle, Dan, and others. The needs and available data differ between the big-phylogeny and detailed-phylogeny projects, as well as among the detailed ones, but this is an attempt at a common protocol.

A proposed process (items 6-8 added later by @dmcglinn; a code sketch follows the list):

  1. Filter by location precision (number of lat/lon decimal places).
  2. Filter on human impact (method via Sally and Beth). Pick a cutoff? 30? http://sedac.ciesin.columbia.edu/data/set/wildareas-v2-human-footprint-geographic
  3. Filter on transformed landscape (should be less restrictive than the above, but intersect anyway): https://lpdaac.usgs.gov/products/modis_products_table/mcd12q1
  4. Possible filter on invasiveness? A fuzzy filter based on word matching in GBIF records (which fields?), or on a list of invasives? See Michelle Greve's email of 3/07/2014 and other emails in the thread. I'm less optimistic about this for the bigphylo project, but a simple version may be possible; it may be quite viable for the detailed phylogeny projects. See the last set of comments under https://github.com/Fireandplants/bigphylo/issues/1
  5. Filter to species with >= 100 location records (relax slightly?).
  6. Filter records near GBIF headquarters: apparently some records are assigned the coordinates of GBIF HQ when no other coordinate is available (per someone from the Tempo-Mode working group).
  7. Filter records that fall in the ocean (obvious but important).
  8. Flag records whose coordinates don't match the recorded country and continent, as this can indicate a coordinate error.
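For concreteness, a minimal base-R sketch of the two non-spatial steps (1 and 5); the data frame `occ`, its column names, and the cutoffs are assumptions, not the project's actual scripts:

```r
# Sketch of steps 1 and 5 only; `occ`, its columns, and the cutoffs are
# assumptions. Coordinates are assumed to have been read as character
# strings so that trailing zeros (reported precision) are preserved.
decimal_places <- function(x) {
  # "12.34" -> 2; "12" -> 0
  ifelse(grepl("\\.", x), nchar(sub(".*\\.", "", x)), 0L)
}

# Step 1: require at least two decimal places (~1 km at the equator)
occ <- occ[decimal_places(occ$decimalLatitude) >= 2 &
           decimal_places(occ$decimalLongitude) >= 2, ]

# Step 5: keep only species with at least 100 remaining location records
counts <- table(occ$species)
occ <- occ[occ$species %in% names(counts)[counts >= 100], ]
```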
dmcglinn commented 10 years ago

Great start here Dylan, I just added items 6-8 to your list. Also, to reiterate, I already have R scripts to carry out some of this filtering, plus additional obvious filtering such as removing records that don't have coordinates, etc.
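A sketch of that "obvious" coordinate filtering (not Dan's actual script; `occ` and numeric coordinate columns are assumptions):

```r
# Drop records lacking usable coordinates (numeric columns assumed)
occ <- occ[!is.na(occ$decimalLatitude) & !is.na(occ$decimalLongitude), ]
occ <- occ[abs(occ$decimalLatitude) <= 90 &
           abs(occ$decimalLongitude) <= 180, ]
# 0,0 is a common placeholder coordinate worth dropping as well
occ <- occ[!(occ$decimalLatitude == 0 & occ$decimalLongitude == 0), ]
```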

AmyZanne commented 10 years ago

Thanks Dylan!

We also potentially want to filter for the past 20 years? Not always but in some analyses?

dschwilk commented 10 years ago

Thanks guys. I added those changes to a text file: https://github.com/Fireandplants/plant_gbif/blob/master/protocols/gbif_cleaning.txt

ejforrestel commented 10 years ago

Hi All~

What about filtering by date for those species for which we have an excess of points (take the last 20 years), and for those where we have fewer than 100 or so, using all the dates? This would maximize the number of species we get.

I also think the cutoff can be a bit lower than 100; it might give us lower confidence in the estimates, but it would give us some info in a lot of cases.
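A base-R sketch of that tiered rule (the `occ` data frame, column names, the cutoff of 100, and a 2014 reference year are all assumptions):

```r
# Tiered date filter: species with >= 100 records keep only the last 20
# years; data-poor species keep all dates. Names and the 2014 reference
# year are assumptions, not the project's actual code.
yr <- suppressWarnings(as.integer(substr(occ$eventDate, 1, 4)))
n_per_species <- ave(seq_len(nrow(occ)), occ$species, FUN = length)
occ <- occ[n_per_species < 100 | (!is.na(yr) & yr >= 1994), ]
```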

This all sounds great though!

Best, Beth


dmcglinn commented 10 years ago

hey @dschwilk do you mind if we change

https://github.com/Fireandplants/plant_gbif/blob/master/protocols/gbif_cleaning.txt

to a markdown file format so it can be rendered on GitHub, for those who don't want to do a pull and read it in a text editor?

dschwilk commented 10 years ago

Will do. Yes, sorry, I just cut and pasted my org-mode text (I am an Emacs addict). Changing it just now: https://github.com/Fireandplants/plant_gbif/blob/master/protocols/gbif_cleaning.md

dschwilk commented 9 years ago

Moving this comment to the correct issue:

Hi Dan, one more note about the extracted records: you will have records for which the "tankname" field is an empty string. These are matches to synonyms created by the expansion script that could not be unambiguously matched back to a canonical name (an unavoidable issue with the structure of TPL, forcing binomials, and our canonical list). There may be a way to flag these during the expansion step and avoid ever doing fuzzy matching on them. I probably should have had the occurrence-extraction script throw them out to reduce file size; if I rerun it, I can do that. But if you use the 141026 data, just ignore any row with an empty tankname, as we can't use those on the phylogeny.
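In R, that exclusion could be as simple as the following (a sketch; the data frame and column name are assumed):

```r
# Drop rows whose tankname is NA or an empty string, per the note above
occ <- occ[!is.na(occ$tankname) & nzchar(occ$tankname), ]
```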

dmcglinn commented 9 years ago

Hey @dschwilk @ejforrestel @AmyZanne @Carolin3L @SallyArchibald. I think we need to discuss additional filters for the spatial coordinates. Erica Edwards has reanalyzed the Zanne et al. Nature paper, and one of her criticisms concerns spatial filters we neglected to apply (see her code comments). Specifically, she has suggested spatial filters in addition to the ones we have already developed (./plant_gbif/protocols/gbif_cleaning.md):

My scrubbing for the Zanne et al. paper (essentially what we proposed in ./plant_gbif/protocols/gbif_cleaning.md) tossed out about 25% of the raw dataset; Edwards's filters toss out 70% of the raw data.

dschwilk commented 9 years ago

Thanks Dan,

These sound reasonable. Related to "duplicated" records: there is also the problem of major spatial biases in the data, but I'm not sure how best to deal with these. Does anyone have ideas? Do we subsample to try to "even out" geographic coverage? This is not really a data-cleaning step, but it is a related issue.

ejforrestel commented 9 years ago

I agree that most of those steps make sense -- they will result in a much-reduced dataset, but there will still be a lot there as far as representative species go!

We could keep the full dataset and then play around with subsetting it for equal spatial representation if that is a relevant concern for the final analyses!


SallyArchibald commented 9 years ago

Hi. I agree. Caroline had already suggested scrubbing on herbarium locations, and I think that is essential, as is removing the inaccurate geolocations.

The duplicate records are less of a problem and would be resolved by a subsampling procedure anyway.

Losing large swathes of data is only an issue if we are losing it from the poorly sampled areas. The spatial bias in the grass data was huge: about 70 percent of the data came from Europe.

Let's discuss subsampling later, but try to clean out all inaccurate geolocations.

Sally

Carolin3L commented 9 years ago

Hi,

Yes, absolutely, totally sensible, and I'd rather have the best dataset possible even if it is much smaller. Good on Erika for looking at the data closely!

Just out of curiosity, have the data been scrubbed according to land cover and urban areas yet? This seems like the first cut, as it alone must exclude at least 50% of the data. Would this help reduce processing time for subsequent data cleaning, or is that of no consequence?

Let's assess just how spatially uneven the data are once they're cleaned for land cover and urban areas. Poorly sampled areas tend to be regions with less land transformation, so we might not have such a large spatial-unevenness issue once this step has been done.

Cheers, Caroline


dmcglinn commented 9 years ago

Thanks everyone for your helpful comments! I looked back at Erica's scrubbing output, and 99.9% of the records dropped from the reanalysis were due to the 'duplication' classification rather than to the additional spatial filtering. So I emailed the GBIF folks this morning regarding the duplication problem, and here is the response I received from Tim Robertson of GBIF:

GBIF.org represents an index of data published by the institutions participating in the GBIF network. Data are indexed by dataset, and on the first indexing of a new dataset, records are assigned an integer by the system. On subsequent indexing of the same dataset, the system attempts to ensure that records are updated and not duplicated by observing the identifier(s) provided by the original data holder; this is a reasonably complex process which is not without some errors.

There are therefore several cases where a true, or perceived, duplicate record may exist. The following lists some examples, all of which happen; some we try to clean with urgency, others we do not:

i) a dataset is inadvertently registered, and thus indexed twice (we address this as soon as possible)
ii) the publisher changes their record identifier scheme (we address this as soon as possible)
iii) the dataset is pre-aggregated in another dataset which is also published (e.g. same as i) (we address this as soon as possible)
iv) the physical material is split and sent to 2 institutions, both of which catalogue it and serve it to GBIF
v) an individual of a species is observed at the same place/time by more than 1 individual (e.g. a bird hide)
vi) multiple individuals of a species are observed on the same day at the same time by 1 or more individuals (e.g. a bird hide, bullet ii)
vii) an individual is being tracked (e.g. radio collar)
viii) data are collected but recorded only to a grid-based system, with the centroid of the grid being shared by all records

Depending on the use case one can argue if these are true duplicates or not.

I would suggest you consider defining what a duplicate is in your scenario and use the data accordingly. For example, given the opportunistic nature of data aggregation in GBIF, you may consider grouping data by location, taxon and date, thus removing the likes of (v) and (vi) above. Similarly you might consider "gridding" the data (e.g. to 0.25 decimal degrees) and taking only 1 representative record per grid per day per species. I provide these examples only for illustrative purposes, but it is along these lines that people wrangle the data.

Sally what did you have in mind when you suggested duplicates would be handled by sub-sampling?

Tim's first suggestion is basically what Erica has done, although she didn't take date into consideration. I'm assuming that Tim's second suggestion of "gridding" the data isn't what we want, due to the patchiness of fire features. At how fine a spatial resolution does the fire classification appear to be accurate? This will be critical for setting how precise the GBIF coordinates need to be.

If we decide to take Erica's approach, we could either do as she has done and simply lump all records from a given coordinate into just one record, or parse out records by date. The date field can be sparse, but in our previous discussions we had already suggested using only the last 20 years for taxa with >100 records and all the data for other taxa. What if instead we just draw the line at no records older than 50 years, in an attempt to simplify?
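For reference, a base-R sketch of both options (the column names and the 2014 reference year are assumptions):

```r
# Option A (Erica's approach): one record per species x coordinate
key <- with(occ, paste(species, decimalLatitude, decimalLongitude))
occ_lumped <- occ[!duplicated(key), ]

# Option B: drop records older than 50 years (relative to a 2014 analysis)
yr <- suppressWarnings(as.integer(substr(occ$eventDate, 1, 4)))
occ_recent <- occ[!is.na(yr) & yr >= 1964, ]
```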

The data have not yet been scrubbed for land cover. It was suggested we could use these two layers:

- http://sedac.ciesin.columbia.edu/data/set/wildareas-v2-human-footprint-geographic
- https://lpdaac.usgs.gov/products/modis_products_table/mcd12q1

Lastly, am I correct that we decided not to try to filter out invasive species because it was too logistically difficult?

Thanks! Dan

dschwilk commented 9 years ago

Hi Dan,

To address just the non-native/invasive species issue for now: yes, we will not try to filter these. In fact, in most cases I would argue that we would not want to, since those records document a species that can perform in that environment.

-Dylan


SallyArchibald commented 9 years ago

Hi All,

Thanks for the clarity, Dan. The finest resolution of the fire data is 500 m, so we want to keep as much spatial information as possible; in fact, records that are clearly only accurate to 0.25 degrees etc. should be weeded out if possible (e.g. exclude records with only one decimal place).

The suggestion of using only records from the last 20 years was mostly to avoid badly located data, but also to keep the location records roughly within the time period of the fire data (only the last 14 years). I suggest keeping all the data (i.e. not cleaning by date yet) but, where we have duplicates, keeping the one with the most recent date.

As well as the two land cover datasets mentioned by Dan, Beth made another suggestion: http://www.earthenv.org/landcover.html. It uses all the data from the other datasets plus more. Unfortunately, however, it lumps cultivated and managed land together: lots of natural rangelands that burn and still have their indigenous vegetation cover are grouped with totally transformed cultivated lands. I copy some examples below (white means 100% cultivated/managed land, and it excludes large swathes of natural vegetation, in Africa at least). Beth, how did you deal with this?

The human footprint index uses land cover data plus a whole lot of other things (like night lights). It would be better at differentiating transformed land from rangelands, but it is an "index", so we don't really know exactly what we are getting.

So perhaps the simplest option would be to go back to using the MODIS land cover product, but I am not sure what its classes are. I am cc'ing Glenn here to ask whether he has a copy of the data that I can have a look at. Glenn, is there a category just for "transformed land" in the MODIS product?

So basically I suggest:

1. Just lump all data at the same coordinate together, but give each lumped record the date of its most recent record in case we want to filter by date later (a sketch follows below).
2. Don't worry about invasive species, for the reasons stated by Dylan.
3. Wait until we can have a look at the MODIS land cover before deciding which product/land-cover class to filter by. Any opinions about this would be welcomed.
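A base-R sketch of point 1 (column names assumed; sorting first means the most recent date survives the deduplication):

```r
# One record per species x coordinate, keeping the most recent date;
# order() puts NA dates last, so dated records win the deduplication.
occ <- occ[order(occ$eventDate, decreasing = TRUE), ]
key <- with(occ, paste(species, decimalLatitude, decimalLongitude))
occ <- occ[!duplicated(key), ]
```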

Thanks

Sally

[Inline images: example maps of the EarthEnv cultivated/managed land class; white = 100% cultivated/managed]


ejforrestel commented 9 years ago

While there is a separate croplands designation in the MODIS product (see link below), natural vegetation is intermixed with some croplands too, so you run into some of the same problems Sally mentioned.

https://lpdaac.usgs.gov/products/modis_products_table/mcd12q1

You could filter by the MODIS layer, leaving the croplands/natural vegetation mosaic out, as well as using a cutoff for the human footprint? This may be stringent enough.

-Beth


SallyArchibald commented 9 years ago

OK yes, I would then use the MODIS layer to exclude only classes 12 and 13 (croplands, and urban and built-up), i.e. not exclude croplands mixed with natural vegetation, as these would still burn and the vegetation present would still be able to cope with the burning regime.

But then we could add the human impact index as an additional filter just to be sure. I would have to decide on a threshold for the HI index: I think Caroline and I used 30, but I would suggest a value of 20 or 25 if we are also including a separate land cover filter.

Glenn, do you have a nice clean global MODIS land cover dataset we can use? Otherwise it is work to fit all the tiles together.

Sally

Class 12: Croplands
Class 13: Urban and built-up
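A sketch of this combined filter using the raster and sp packages (file names are placeholders, and the rasters are assumed to share the points' WGS84 lat/long datum):

```r
library(raster)
library(sp)

# File names are hypothetical; class codes 12/13 and the HI cutoff of 25
# follow the discussion above.
modis_lc <- raster("mcd12q1_igbp_global_mosaic.tif")  # IGBP class codes
hfp      <- raster("human_footprint_v2.tif")          # human footprint index

pts <- SpatialPoints(occ[, c("decimalLongitude", "decimalLatitude")],
                     proj4string = CRS("+proj=longlat +datum=WGS84"))

lc <- extract(modis_lc, pts)
hf <- extract(hfp, pts)

# Drop croplands (12) and urban/built-up (13); cap the HI index at 25
occ <- occ[!(lc %in% c(12, 13)) & !is.na(hf) & hf <= 25, ]
```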


SallyArchibald commented 9 years ago

Hi all,

The MODIS land cover map is available at two resolutions: ~500 m (MCD12Q1, available at https://lpdaac.usgs.gov/products/modis_products_table/mcd12q1) and ~5 km (MCD12C1, https://lpdaac.usgs.gov/products/modis_products_table/mcd12c1). The 500 m product is so large when mosaicked for the globe at full resolution that I could not send it onwards; best is to download it via FTP from the site above and then use the tool provided by NASA (https://lpdaac.usgs.gov/tools/modis_reprojection_tool) to resample and mosaic it to whatever resolution and projection you desire.

I have a mosaicked version of the 5 km product that I could send onwards, but it would probably take me some time to upload it to a server accessible to all. It might be faster to just download from MODIS and use the NASA tool to resample, mosaic and reproject (I find this tool quite handy and pretty straightforward to use). Attached is an image of what the MODIS land cover map looks like for the globe.
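If the MRT tool proves awkward, a rough alternative sketch with the raster package (file name hypothetical) is to coarsen the 500 m categorical grid by majority vote rather than reprojecting:

```r
library(raster)
# Hypothetical mosaic file; aggregate by a factor of 10 (~5 km), taking
# the modal (most frequent) land-cover class in each coarse cell.
lc500 <- raster("mcd12q1_global_mosaic_500m.tif")
lc5km <- aggregate(lc500, fact = 10, fun = modal)
```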

Cheers Glenn


dmcglinn commented 9 years ago

Sounds good; thanks for the guidance, Sally. Glenn, I'll check out those links and will let you know if I have any trouble with the MODIS layers.

Thanks! Dan