CatalogueOfLife / coldp


add geologic time range fields for fossils #17

Closed mdoering closed 5 years ago

mdoering commented 5 years ago

For palaeo taxa it is key to know the geologic time the organism was known to have lived, i.e. the range of geologic times it is known from the fossil record.

Implementation would be best based on a start and an end field (integer or double) representing million years (Ma). Alternatively, known geological time names like "Trias" or "Juras" would be an option for the start and end, as that is likely to be the most common format datasets are using.

mjy commented 5 years ago

I highly recommend you use the paleodb api format. They have thought things through, have an API and picklists, etc. The model amounts to 4-6 fields.

https://paleobiodb.org/data1.2

mdoering commented 5 years ago

Thanks Matt, can you point me to the specific bit that is relevant? I find their API a little difficult to use. As far as I can see they also base ranges on the earliest and latest million year, sometimes called min_ma/max_ma: https://paleobiodb.org/data1.2/intervals/list.txt?scale=1 Sometimes eag/lag: https://paleobiodb.org/data1.2/intervals/single.json?id=16

What I cannot find is any assertion about an organism or even single specimen to occur in some geological time.

mjy commented 5 years ago

We used these fields. For our schema we added them to a collecting event concept, i.e. they further "localize" where specifically the collection object was found.

For your purposes I believe you can add them to name_usage, but maybe distribution, or a new "temporal distribution".

 group     | character varying |
 formation | character varying |
 member    | character varying |
 lithology | character varying |
 max_ma    | numeric           |
 min_ma    | numeric           |

The idea is roughly this: if the user has exact data, they should use max_ma and min_ma, which hold the measured ages in million years. Since the start and end of named geological units change all the time (literally every year), these data will always let you adjust to those changes, i.e. you can calculate group and formation.

To display values that aren't specific, or to cache the labels that correspond to max/min, according to the curator's whim, you can include the other values (formation, member, etc.). IIRC we only pick from group and formation vocabs; member and lithology are a wild west and don't have picklists from PBDB.
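A minimal sketch of the "store Ma, derive the label" idea described above, assuming a small lookup table of named intervals. The `Interval` type, the `label_for_range` helper and the boundary values are illustrative assumptions, not taken from PBDB or any schema in this thread; in practice the table would be loaded from whichever timescale version is current.

```python
from typing import NamedTuple, Optional, Sequence


class Interval(NamedTuple):
    name: str
    max_ma: float  # older boundary, in million years
    min_ma: float  # younger boundary, in million years


# Illustrative, approximate boundaries only; swap this table when the
# official timescale is revised and the stored Ma values stay valid.
TIMESCALE = [
    Interval("Triassic", 251.9, 201.3),
    Interval("Jurassic", 201.3, 145.0),
    Interval("Cretaceous", 145.0, 66.0),
]


def label_for_range(max_ma: float, min_ma: float,
                    timescale: Sequence[Interval] = TIMESCALE) -> Optional[str]:
    """Return the narrowest named interval that fully contains the range."""
    if max_ma < min_ma:
        raise ValueError("max_ma is the older bound and must be >= min_ma")
    containing = [iv for iv in timescale
                  if iv.max_ma >= max_ma and iv.min_ma <= min_ma]
    if not containing:
        return None  # range spans more than one interval in this table
    return min(containing, key=lambda iv: iv.max_ma - iv.min_ma).name
```

With this table, `label_for_range(112.03, 93.5)` resolves to "Cretaceous", while a range straddling two intervals returns `None` and would need a start and an end label instead.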

mdoering commented 5 years ago

The International Commission on Stratigraphy is providing regular updates to the geological timescales.

mdoering commented 5 years ago

I would propose to either follow the terminology used on Wikipedia and call it temporalRange or call it livingPeriod just as the DwC Species Profile extension does.

Either as single range fields:
temporalRange: Early Cretaceous–Late Cretaceous
temporalRangeMa: 112.03–93.5

Or as dedicated start/end fields, which is the terminology used in the official UML model for the geological time scales:
temporalRangeStart: Early Cretaceous
temporalRangeEnd: Late Cretaceous
temporalRangeMaStart: 112.03
temporalRangeMaEnd: 93.5

My preference is the latter 4 fields.
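To make the four-field variant concrete, here is a rough sketch of how a record could be held and checked. The `TemporalRange` class and its `to_row` method are hypothetical names for illustration; only the four column names come from the proposal above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TemporalRange:
    """Hypothetical holder for the four proposed fields."""
    start: Optional[str] = None       # temporalRangeStart
    end: Optional[str] = None         # temporalRangeEnd
    ma_start: Optional[float] = None  # temporalRangeMaStart (older bound)
    ma_end: Optional[float] = None    # temporalRangeMaEnd (younger bound)

    def __post_init__(self) -> None:
        # Ma counts backwards from the present, so the start of the
        # range must be the larger (or equal) number.
        if (self.ma_start is not None and self.ma_end is not None
                and self.ma_start < self.ma_end):
            raise ValueError("temporalRangeMaStart must be >= temporalRangeMaEnd")

    def to_row(self) -> dict:
        """Map the values onto the proposed column names."""
        return {
            "temporalRangeStart": self.start,
            "temporalRangeEnd": self.end,
            "temporalRangeMaStart": self.ma_start,
            "temporalRangeMaEnd": self.ma_end,
        }
```

Keeping names and Ma values in separate fields means either pair can be left empty when a dataset only provides the other.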

mjy commented 5 years ago

4 fields better than 2 for sure.

My only thought is whether temporalRange is just going to be a mess of incomparable values? I.e. there are relatively usefully constrained (albeit not without problems) categorical data available that could be referenced a little more precisely. This is why we went with a couple of fixed values for the more standard temporal ranges that folks generally use. Again, as this is for species, and PBDB is arguably the place that has actually done the most practical modelling and data accumulation in this regard, it might be worthwhile to align with them (i.e. the official UML is nice, but who has implemented it with real data?). In other words, if there are clear references to some controlled vocabularies to be included in temporalRangeStart/End then I think things are fine; otherwise I suspect the data will be mostly useless when compiled.

Folks also want to reference the lithology (names for physical layers), as we know from working with our fossil curators. This is a different concept than temporal range, and that data needs to be clearly isolated. It might be worthwhile including a field so that it's clear the model understands the differences.

mdoering commented 5 years ago

Yes, I was definitely thinking to point to some controlled vocabulary like we do for ranks and other enumerations. Validation would pick up uncommon values, but we can easily write parsers that understand various values for the same canonical value just like we do with ranks.
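A rough sketch of what such a lenient parser could look like, assuming a tiny controlled vocabulary and a hand-made alias table. `CANONICAL`, `ALIASES` and `parse_period` are made up for this example and are not an existing CoL or GBIF API; a real vocabulary would be far larger.

```python
import re
from typing import Optional

# Tiny illustrative vocabulary; not an official list.
CANONICAL = {"Triassic", "Jurassic", "Early Cretaceous",
             "Late Cretaceous", "Holocene"}

# Hand-made aliases (assumption): spelling variants seen in datasets.
ALIASES = {
    "trias": "Triassic",
    "jura": "Jurassic",
    "lower cretaceous": "Early Cretaceous",
    "upper cretaceous": "Late Cretaceous",
}


def _key(raw: str) -> str:
    """Lower-case and collapse whitespace, underscores and hyphens."""
    return re.sub(r"[\s_\-]+", " ", raw.strip().lower())


_LOOKUP = {_key(name): name for name in CANONICAL}
_LOOKUP.update(ALIASES)


def parse_period(raw: str) -> Optional[str]:
    """Return the canonical name, or None so validation can flag the value."""
    return _LOOKUP.get(_key(raw))
```

Unknown spellings come back as `None` rather than being guessed, which leaves it to validation to report them.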

mdoering commented 5 years ago

Do you think it is relevant for species information to list the lithology / rock formations it is known to be found in? Is that maybe something more for the distribution records? Well, not sure if the distribution as it is makes much sense for fossils, to be honest.

Example lithology from palaeodb:

lime mudstone,sandstone,mudstone,"shale",claystone/"limestone",sandstone/"shale",sandstone/siltstone

mjy commented 5 years ago

It is distribution related in my mind. All this data is temporal/spatial distribution.

If I have a taxon T, then I can localize it in time/space. I can localize space to a finer degree if I know what "stone" in what area is being referenced. Maybe not important for CoL, but was definitely requested by our paleo folks. Haven't seen what they are putting in it.

As a side note, asserting "fossil" has always seemed a little odd to me. Being a fossil can be inferred with relatively high probability based on distribution?

mdoering commented 5 years ago

The more I think about it, the more I feel lithology should be a well interpreted field in GBIF to search specimens on. @timrobertson100, something to pick up?

What CoL is dealing with is mostly the summarized expert opinion rather than the underlying primary data. Individual specimens and occurrences would obviously allow you to derive the ranges in time & space.

mdoering commented 5 years ago

Summarising again the various terminologies found in existing fossil resources:

PaleoBioDB uses at least 3 slightly different conventions across its API and webpages: "age range", min_ma & max_ma for min/max million years, and eag & lag for earliest/latest age.

An interesting blog post about PaleoBioDB's fossil age data mentions first and last appearance dates (FADs and LADs) of taxa.

GBIF and DwC have used livingPeriod for a decade now. It also applies better to extant species, for which "age" is usually understood as a property of an individual or as the average/max age of a species.

Wikipedia uses temporalRange for the well known species pages infoboxes.

The International Fossil Plant Names Index again uses Stratigraphy.

mdoering commented 5 years ago

As all geological temporal information is ultimately tied to stratigraphy and the geological periods, I would say we don't even need to exchange the million years directly but should rather exchange only the stable and well defined geological time periods. Mixing pure Ma data from different sources that are likely to use different versions of period definitions can easily cause confusion otherwise.

I would avoid the use of age and stratigraphy as we also deal with extant species. That would leave the following options:

temporalRangeStart: Early Cretaceous
temporalRangeEnd: Late Cretaceous

or

livingPeriodStart: Holocene
livingPeriodEnd: Holocene

with temporalRange being my preference for a rather neutral term.

mjy commented 5 years ago

I like temporalRange.

mdoering commented 5 years ago

I am trying to get all geochronological times from PBDB via their API. This works fine for scale=1: https://paleobiodb.org/data1.2/intervals/list.txt?scale=1 But any other scales do not return anything, even though they claim to have 5. Ah, you can get them all with scale=all: https://paleobiodb.org/data1.2/intervals/list.txt?scale=all And that shows all others have no scale/unit/rank given, a pity.
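For anyone wanting to reproduce this, a rough sketch of pulling the scale=all list and turning it into a name-to-range lookup. It assumes the .txt endpoint returns comma-separated text with a header row; the column names `interval_name`, `max_ma` and `min_ma` are assumptions based on the naming mentioned earlier in this thread and should be checked against the actual header. `SAMPLE` is hand-written, approximate data in that assumed layout, not real API output.

```python
import csv
import io
from urllib.request import urlopen

PBDB_INTERVALS_URL = "https://paleobiodb.org/data1.2/intervals/list.txt?scale=all"


def fetch_text(url: str = PBDB_INTERVALS_URL, timeout: float = 30.0) -> str:
    """Download the interval list as text (needs network access)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_intervals(csv_text: str,
                    name_col: str = "interval_name",  # assumed column name
                    max_col: str = "max_ma",          # assumed column name
                    min_col: str = "min_ma") -> dict:  # assumed column name
    """Map interval name -> (max_ma, min_ma), skipping incomplete rows."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = (row.get(name_col) or "").strip()
        if not name:
            continue
        try:
            result[name] = (float(row[max_col]), float(row[min_col]))
        except (KeyError, TypeError, ValueError):
            continue  # missing or non-numeric age columns
    return result


# Hand-written sample in the assumed layout, for illustration only.
SAMPLE = (
    '"interval_name","max_ma","min_ma"\n'
    '"Cretaceous","145.0","66.0"\n'
    '"Jurassic","201.3","145.0"\n'
    '"Unknown","",""\n'
)
```

`parse_intervals(fetch_text())` would be the live call; `parse_intervals(SAMPLE)` shows the resulting shape without touching the network.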

Wikipedia knows about many from various sources, not just ICS. But I cannot find a CSV or easily parsable source for their data yet. DBpedia should have that: https://en.wikipedia.org/wiki/List_of_geochronologic_names#cite_note-1

INSPIRE provides them: http://inspire.ec.europa.eu/codelist/GeochronologicEraValue

German ones can be found here: https://www.geokartieranleitung.de/Fachliche-Grundlagen/Stratigraphie-Kartiereinheiten/Stratigraphie-der-Bundesrepublik/Chronostratigraphische-Einheiten

GeoSciML has the 2017 edition of ISC in various linked data formats: http://resource.geosciml.org/vocabulary/timescale/isc2017.jsonld

ISC unfortunately seems to publish only PDFs and images, nothing to parse really.