EDIorg / ECC

ECC = EML Congruence Checker

Did 2019 hackathon identify any new potential checks? #25

Open mobb opened 5 years ago

mobb commented 5 years ago

Ask the hackathon participants to comment on this issue.

atn38 commented 5 years ago

(This might overlap with planned or existing checks, coming in hot with little context here). Primarily we found that congruence between attributes as listed in metadata and in data is crucial for writing tools to leverage EML. Otherwise code needs to go an extra mile to match up the two sources of information and may not be reliable. @clnsmth concurs here.

These need to be at least WARNs if not ERRORs:

  • Same number of attributes as columns in data
  • Set of attributeNames in EML match set of column names in data table
  • Order of attributeNames in EML (first to last in attributeList) match column names left to right
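To show why this congruence matters for tooling, here is a minimal sketch (file names are hypothetical, and pandas plus a single header row are assumptions) of the kind of code we would like to be able to write against any data package, treating the attributeList as the single source of column names:

```python
# Minimal sketch, not production code: read a data table using the EML
# attributeList as the source of column names. "metadata.xml" and
# "table.csv" are hypothetical; namespace handling is omitted for brevity.
import xml.etree.ElementTree as ET
import pandas as pd

eml = ET.parse("metadata.xml")
attribute_names = [
    a.text.strip()
    for a in eml.findall(".//dataTable/attributeList/attribute/attributeName")
]

# If metadata and data are congruent, the attributeNames can be applied
# directly, skipping the file's own single header row.
df = pd.read_csv("table.csv", skiprows=1, header=None, names=attribute_names)

# If they are not congruent (wrong count or wrong order), this silently
# mislabels columns -- which is why the checks above matter.
print(df.dtypes)
```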

mobb commented 5 years ago

Thanks so much for these ideas! It's great to see more people attempt systematic, programmatic access to data. EML was designed for that. These suggestions have come up before, with lots of interesting discussion. And the fact that the hackathon group is working hard on viz makes them very current.

Fair warning - the below is a bit of a monolog.

For a refresher on the ECC (and a history lesson for the new DMs), see http://dx.doi.org/10.1016/j.ecoinf.2016.08.001

A few comments below on the current state of checks related to these issues:

  • Same number of attributes as columns in data

  • There are already several checks which attempt to ensure this. In any of your reports, see the checks called tooFewFields and tooManyFields.

There is an edge case where that situation isn't caught, so if you have examples of datasets where the number of columns does not match the number of attributes, please add the ids to this thread. We actually used this check as an example of something seemingly simple that is not so simple to implement. In the paper above, the discussion is right beneath Table 1.

  • Set of attributeNames in EML match set of column names in data table

  • Order matters; i.e., the attributeName is not a key to a column somewhere in the table. So we would probably never simply look at the two sets of strings.

  • Order of attributeNames in EML (first to last in attributeList) match column names left to right

Because the goal is that the attribute description can actually be used to read the data for analysis, checks have considered some of the other aspects of "matching" between metadata and data. In pie-in-the-sky discussions we've come up with a need to ensure:

  • order, uniqueness, typing, precision, range, quantity, unit, and even semantic meaning.

Some of the easier ones have been addressed in checks; see attributeNamesUnique, and dataLoadStatus (which uses postgres to check typing).

And we have considered a check like this one, to ensure that:

attributeName (in order) matches column header (in order)

There are some complications, among them:

  1. There are no standard formats for text tables, which makes an 'acceptable table' difficult to define.
  2. Headers can be any number of lines. If there is more than one, does one of them hold the names of the attributes? If so, how do we identify it?

So the logic gets a bit tricky. The first attempt at that check was simply to display both the attributeNames and the header for a user to compare them manually. See this check: headerRowAttributeNames.
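For illustration only (this is not the ECC implementation; the file names, comma delimiter, and single-header-row assumption are all assumptions), here is a sketch of the comparison that check puts in front of the user:

```python
# Sketch: line up EML attributeNames against a one-line CSV header.
# Assumes the first line of "table.csv" is the header -- exactly the
# assumption the real check cannot safely make in general.
import csv
import xml.etree.ElementTree as ET

eml = ET.parse("metadata.xml")
attribute_names = [
    a.text.strip()
    for a in eml.findall(".//dataTable/attributeList/attribute/attributeName")
]

with open("table.csv", newline="") as f:
    header = next(csv.reader(f))

# Count congruence (what tooFewFields/tooManyFields aim at).
if len(header) != len(attribute_names):
    print(f"WARN: {len(attribute_names)} attributes vs {len(header)} columns")

# Name-and-order congruence (what headerRowAttributeNames displays).
for i, (name, col) in enumerate(zip(attribute_names, header)):
    if name != col.strip():
        print(f"WARN: position {i}: attributeName '{name}' != header '{col.strip()}'")
```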

But now that PASTA is recording that in reports (for approximately the last year or two), we could start to analyze report content and see if reasonable logic might be developed to do more.

Warn vs error: the ECC committee will not create an unnecessarily high bar for acceptance. That means only "unusable data" is rejected (gets an error). And for now, programmatic access is not the norm; humans can usually figure out what to do (e.g., by reading a table into R and applying manual examination, interpretation, and plotting). So until programmatic access becomes the norm, these sorts of checks will generate only a warn.

But again - thanks for pushing this community forward! We'll do our best to keep up. The ECC committee is a great group, and welcomes new members.

atn38 commented 5 years ago

Margaret,

Thanks for the history lesson and the context! I see that it's not always straightforward to implement these checks, as much as they make sense.

By the bye, I looked at the one BLE dataset's ECC report here https://portal.lternet.edu/nis/reportviewer?packageid=knb-lter-ble.1.5 and found that only the second entity out of four total (Elson 2015 spatial survey) has "tooFewFields" and "tooManyFields" listed as executed checks. Do checks ever run silently and not show up in reports?

More on the semantic/expanding-EML side, but the hackathon also brought up the need to identify which data columns contain contextualizing information (spatial, temporal, possibly taxonomic) and which ones contain measurements. Knowing that would help immensely. For example, how does a program identify whether a table contains spatial coordinate columns, and if so, which columns they are and which is x, y, z? Date/times are a bit easier to approach but also not completely straightforward.
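To make the question concrete, here is a hedged sketch of one heuristic approach (the unit set, name hints, and element paths below are assumptions, not an agreed-on convention) that guesses column roles from the attribute metadata already in the EML:

```python
# Sketch of role-guessing heuristics over EML attribute metadata.
# The unit names and keyword hints here are illustrative assumptions.
import xml.etree.ElementTree as ET

SPATIAL_UNITS = {"degree", "meter"}              # e.g. lat/lon in degrees, z in meters
SPATIAL_HINTS = ("lat", "lon", "depth", "elev")  # crude attributeName hints

def guess_roles(eml_path):
    eml = ET.parse(eml_path)
    roles = {}
    for attr in eml.findall(".//dataTable/attributeList/attribute"):
        name = attr.findtext("attributeName", default="").strip()
        unit = attr.findtext(".//standardUnit", default="").strip()
        fmt = attr.findtext(".//dateTime/formatString", default="").strip()
        if fmt:                                  # declared dateTime column
            roles[name] = "temporal"
        elif unit in SPATIAL_UNITS or name.lower().startswith(SPATIAL_HINTS):
            roles[name] = "spatial (x/y/z unresolved)"
        else:
            roles[name] = "measurement or other"
    return roles

print(guess_roles("metadata.xml"))               # "metadata.xml" is hypothetical
```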


gastil commented 5 years ago

An,

The tooFewFields and tooManyFields checks use the test of INSERT statements to postgres as part of their logic. (Notice the databaseTableCreated check shows the CREATE TABLE generated from the attributes' metadata.) In knb-lter-ble.1.5, the dataTables which pass the db load test do have the tooManyFields and tooFewFields checks. The other ones fail the db load test, so they do not have the column number check.
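As a rough sketch of the load-based idea (not the ECC's actual code; the type mapping, table name, and file name are assumptions), generating the CREATE TABLE from attribute metadata looks something like this:

```python
# Sketch: build a CREATE TABLE statement from EML attribute metadata,
# the way a load-based check could. The type mapping is illustrative.
import xml.etree.ElementTree as ET

def pg_type(attr):
    # Map a few EML measurement scales/domains to postgres types.
    if attr.find(".//dateTime") is not None:
        return "timestamp"
    if attr.find(".//numericDomain") is not None:
        return "double precision"
    return "text"

eml = ET.parse("metadata.xml")                      # hypothetical file name
cols = [
    '"{}" {}'.format(a.findtext("attributeName", default="").strip(), pg_type(a))
    for a in eml.findall(".//dataTable/attributeList/attribute")
]
create_sql = "CREATE TABLE data_entity (\n  " + ",\n  ".join(cols) + "\n);"
print(create_sql)

# A per-row INSERT with exactly len(cols) placeholders is then what
# surfaces rows that have too few or too many fields during the load.
```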

For date- and time-related columns that do not fit an ISO-8601 formatString, you have two choices: accept warns and miss out on column number checks, or give up using dateTime as the measurementType. A hard choice. You can see why the dateTime check was the most complicated one the ECC working group tackled.

Similar to a "profile", by restricting formatStrings to ISO-8601 we made coding more practical in scope. Given infinite resources, if a different standard than 8601 were also available, and specified in the metadata, then a wider range of formatStrings could be handled.
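For illustration, a minimal sketch (the formatString-to-strptime mapping below is an assumption covering only a few common ISO-8601 patterns, not the ECC's rules) of what checking a column against a declared formatString can look like:

```python
# Sketch: validate date/time values against an EML formatString by
# translating a handful of ISO-8601 patterns to strptime directives.
# The mapping is deliberately tiny and illustrative.
from datetime import datetime

FORMAT_MAP = {
    "YYYY-MM-DD": "%Y-%m-%d",
    "YYYY-MM-DDThh:mm:ss": "%Y-%m-%dT%H:%M:%S",
    "YYYY": "%Y",
}

def check_column(values, format_string):
    """Return the values that do not parse under the declared formatString."""
    fmt = FORMAT_MAP.get(format_string)
    if fmt is None:
        return values                     # unrecognized formatString: punt
    bad = []
    for v in values:
        try:
            datetime.strptime(v, fmt)
        except ValueError:
            bad.append(v)
    return bad

print(check_column(["2015-06-01", "06/01/2015"], "YYYY-MM-DD"))
# -> ['06/01/2015']
```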

mobb commented 5 years ago

Clearly Gastil has a better memory than me. On your other issue:

More on the semantic/expanding-EML side, but the hackathon also brought up the need to identify which data columns contain contextualizing information (spatial, temporal, possibly taxonomic) and which ones contain measurements. Knowing that would help immensely. For example, how does a program identify whether a table contains spatial coordinate columns, and if so, which columns they are and which is x, y, z?

That is a good example of where semantic annotation can help. I'll need to hunt down some examples for you of what that would look like.
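In the meantime, a rough sketch only (the URIs are placeholders, not recommendations, and the file name is hypothetical) of how a tool might read EML 2.2 attribute-level annotations to pick out, say, coordinate columns:

```python
# Sketch: read attribute-level annotations from an EML 2.2 document and
# report columns whose valueURI points at a (placeholder) ontology term
# for latitude or longitude.
import xml.etree.ElementTree as ET

# Placeholder URIs standing in for whatever terms the community settles on.
SPATIAL_TERMS = {
    "http://example.org/terms/latitude": "y (latitude)",
    "http://example.org/terms/longitude": "x (longitude)",
}

eml = ET.parse("metadata.xml")
for attr in eml.findall(".//dataTable/attributeList/attribute"):
    name = attr.findtext("attributeName", default="").strip()
    for ann in attr.findall("annotation"):
        value_uri = (ann.findtext("valueURI") or "").strip()
        if value_uri in SPATIAL_TERMS:
            print(f"{name}: {SPATIAL_TERMS[value_uri]}")
```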