EOL / ContentImport

A placeholder for DATA tickets every time Jira is unavailable.

EOL content summary resource #11

Open jhammock opened 4 months ago

jhammock commented 4 months ago

Precipitated by a discussion of per-taxon content metrics with @metasj, though also weirdly similar to #6. We would like to institute a simple (maybe 3- or 4-component?) content summary within EOL. Proposed measurements, for discussion:

I could stop there, but of course we could also include text objects, vernacular names (language count), and/or measure depth on some of the above.

KatjaSchulz commented 3 months ago

I think it may be worth starting with a comprehensive page inventory that we can then leverage in different ways, e.g., by developing a rationale for a minimally informative page or a new concept of a "rich" page and pages that are rich or poor with respect to certain content types. So I would want to include number of articles in this inventory, along with the article subjects and languages. For the media, I would want a count of media objects by media type (image, video, sound). At some point we could start leveraging the computer vision code to provide further categories for images. For trait data, I would want a count of all data records, a count of the measurementTypes and a list of all measurementTypes represented on the page. This would allow us to give special consideration if a page has measurementTypes from a meaningful/informative list. It may also make sense to list the values for categorical data, making exceptions for values of certain measurement types for people, institutions, and geography.
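A minimal sketch of what such a page inventory could look like, offered for discussion only. The record shapes (`subject`, `language`, `type`, `measurementType`), the function name `page_inventory`, and the `example.org` URI are assumptions made for illustration; they are not EOL's actual schema or API.

```python
from collections import Counter


def page_inventory(articles, media, traits):
    """Summarize one page's content from plain dict records (hypothetical shapes)."""
    # Distinct measurement types present on the page, sorted for stable output.
    measurement_types = sorted({t["measurementType"] for t in traits})
    return {
        "article_count": len(articles),
        "article_subjects": sorted({a["subject"] for a in articles}),
        "article_languages": sorted({a["language"] for a in articles}),
        # Count of media objects per media type (image, video, sound).
        "media_by_type": dict(Counter(m["type"] for m in media)),
        "trait_record_count": len(traits),
        "measurement_type_count": len(measurement_types),
        "measurement_types": measurement_types,
    }


# Made-up sample data for a single page.
example = page_inventory(
    articles=[
        {"subject": "Description", "language": "en"},
        {"subject": "Habitat", "language": "es"},
    ],
    media=[{"type": "image"}, {"type": "image"}, {"type": "sound"}],
    traits=[
        {"measurementType": "http://example.org/terms/BodyMass", "value": "12 g"},
    ],
)
```

Simpler metrics, such as a "rich" versus "poor" flag per content type, could then be derived from this dictionary rather than computed separately.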

jhammock commented 3 months ago

That's fine with me. You're right: we can derive simpler metrics from this as the needs arise. I can't think of any other value types that would be numerous and unexciting, so skip counting and listing measurementValues for

http://eol.org/schema/terms/TypeSpecimenRepository and http://eol.org/schema/terms/Present

What about children of /Present? Native, introduced, adventive, etc.? I'd certainly want to know if they were there, but I think I could skip counting/listing geographic values for those too.

I don't think people feature in our data yet as values.

Apart from that, do we want a count of records per measurement type (and/or per value, for values we are listing)?
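One possible reading of that question, sketched under assumptions: count records per measurement type for every type, and count per value only for types whose values are being listed. The function name `summarize_traits`, the `(measurementType, value)` tuple shape, the `example.org` URI, and the sample values are hypothetical; only the two `eol.org` term URIs come from this thread.

```python
from collections import Counter, defaultdict

# Measurement types whose values are not counted or listed (the two named above).
# Children of /Present could be added here if it is decided to skip them too.
SKIP_VALUE_LISTING = {
    "http://eol.org/schema/terms/TypeSpecimenRepository",
    "http://eol.org/schema/terms/Present",
}


def summarize_traits(records, skip=SKIP_VALUE_LISTING):
    """Return (records per measurement type, records per value for listed types)."""
    per_type = Counter()
    per_value = defaultdict(Counter)
    for measurement_type, value in records:
        per_type[measurement_type] += 1
        # Skipped types still show up in per_type, so we know they are present.
        if measurement_type not in skip:
            per_value[measurement_type][value] += 1
    return per_type, {k: dict(v) for k, v in per_value.items()}


# Made-up sample records for a single page.
records = [
    ("http://eol.org/schema/terms/Present", "Canada"),
    ("http://eol.org/schema/terms/Present", "Mexico"),
    ("http://example.org/terms/Habitat", "marine"),
    ("http://example.org/terms/Habitat", "marine"),
]
per_type, per_value = summarize_traits(records)
```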