CenterForOpenScience / SHARE

SHARE is building a free, open, data set about research and scholarly activities across their life cycle.
http://share-research.readthedocs.io/en/latest/index.html
Apache License 2.0

RSS entries without sources? #115

Closed efc closed 9 years ago

efc commented 9 years ago

I've been playing around with the RSS feed and finding a number of entries without any source field in the description. See the list below, which is pulled from RSS data with a lastBuildDate of 2014-12-04 14:48:09.978009.

The sources I found in this pull from RSS included {'dataone': 133, 'spdataverse': 2, 'upenn': 22, 'pushtest': 2, 'no source': 62, 'cmu': 2, 'stcloud': 1, 'mit': 25, 'trinity': 1}

That "no source" count comes from records that were missing the "source" element in their description. The "link" values for these records are in the list below. They seem to come from quite a few of our providers (I'm not worried about the "anexample" one, which seems to be from push testing), most of which also have records that included sources.
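A minimal sketch of how a tally like the one above could be produced, assuming each RSS item's description holds a JSON document (as in the examples later in this thread). The function name `count_sources` and the sample feed are hypothetical and only for illustration; this is not the script that produced the numbers above.

```python
import json
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical two-item feed: one record with a source, one without.
SAMPLE = """<rss><channel>
<item><link>http://example.org/1</link><description>{"source": "mit"}</description></item>
<item><link>http://example.org/2</link><description>{"isResource": true}</description></item>
</channel></rss>"""


def count_sources(rss_xml):
    """Tally the "source" field found in each item's JSON description."""
    counts = Counter()
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        description = item.findtext("description", default="")
        try:
            record = json.loads(description)
        except json.JSONDecodeError:
            record = {}  # treat an unparsable description as having no source
        if not isinstance(record, dict):
            record = {}
        counts[record.get("source", "no source")] += 1
    return counts


print(count_sources(SAMPLE))
```

Records that lack the key fall into the "no source" bucket, which mirrors how the count above was described.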

Any idea why these seem to be sneaking through without the source being attributed?

http://digitalcommons.trinity.edu/psych_faculty/76 http://hdl.handle.net/10864/10334 http://hdl.handle.net/1721.1/92017 http://hdl.handle.net/1721.1/92025 http://hdl.handle.net/1721.1/92026 http://hdl.handle.net/1721.1/92029 http://hdl.handle.net/1721.1/92030 http://repository.cmu.edu/dissertations/404 http://repository.cmu.edu/dissertations/405 http://repository.stcloudstate.edu/lrs_facpubs/42 http://repository.upenn.edu/bellwether/vol1/iss80/13 http://repository.upenn.edu/bellwether/vol1/iss80/16 http://repository.upenn.edu/bellwether/vol1/iss80/2 http://repository.upenn.edu/bellwether/vol1/iss80/4 http://repository.upenn.edu/bellwether/vol1/iss80/6 http://repository.upenn.edu/bellwether/vol1/iss80/8 http://www.anexample.org/article https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.0r7g4%3Fver%3D2014-12-03T13%3A10%3A26.351-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.116td%3Fver%3D2014-12-03T11%3A40%3A41.868-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.16pj3%2F1%3Fver%3D2014-12-03T12%3A44%3A49.623-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.16pj3%3Fver%3D2014-12-03T12%3A44%3A43.282-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.1jv55%3Fver%3D2014-12-03T13%3A58%3A10.441-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.204bc%3Fver%3D2014-12-03T11%3A59%3A42.415-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.2g203%3Fver%3D2014-12-03T14%3A34%3A03.734-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.46dt3%3Fver%3D2014-12-03T12%3A50%3A31.598-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.46dt3%3Fver%3D2014-12-03T12%3A52%3A40.684-05%3A00 
https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.4k5st%2F1%3Fver%3D2014-12-03T12%3A51%3A25.187-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.4k5st%3Fver%3D2014-12-03T12%3A51%3A15.695-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.55203%3Fver%3D2014-12-03T11%3A05%3A27.683-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.5tp2v%3Fver%3D2014-12-03T13%3A04%3A26.817-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.5tp41%3Fver%3D2014-12-03T10%3A38%3A15.190-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.6bp0f%3Fver%3D2014-12-03T11%3A06%3A53.383-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.7c06s%3Fver%3D2014-12-03T13%3A22%3A23.566-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.7tv47%3Fver%3D2014-12-03T12%3A27%3A41.151-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.83qh4%3Fver%3D2014-12-03T13%3A14%3A47.170-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.86gm0%3Fver%3D2014-12-03T14%3A04%3A12.200-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.90005%3Fver%3D2014-12-03T12%3A29%3A26.717-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.96cp4%3Fver%3D2014-12-03T13%3A47%3A33.863-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.f4114%2F3%3Fver%3D2014-12-03T10%3A52%3A45.022-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.f4114%2F4%3Fver%3D2014-12-03T10%3A52%3A54.400-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.f4114%3Fver%3D2014-12-03T10%3A52%3A11.393-05%3A00 
https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.g6q07%2F1%3Fver%3D2014-12-03T14%3A35%3A12.567-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.g6q07%3Fver%3D2014-12-03T14%3A34%3A42.792-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.h4n48%2F1%3Fver%3D2014-12-03T12%3A39%3A57.230-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.h4n48%3Fver%3D2014-12-03T12%3A39%3A41.304-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.h7m7t%3Fver%3D2014-12-03T12%3A56%3A48.590-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.hc445%2F1%3Fver%3D2014-12-03T15%3A58%3A51.446-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.hc445%2F2%3Fver%3D2014-12-03T15%3A59%3A02.130-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.hc445%3Fver%3D2014-12-03T15%3A58%3A41.446-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.j96r3%3Fver%3D2014-12-03T13%3A01%3A23.526-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.jc0df%2F1%3Fver%3D2014-12-03T14%3A45%3A47.454-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.jc0df%3Fver%3D2014-12-03T14%3A19%3A52.337-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.k5786%3Fver%3D2014-12-03T14%3A57%3A03.192-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.m9160%2F1%3Fver%3D2014-12-03T15%3A46%3A15.232-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.m9160%3Fver%3D2014-12-03T15%3A45%3A57.015-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.p1s20%3Fver%3D2014-12-03T13%3A13%3A21.354-05%3A00 
https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.qn474%2F5%3Fver%3D2014-12-03T15%3A54%3A42.890-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.qn474%2F6%3Fver%3D2014-12-03T15%3A54%3A58.861-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.qn474%2F8%3Fver%3D2014-12-03T15%3A55%3A20.228-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.qn474%3Fver%3D2014-12-03T15%3A53%3A47.273-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.tm8k3%3Fver%3D2014-12-03T13%3A12%3A04.179-05%3A00 https://cn.dataone.org/cn/v1/resolve/http%3A%2F%2Fdx.doi.org%2F10.5061%2Fdryad.vp743%3Fver%3D2014-12-03T12%3A33%3A27.309-05%3A00

erinspace commented 9 years ago

Huh! This is puzzling indeed... will definitely look into it and see what we can find. Thanks for the detailed report!

efc commented 9 years ago

OK, I've found what I think is the culprit, but I still don't understand why this is happening.

It looks like the records without sources are duplicates of records with sources, but not every record has such a duplicate. Why are we getting two records in the RSS feed for the same item? Are we somehow keeping outdated versions of some records where scrAPI ingests them? Why would we be creating outdated versions of relatively new records? Do we have multiple (old and new) harvests running?

Here is an example (please forgive the slightly odd output, it comes from a python script I've been fiddling with):

title {} Two-phase westward encroachment of Basin and Range extension into the northern Sierra Nevada
link {} http://digitalcommons.trinity.edu/geo_faculty/6
description {} {
    "_id": "5480f601b5e9d7413507b733", 
    "attached": {
        "pmid": "5480f600b5e9d7413507b731"
    }, 
    "category": "metadata", 
    "collisionCategory": 1, 
    "contributors": [
        {
            "ORCID": "", 
            "email": "", 
            "family": "Surpless", 
            "given": "Benjamin", 
            "middle": "E", 
            "prefix": "", 
            "suffix": ""
        }, 
        {
            "ORCID": "", 
            "email": "", 
            "family": "Stockli", 
            "given": "Daniel", 
            "middle": "F", 
            "prefix": "", 
            "suffix": ""
        }, 
        {
            "ORCID": "", 
            "email": "", 
            "family": "Dumitru", 
            "given": "Trevor", 
            "middle": "A", 
            "prefix": "", 
            "suffix": ""
        }, 
        {
            "ORCID": "", 
            "email": "", 
            "family": "Miller", 
            "given": "Elizabeth", 
            "middle": "L", 
            "prefix": "", 
            "suffix": ""
        }
    ], 
    "dateCreated": "2002-01-01T08:00:00+00:00", 
    "dateUpdated": "2014-12-04T22:30:08+00:00", 
    "description": "", 
    "id": {
        "doi": "", 
        "serviceID": "oai:digitalcommons.trinity.edu:geo_faculty-1005", 
        "url": "http://digitalcommons.trinity.edu/geo_faculty/6"
    }, 
    "meta": {
        "docHash": "68d8ced20b61aff29cfce2f610b50fb2", 
        "uids": [
            "2fed17df83cf2aac3a34288994e48e15", 
            "8ee1920781067996fd16b942d7685c2c", 
            "2fed17df83cf2aac3a34288994e48e15", 
            "oaidigitalcommonstrinityedugeofaculty1005"
        ]
    }, 
    "properties": {
        "format": "application/pdf", 
        "identifiers": [
            "http://digitalcommons.trinity.edu/geo_faculty/6", 
            "http://digitalcommons.trinity.edu/cgi/viewcontent.cgi?article=1005&context=geo_faculty"
        ], 
        "publisher": "Digital Commons @ Trinity", 
        "set_spec": "publication:geo_faculty", 
        "source": "Geosciences Faculty Research", 
        "type": "text"
    }, 
    "source": "trinity", 
    "tags": [
        "earth sciences"
    ], 
    "timestamps": {
        "consumeFinished": "2014-12-04T23:59:04.517867", 
        "consumeStarted": "2014-12-04T23:59:01.158067", 
        "consumeTaskCreated": "2014-12-04T23:59:00.088331", 
        "normalizeFinished": "2014-12-05T00:01:52.938798", 
        "normalizeStarted": "2014-12-05T00:01:52.936621", 
        "normalizeTaskCreated": "2014-12-05T00:00:33.726819"
    }, 
    "title": "Two-phase westward encroachment of Basin and Range extension into the northern Sierra Nevada"
}
author {} trinity
category {} earth sciences
guid {} oai:digitalcommons.trinity.edu:geo_faculty-1005
pubDate {} Thu, 04 Dec 2014 22:30:08 GMT

...and just following that in the RSS feed...

title {} Two-phase westward encroachment of Basin and Range extension into the northern Sierra Nevada
link {} http://digitalcommons.trinity.edu/geo_faculty/6
description {} {
    "_id": "5480f600b5e9d7413507b731", 
    "attached": {
        "cmids": [
            "5480f601b5e9d7413507b733"
        ]
    }, 
    "category": "metadata", 
    "collisionCategory": 1, 
    "contributors": [
        {
            "ORCID": "", 
            "email": "", 
            "family": "Surpless", 
            "given": "Benjamin", 
            "middle": "E", 
            "prefix": "", 
            "suffix": ""
        }, 
        {
            "ORCID": "", 
            "email": "", 
            "family": "Stockli", 
            "given": "Daniel", 
            "middle": "F", 
            "prefix": "", 
            "suffix": ""
        }, 
        {
            "ORCID": "", 
            "email": "", 
            "family": "Dumitru", 
            "given": "Trevor", 
            "middle": "A", 
            "prefix": "", 
            "suffix": ""
        }, 
        {
            "ORCID": "", 
            "email": "", 
            "family": "Miller", 
            "given": "Elizabeth", 
            "middle": "L", 
            "prefix": "", 
            "suffix": ""
        }
    ], 
    "dateCreated": "2002-01-01T08:00:00+00:00", 
    "dateUpdated": "2014-12-04T22:30:08+00:00", 
    "description": "", 
    "isResource": true, 
    "links": [
        {
            "longName": "Digital Commons@Trinity", 
            "shortName": "trinity", 
            "url": "http://digitalcommons.trinity.edu/geo_faculty/6"
        }
    ], 
    "meta": {
        "uids": [
            "8ee1920781067996fd16b942d7685c2c", 
            "2fed17df83cf2aac3a34288994e48e15"
        ]
    }, 
    "properties": {
        "format": "application/pdf", 
        "identifiers": [
            "http://digitalcommons.trinity.edu/geo_faculty/6", 
            "http://digitalcommons.trinity.edu/cgi/viewcontent.cgi?article=1005&context=geo_faculty"
        ], 
        "publisher": "Digital Commons @ Trinity", 
        "set_spec": "publication:geo_faculty", 
        "source": "Geosciences Faculty Research", 
        "type": "text"
    }, 
    "tags": [
        "earth sciences"
    ], 
    "timestamps": {
        "consumeFinished": "2014-12-04T23:59:04.517867", 
        "consumeStarted": "2014-12-04T23:59:01.158067", 
        "consumeTaskCreated": "2014-12-04T23:59:00.088331", 
        "normalizeFinished": "2014-12-05T00:01:52.938798", 
        "normalizeStarted": "2014-12-05T00:01:52.936621", 
        "normalizeTaskCreated": "2014-12-05T00:00:33.726819"
    }, 
    "title": "Two-phase westward encroachment of Basin and Range extension into the northern Sierra Nevada"
}
category {} earth sciences
guid {} 5480f600b5e9d7413507b731
pubDate {} Thu, 04 Dec 2014 22:30:08 GMT

These two entries clearly come from the same source reporting the same resource: all the timestamps match exactly, and the links match as well. The differences between them include:

The normalized ("description") "_id" values do not match, the first has a description/id and the second a description/links, the first has a "docHash" and more "uids" than the second, they have different "guid" values, and of course only the first has the description/source.
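A comparison like this can be done mechanically. The sketch below is only illustrative: `compare_records` is a hypothetical helper, and the two dictionaries are trimmed copies of the records shown above (contributors, properties, tags and timestamps are omitted because they are identical).

```python
def compare_records(first, second):
    """Report which top-level keys are unique to each record or differ in value."""
    shared = set(first) & set(second)
    return {
        "only_first": sorted(set(first) - set(second)),
        "only_second": sorted(set(second) - set(first)),
        "differing": sorted(k for k in shared if first[k] != second[k]),
    }


# Trimmed copy of the first record (the one that has a source).
event = {
    "_id": "5480f601b5e9d7413507b733",
    "attached": {"pmid": "5480f600b5e9d7413507b731"},
    "id": {"url": "http://digitalcommons.trinity.edu/geo_faculty/6"},
    "meta": {"docHash": "68d8ced20b61aff29cfce2f610b50fb2"},
    "source": "trinity",
    "title": "Two-phase westward encroachment",
}

# Trimmed copy of the second record (the one without a source).
resource = {
    "_id": "5480f600b5e9d7413507b731",
    "attached": {"cmids": ["5480f601b5e9d7413507b733"]},
    "isResource": True,
    "links": [{"shortName": "trinity"}],
    "meta": {"uids": ["8ee1920781067996fd16b942d7685c2c"]},
    "title": "Two-phase westward encroachment",
}

print(compare_records(event, resource))
```

Note that the "attached" values point at each other: the first record's "pmid" is the second record's "_id", and the second record's "cmids" lists the first record's "_id".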

chrisseto commented 9 years ago

@efc @erinspace The records without sources are "resources." Resources are the collision-detected aggregate metadata for each unique event. The isResource field is there for easily filtering out events/records or vice versa.

Resources will be updated when new information comes into the system referring to an event that already exists but the original events will not be altered.

So if we get three events:

  1. Original Event from source 1
  2. Original Event from source 2
  3. Event Update from source 1

There will be four records in the system, one for each event, and then a single resource which contains the data from all the events merged into one.
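The three-event example can be sketched as follows. This is a rough illustration under loud assumptions, not scrAPI's actual merge logic: the `merge_events` function, the rule that later events override earlier ones, and the `sources` field on the merged record are all hypothetical.

```python
def merge_events(events):
    """Merge events describing the same item into one aggregate "resource".

    Assumption: later events override earlier values for the same key.
    The "sources" list is a hypothetical field used only in this sketch.
    """
    resource = {"isResource": True, "sources": []}
    for event in events:
        for key, value in event.items():
            if key == "source":
                if value not in resource["sources"]:
                    resource["sources"].append(value)
            else:
                resource[key] = value
    return resource


events = [
    {"source": "source1", "title": "A paper"},                       # original event from source 1
    {"source": "source2", "title": "A paper", "tags": ["geology"]},  # original event from source 2
    {"source": "source1", "title": "A paper (revised)"},             # event update from source 1
]

# The original events are kept unaltered; the resource is stored alongside them.
records = events + [merge_events(events)]
print(len(records))  # 4
```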

efc commented 9 years ago

@chrisseto, what you explained to @erinspace and me about the cause of these "no source" records raises a number of questions for me. One of the fundamental tenets of the notification service is that we get one report of an event in, we send one report out. As envisioned, the NS also would not try to "update" records, but I think some of this collision detection work is changing that.

You have also used the word "resource" in a way I am not familiar with. I usually talk about "providers" (or "sources") sending us "reports" (or our "harvesting" from them and generating the "reports" from our harvests) about "research release events" that refer to "resources" created in the research ecosystem. With this in mind:

In your usage, "resource" is something else; it appears to be some kind of aggregation of information about related events.

Furthermore, the resource records you describe span multiple "sources" (or "providers").

CASE A: Let's say we harvest a provider twice, and the records they provide overlap somewhat. In that case we may get duplicate records in the OAI-PMH feed that should NOT become duplicate "event" reports, because they all refer to the same event, the deposit of a given paper.

CASE B: Let's say we harvest from a given provider at two different times and for their own reasons they have updated a record we got earlier. In this case the record may not be a duplicate record, but the record still does not refer to a new "event." Rather, it modifies our understanding of an event we already knew about and recorded. In that case we could decide to discard the updated information and stick with our previous understanding of the event, or to update our event "report" to incorporate the new information. But we still must understand this as a single "event" relating to a given provider. The clue that this is still the same event is probably the persistent link to the "resource" (my meaning of the word) itself.

CASE C: Let's say that we harvest a report of a given resource (my meaning again) from provider A, and then later get a report of what appears to be the same resource from provider B (for example, a preprint with two authors at two different institutions deposited to both institutional repositories). In this case we should maintain reports of each "event" (the appearance of the paper in A and also its appearance in B). This is NOT a duplicate "event" even though the paper may be the same.

From what you've described, right now the NS does not behave this way. In each of these three cases we would have two "event" records and one "resource" (in your sense of the word) record.

However, from the RSS feed and other "subscription" sources the notification system should actually provide in CASE A just one notification, in CASE B just one notification (either updated or not), and in CASE C two notifications (one from each provider).

It is OK if internally things are represented differently to assist in collision detection, but to the user of the service it has to appear that our getting a report about one "research release event" results in one "notification."
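The expected notification counts for the three cases can be sketched as below. This assumes an event is identified by the pair of provider and persistent link, which follows the clue suggested in CASE B; the `notifications` function and the field names `source`, `url` and `title` are hypothetical, and CASE B is shown with the "update" option rather than the "discard" option.

```python
def notifications(reports):
    """Collapse harvested reports into one notification per (provider, link)."""
    latest = {}
    for report in reports:
        key = (report["source"], report["url"])
        latest[key] = report  # a later report replaces the earlier one
    return list(latest.values())


url = "http://example.org/paper/1"

# CASE A: the same provider delivers the same record twice.
case_a = [{"source": "A", "url": url, "title": "Paper"},
          {"source": "A", "url": url, "title": "Paper"}]

# CASE B: the same provider delivers an updated record later.
case_b = [{"source": "A", "url": url, "title": "Paper"},
          {"source": "A", "url": url, "title": "Paper (corrected)"}]

# CASE C: two providers report what appears to be the same paper.
case_c = [{"source": "A", "url": url, "title": "Paper"},
          {"source": "B", "url": "http://example.net/items/9", "title": "Paper"}]

for name, reports in (("A", case_a), ("B", case_b), ("C", case_c)):
    print(name, len(notifications(reports)))
```

Under these assumptions CASE A and CASE B each yield one notification and CASE C yields two, whatever the internal collision-detection records look like.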

efc commented 9 years ago

By the way, the collision of the term "resource" turns out to be evident already in our glossary. This is just the first time I've come across it in real life.

efc commented 9 years ago

A few more questions about the "resource records." I'm not necessarily looking for quick answers, I just don't want to forget the questions!

How does "isResource" help filter these records? More commonly, I imagine, we would need to find those things that are NOT resources. Since the "isResource" field is not even present in non-resource records, this makes filtering a bit awkward (though doable). If the dataset is to have records of multiple "types" mixed together, wouldn't it be more flexible to have a "type" field with entries like "event" and "resource"? That would also allow us to grow beyond two types, if we ever needed to.
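To make the comparison concrete, here is a small sketch of both filtering styles. The sample records are made up, and the "type" field is only the proposal from the paragraph above, not something the service provides.

```python
# Current shape: "isResource" appears only on resource records.
records = [
    {"title": "Paper", "source": "trinity"},
    {"title": "Paper", "isResource": True},
]

# Because the key is absent on events, a missing value must be treated as False.
events = [r for r in records if not r.get("isResource", False)]

# Proposed shape (hypothetical): every record carries an explicit "type".
typed_records = [
    {"title": "Paper", "source": "trinity", "type": "event"},
    {"title": "Paper", "type": "resource"},
]

# Filtering becomes a plain equality test and extends to further types.
typed_events = [r for r in typed_records if r["type"] == "event"]

print(len(events), len(typed_events))  # 1 1
```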

I've found it is easier to just filter for items that have an actual source. That technique, however, leaves me wondering whether all resources are sourceless. And why don't the resource records have a "source", since they clearly come from a source?

How does additional information brought to the notification service through these resource records get incorporated into results? What kind of post-query processing do you envision to tie events and their halo of collision-inspired resources together?

JeffSpies commented 9 years ago

We're removing events with isResource from default output. As such, closing.