CenterForOpenScience / SHARE

SHARE is building a free, open, data set about research and scholarly activities across their life cycle.
http://share-research.readthedocs.io/en/latest/index.html
Apache License 2.0

Using ResourceSync for ArXiv Synchronization? #4

Closed laurenrevere closed 8 years ago

laurenrevere commented 10 years ago

I have been looking at various ways of gathering information from ArXiv. ResourceSync seems like it would be a great way to receive information from ArXiv. Do you agree? Does ArXiv even have ResourceSync compatibility at the moment?

AndrewSallans commented 10 years ago

For context, a few bits of information from conversation with Simeon at arXiv a week or so ago....

"[from Eric] Note that while ResourceSync does include a push protocol (code related to that may be what you saw Herbert announce recently), the arXiv implementation is of the pull protocol more analogous to OAI-PMH. I do believe we may want to consider implementing the push version to distribute data from the SHARE notification service, but with regard to arXiv we are looking at a harvest via ResourceSync. Simeon has a test server with a pointer to beta documentation available at http://resync.library.cornell.edu. Code for his simulator is available at https://github.com/resync/simulator.

[from Simeon] A while ago I had taken down the arXiv dataset from http://resync.library.cornell.edu ... I'm just in the process of putting the metadata portion back. Might be a day or so before the links all work. At some stage this will migrate to one of the main arXiv servers but I want to get it going here in good time for your work."

And from the Code4Lib list back in June....

"> From: Herbert van de Sompel hvdsomp@gmail.com

Subject: [resourcesync] Notification software
Date: June 27, 2014 at 3:36:14 PM EDT
To: "resourcesync@googlegroups.com" resourcesync@googlegroups.com
Cc: Herbert van de Sompel hvdsomp@gmail.com

Hi all

I am happy to be able to announce the availability of Python software [1] that implements ResourceSync notifications.

The tool provides implementations for:

  • Source (Publisher in PubSubHubbub lingo)
  • Hub
  • Destination (Subscriber in PubSubHubbub lingo)

and is compliant with:

  • The most recent version of the PubSubHubbub protocol [2]
  • The ResourceSync Notification specification [3]

Cheers

Herbert

[1] https://github.com/resync/resourcesync_push
[2] https://pubsubhubbub.googlecode.com/git/pubsubhubbub-core-0.4.html
[3] http://www.openarchives.org/rs/notification/0.9/notification
"

JeffSpies commented 10 years ago

Can someone point @geeksnglitter to ArXiv-specific docs referring to resync? She's had a difficult time finding that information other than the simulator. For instance, is all of the data on resync? How does resync differ from their currently recommended data gathering solutions (i.e., OAI or API)? Updates do not happen on OAI--will those happen on resync?

AndrewSallans commented 10 years ago

@zimeon @efc Please jump in here when you have time to help clarify the ResourceSync and arXiv interactions.

zimeon commented 10 years ago

There are no arXiv docs on ResourceSync support yet, still working on getting this going for you guys! Current data is available from http://resync.library.cornell.edu with the Capability List for arXiv data at http://resync.library.cornell.edu/arxiv-all/capabilitylist.xml . This includes the full Resource List and a daily Change List. At present only the internal metadata format is available (e.g. http://resync.library.cornell.edu/arxiv/ftp/arxiv/papers/0711/0711.0198.abs) but I can perhaps make the same formats as are available via OAI-PMH available (see: http://arxiv.org/help/oa). It would be good to discuss what is most useful though.

I note that the arXiv API (http://arxiv.org/help/api) is not intended for full harvesting, so that should not be considered.
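For readers unfamiliar with the layout, a Capability List is a small Sitemap-style XML document whose entries point at the other ResourceSync documents. The sketch below shows one way to pull the Resource List and Change List locations out of such a document with Python's standard library. The sample XML and its example.org URLs are made up for illustration and are not the contents of the arXiv Capability List; the namespaces and the `capability` attribute are assumed to follow the ResourceSync framework specification.

```python
import xml.etree.ElementTree as ET

# Namespaces assumed from the Sitemap protocol and the ResourceSync spec.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Hypothetical Capability List; the real arXiv document may differ.
SAMPLE_CAPABILITY_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="capabilitylist"/>
  <url>
    <loc>http://example.org/dataset/resourcelist.xml</loc>
    <rs:md capability="resourcelist"/>
  </url>
  <url>
    <loc>http://example.org/dataset/changelist.xml</loc>
    <rs:md capability="changelist"/>
  </url>
</urlset>
"""


def capabilities(xml_text):
    """Map each advertised capability name to the URL of its document."""
    root = ET.fromstring(xml_text)
    found = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        md = url.find("rs:md", NS)
        # Skip entries that lack a location or a capability attribute.
        if loc and md is not None and md.get("capability"):
            found[md.get("capability")] = loc.strip()
    return found


if __name__ == "__main__":
    for name, location in capabilities(SAMPLE_CAPABILITY_LIST).items():
        print(name, location)
```

A harvester would fetch the real Capability List over HTTP and pass the response body to `capabilities`; the fetch is left out here so the sketch runs without network access.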

AndrewSallans commented 10 years ago

Thanks for that added information, Simeon. I think we'll wait for the full ResourceSync docs from you, then take a look at it, and then be back in touch for a conversation if we need help figuring out any other parts of it. That seems like the most efficient route for us right now.

zimeon commented 10 years ago

I'm going to need some input on what you want first!

laurenrevere commented 10 years ago

From what I understand of ResourceSync, I gather that it is a specification for gathering changes in a resource. If you already have full resource lists and daily change lists, isn't most of the integration done? What are your next steps?

What are the benefits of using ResourceSync over OAI? I know ArXiv supplies OAI, which provides lists of new submissions daily; would there be any information that could fall through the cracks there?

For us, in general, the preferred format would be anything with key-value pairs, like JSON. That would be the easiest for us to deal with, but we are building ScrAPI to handle almost any format that is thrown at it. OAI could be helpful too, as we are writing a way to deal with most OAI formats.

Thank you for your help and patience!

AndrewSallans commented 10 years ago

@zimeon checking in to see if you have had a chance to look at Lauren's comments/questions. I'm happy to arrange a time to chat if that's an easier way to sort this out.

zimeon commented 10 years ago

@geeksnglitter wrote: From what I understand of ResourceSync, I gather that it is a specification for gathering changes in a resource. If you already have full resource lists and daily change lists, isn't most of the integration done? What are your next steps?

That really all depends on what resources you are after and what changes you are looking for. The current implementation has arXiv source and internal-format metadata files available as resources to be synchronized via ResourceSync. I imagine you might find it easier to process one of the metadata formats we use for OAI-PMH (see: http://arxiv.org/help/oa) and I could make one or more of these available too (but see also comments to your third para).

@geeksnglitter wrote: What are the benefits of using ResourceSync over OAI? I know ArXiv supplies OAI, which provides lists of new submissions daily; would there be any information that could fall through the cracks there?

The benefits of ResourceSync over OAI-PMH are first that it is looking to the future rather than a somewhat dated protocol; but also that it should be easier to program and work with; there is flexibility with features to optimize interactions in various ways; and that I think with the push/notification parts especially it likely is a great protocol for SHARE to use to push data out.

@geeksnglitter wrote: For us, in general, the preferred format would be anything with key-value pairs, like JSON. That would be the easiest for us to deal with, but we are building ScrAPI to handle almost any format that is thrown at it. OAI could be helpful too, as we are writing a way to deal with most OAI formats.

Well, depending on the library you parse with, XML can look just as key-value as JSON. (Though I am all for JSON-LD for other reasons.) I could create a JSON representation of the metadata, but I'd want to do that adhering as much as possible to some other standard; I don't want to roll our own for a special 1:1 exchange. Is there something you have in mind? (Extra credit if it is JSON-LD.)
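To illustrate the point about XML looking key-value, here is a minimal sketch that flattens a simple XML record into a Python dict and serializes it as JSON. The record, its element names, and the example.org namespace are invented for the example and are not any of the arXiv or OAI-PMH metadata formats; `record_to_dict` and `local_name` are hypothetical helpers.

```python
import json
import xml.etree.ElementTree as ET

# Made-up record for illustration only; not an arXiv or OAI-PMH format.
SAMPLE_RECORD = """<record xmlns="http://example.org/ns">
  <title>An example title</title>
  <creator>A. Author</creator>
  <creator>B. Author</creator>
  <abstract>A short abstract.</abstract>
</record>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def record_to_dict(xml_text):
    """Turn the direct children of the root element into key-value pairs.

    Repeated elements become lists; single elements stay plain strings.
    Nested elements and attributes are ignored in this sketch.
    """
    root = ET.fromstring(xml_text)
    collected = {}
    for child in root:
        value = (child.text or "").strip()
        collected.setdefault(local_name(child.tag), []).append(value)
    return {k: v[0] if len(v) == 1 else v for k, v in collected.items()}


if __name__ == "__main__":
    print(json.dumps(record_to_dict(SAMPLE_RECORD), indent=2))
```

This only covers flat records. Formats with nesting or meaningful attributes would need a richer mapping, which is part of why adhering to an existing standard is attractive.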

laurenrevere commented 10 years ago

Thank you for clarifying.

The ResourceSync push notifications are very exciting. I was not aware that you were planning on implementing them for ArXiv. That is definitely another great reason to use ResourceSync.

I will have to talk to the ScrAPI team to see what standard they might prefer.

@fabianvf do you have some input?

laurenrevere commented 10 years ago

ScrAPI is very flexible. We parse a large number of different metadata formats and are constantly adding new parsers for unfamiliar formats.

I think that with the current change lists and resource list we can easily use ResourceSync to gather ArXiv information via ScrAPI, which means we can move forward with adding an ArXiv consumer to ScrAPI in its current state.

All we would have to do is an initial sync from the full resource list, check the change lists daily, and parse the ArXiv internal formatted data that is linked there.

Is that correct?
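The three steps described above (baseline from the Resource List, daily Change List check, then parsing the linked records) can be sketched roughly as follows. This is an illustration under assumptions, not the ScrAPI consumer: the sample documents and example.org URLs are invented, the function names are hypothetical, and timestamps are assumed to appear in `<lastmod>` with the change type in the `change` attribute of `<rs:md>`.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}
_HEAD = ('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" '
         'xmlns:rs="http://www.openarchives.org/rs/terms/">')

# Hypothetical documents; real lists would be fetched over HTTP.
SAMPLE_RESOURCE_LIST = _HEAD + """
  <url><loc>http://example.org/papers/a.abs</loc><lastmod>2014-07-01T00:00:00Z</lastmod></url>
  <url><loc>http://example.org/papers/b.abs</loc><lastmod>2014-07-01T00:00:00Z</lastmod></url>
</urlset>"""
SAMPLE_CHANGE_LIST = _HEAD + """
  <url><loc>http://example.org/papers/a.abs</loc><lastmod>2014-07-02T00:00:00Z</lastmod><rs:md change="updated"/></url>
  <url><loc>http://example.org/papers/b.abs</loc><lastmod>2014-07-02T00:00:00Z</lastmod><rs:md change="deleted"/></url>
  <url><loc>http://example.org/papers/c.abs</loc><lastmod>2014-07-02T00:00:00Z</lastmod><rs:md change="created"/></url>
</urlset>"""


def parse_entries(xml_text):
    """Yield (uri, lastmod, change) for each <url> entry; change may be None."""
    for url in ET.fromstring(xml_text).findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        if not loc:
            continue
        md = url.find("rs:md", NS)
        change = md.get("change") if md is not None else None
        yield loc.strip(), url.findtext("sm:lastmod", namespaces=NS), change


def initial_sync(resource_list_xml):
    """Step 1: build the baseline state (uri -> lastmod) from a Resource List."""
    return {uri: lastmod for uri, lastmod, _ in parse_entries(resource_list_xml)}


def apply_changes(state, change_list_xml):
    """Step 2: update state in place; return the URIs that need re-fetching."""
    to_fetch = []
    for uri, lastmod, change in parse_entries(change_list_xml):
        if change == "deleted":
            state.pop(uri, None)
        else:  # "created" or "updated"
            state[uri] = lastmod
            to_fetch.append(uri)
    return to_fetch


if __name__ == "__main__":
    state = initial_sync(SAMPLE_RESOURCE_LIST)
    # Step 3 (fetching and parsing each record) is left to the consumer.
    print(apply_changes(state, SAMPLE_CHANGE_LIST))
```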

zimeon commented 10 years ago

The ResourceSync feed currently on export should be fine for you to grab all changes in arXiv on a daily basis, provided you can parse the internal metadata format. I've recently had discussions with some others regarding moving from OAI-PMH to ResourceSync and thought about the possibility of replicating one or more of the XML metadata formats available through OAI-PMH (see http://arxiv.org/help/oa) as another representation of the metadata available via ResourceSync. Is that of any interest to you?

Also, in thinking about what a "notification" might mean in SHARE, I wonder how that might correspond or not to OAI-PMH or ResourceSync updates. These protocols are designed for resource replication/synchronization and as such there will be "updates" for any change. In arXiv that might be an admin going in and fixing a bad character in an abstract. Without some intelligence in parsing the updates and seeing what changed, you can't know why there is an update. What do you think? Is there a description of what a SHARE update is intended to be?
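As a rough illustration of the "intelligence" mentioned above, a consumer could keep the previously harvested version of a record and compare it field by field with the new one. The sketch assumes records have already been parsed into flat dicts; the field names and values are invented and `changed_fields` is a hypothetical helper.

```python
def changed_fields(old, new):
    """Return {field: (old_value, new_value)} for every field that differs.

    Fields present in only one version show None for the missing side.
    """
    keys = sorted(set(old) | set(new))
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}


if __name__ == "__main__":
    # Invented records: an admin fixes one bad character in the abstract.
    before = {"title": "An example title", "abstract": "Results on the M?bius strip."}
    after = {"title": "An example title", "abstract": "Results on the Mobius strip."}
    print(changed_fields(before, after))
```

Whether a given difference counts as a meaningful update is a policy decision that the protocol itself does not answer.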

efc commented 10 years ago

Actually, I did not plan on the SHARE Notification Service sending any "update" notices at all. The NS is only intended to notify of the emergence of a new resource. The NS does not have to keep anything in sync, only keep a record of where the original resource resides (persistent URL) and a bit of descriptive metadata (which could be a touch wrong without much harm).

That said, our SHARE Registry may be much more interested in such updates, since there we will want to keep a more coherent record of the resource. How would such updates reach the Registry if not through the NS? I don't have an answer to that.
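One way to picture the split described above: entries from a change feed are routed so that only newly created resources produce notifications, while every change remains available to something like the Registry. This is a sketch of the idea only, not SHARE's design; the function name, the tuple shape, and the assumption that change types are "created", "updated", and "deleted" are all illustrative.

```python
def route_changes(entries):
    """Split (uri, change) pairs into notifications and registry updates.

    Only "created" entries become notifications; all entries are kept
    for the registry so it can maintain a coherent record.
    """
    notifications = []
    registry_updates = []
    for uri, change in entries:
        if change == "created":
            notifications.append(uri)
        registry_updates.append((uri, change))
    return notifications, registry_updates


if __name__ == "__main__":
    # Invented entries for illustration.
    sample = [
        ("http://example.org/papers/a.abs", "updated"),
        ("http://example.org/papers/c.abs", "created"),
    ]
    print(route_changes(sample))
```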

JeffSpies commented 10 years ago

We get update and registry features for free using the stack we've built to build SHARE functionality. If subscribers don't want updates, they'll very easily be able to turn that off.

On Thu, Aug 14, 2014 at 9:29 AM, Eric Celeste notifications@github.com wrote:

Actually, I did not plan on the SHARE Notification Service sending any "update" notices at all. The NS is only intended to notify of the emergence of a new resource. The NS does not have to keep anything in sync, only keep a record of where the original resource resides (persistent URL) and a bit of descriptive metadata (which could be a touch wrong without much harm).

That said, our SHARE Registry may be much more interested in such updates, since there we will want to keep a more coherent record of the resource. How would such updates reach the Registry if not through the NS? I don't have an answer to that.


erinspace commented 10 years ago

Hi all!

Just wanted to rekindle the discussion on ResourceSync and arxiv. @zimeon, you mentioned the possibility of replicating one or more of the XML metadata formats available through OAI-PMH for ResourceSync, and yes that would absolutely be of interest! That would work perfectly. We're able to handle pretty much any format of data that's best for you, no need to come up with anything special. We've come up with a preliminary schema for the SHARE notification output, which you can see in more detail on the wiki.
The basics include:

We're not requiring every piece of data from each service, rather whatever we can gather.

Let us know if this seems reasonable!

erinspace commented 10 years ago

Hi all! So, we have a prototype version of the arXiv consumer, which grabs the arXiv IDs of all the items from the ResourceSync change list located at http://resync.library.cornell.edu/arxiv-all/changelist.xml and uses those IDs to query the arXiv export database to get the metadata in XML format.

I might be missing an obvious step here that would make this better! Comments and suggestions most welcome, both here and in the consumer issue #75
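A rough sketch of the approach described above, for readers who want to see the moving parts. It is not the prototype's code. It assumes change list URIs follow the pattern of the example given earlier in the thread (`.../ftp/arxiv/papers/0711/0711.0198.abs`), that old-style identifiers live under an archive directory such as `hep-th`, and that the export API is queried with its `id_list` parameter; the helper names are hypothetical.

```python
from urllib.parse import urlencode, urlparse
import posixpath

EXPORT_API = "http://export.arxiv.org/api/query"


def arxiv_id_from_uri(uri):
    """Derive an arXiv ID from a '.../ftp/<archive>/papers/<yymm>/<id>.abs' URI.

    Returns None when the URI does not match the assumed layout.
    """
    parts = urlparse(uri).path.strip("/").split("/")
    stem, ext = posixpath.splitext(parts[-1])
    if ext != ".abs" or len(parts) < 4 or parts[-3] != "papers":
        return None
    archive = parts[-4]
    # New-style IDs sit under 'arxiv'; old-style IDs carry the archive name.
    return stem if archive == "arxiv" else f"{archive}/{stem}"


def build_query_url(arxiv_ids):
    """Build one export API URL requesting metadata for all given IDs."""
    params = {"id_list": ",".join(arxiv_ids), "max_results": len(arxiv_ids)}
    # Keep commas and slashes readable instead of percent-encoding them.
    return EXPORT_API + "?" + urlencode(params, safe=",/")


if __name__ == "__main__":
    uris = [
        "http://resync.library.cornell.edu/arxiv/ftp/arxiv/papers/0711/0711.0198.abs",
    ]
    ids = [i for i in map(arxiv_id_from_uri, uris) if i]
    print(build_query_url(ids))
```

Batching several IDs into one request keeps the number of API calls down, which matters given the earlier note that the API is not intended for full harvesting.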

zimeon commented 10 years ago

If we were to make arXiv metadata available directly in XML via ResourceSync, what format would be most useful? Would it be the format of the arXiv API (as in http://export.arxiv.org/api/query?search_query=1011.2227) or one of the formats available via OAI-PMH (see http://arxiv.org/help/oa/index)? At the moment I don't think we'd want to come up with (yet) another XML format.

erinspace commented 10 years ago

Hi @zimeon - Either one would be just fine - we have parsers that could very easily handle either one of those! In fact I think one version of an arxiv consumer uses the arXiv API format already - though we also have a bunch of others that handle OAI-PMH format so it'd be incredibly easy to port. Whichever is best for you would be great for us!