OpenJournal / central

Universalizing Open-Access Journals & Papers
Creative Commons Zero v1.0 Universal

Discussing Priorities & Direction #10

Open mekarpeles opened 8 years ago

mekarpeles commented 8 years ago

Should we prioritize:

davidar commented 8 years ago

My top three:

  1. decentralise, or at least make content and metadata easy to mirror (single points of failure are bad, and geographic redundancy within a single organisation isn't enough)
  2. a standard lossless metadata format (Dublin Core is too lossy, and nonstandard XML schemas are difficult to work with)
  3. crawling infrastructure for existing repositories (to ease migration to the above two points)

I think the interface related stuff can run in parallel to these.

wetneb commented 8 years ago

Thinking about protocols and metadata formats would be very interesting indeed, especially since many people from different backgrounds have joined. What would be the scope of it? Designing our own decentralized storage and metadata format for our own use? Or designing a better OAI-PMH (say) that we would like content providers to adopt? The latter is a very long shot (but exciting) and has a heavy political component. People at OpenAIRE+ have been trying to do this (basically they promote their own enhanced version of oai_dc, and are gaining momentum): https://www.mail-archive.com/goal@eprints.org/msg11122.html

mekarpeles commented 8 years ago

@wetneb I'll try to invite someone from OpenAIRE to the community. I can see how a better OAI-PMH could be useful (something with less friction for pub/sub + handling callbacks). At the same time, BASE and CORE have demonstrated well that the very existence of OAI-PMH has allowed us to apply the Pareto principle (80% of the value for 20% of the work, in this case). Perhaps we can identify which remaining sources don't use OAI-PMH at all?
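For reference, harvesting over OAI-PMH is mostly a matter of paging through `ListRecords` responses and following `resumptionToken`s. A minimal sketch of parsing one such response, using only Python's standard library and a hand-written sample response in place of a real endpoint:

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# A tiny hand-written ListRecords response standing in for a real
# endpoint's output; real harvests fetch this XML over HTTP.
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1234</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An Example Paper</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

def parse_list_records(xml_text):
    """Return ([(identifier, title), ...], resumption_token_or_None)."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iter(OAI + "record"):
        ident = rec.findtext(f"{OAI}header/{OAI}identifier")
        title = rec.findtext(f".//{DC}title")
        records.append((ident, title))
    # A non-empty resumptionToken means another page should be fetched.
    token = root.findtext(f".//{OAI}resumptionToken")
    return records, token

records, token = parse_list_records(SAMPLE)
```

A full harvester would loop, re-requesting with `resumptionToken=page-2` until the token is empty; the sample identifier and title above are invented.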

Redundancy. As @davidar suggests, I do think having a policy for redundancy is important, e.g. a project like BASE being able to determine where else a paper lives. I think IPFS could alleviate a lot of the contention over who owns what moving forward.

Dissem.in and CiteSeerX are in really interesting spaces -- tools and crawlers for collecting and classifying papers. I think raising awareness about tools and doing more research to create a coherent narrative between these tools can have a big impact. For instance, EIFL and Dissem.in have a lot in common but likely aren't leveraging each other as much as they could.

I think doing a survey to determine what projects are out there and what their goals and needs are, and then writing a paper on the results, could be a good way to determine what's next. Also, perhaps we can work together to create a website for discovering the right tools, like Thomas Crouzier has done: http://connectedresearchers.com/online-tools-for-researchers/

aeschylus commented 8 years ago

On the topic of decentralisation, what do people think about IPFS for a "mirror" of the content, and Mediachain (which is based on IPLD) for storing metadata? This would make it easy for anyone to contribute to guaranteed access by "pinning" the relevant files, and to keep the metadata representations synchronised across systems. One major problem with OAI over the years has been synchronising repository representations, which repositories do the same way every other library system tries to do anything: by explicitly describing each change as yet another publication.

Something like IPFS/IPLD/Mediachain, or even just torrents, would give a deeper guarantee at the "computer science" level of protocols.
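To make that contrast concrete: in a content-addressed system, a mirror doesn't need a change log to know whether it holds the same paper, because identical bytes hash to the identical address. A toy illustration (plain sha256 here as a stand-in for IPFS's multihash-based addresses, which it is not):

```python
import hashlib

def content_address(data: bytes) -> str:
    # Plain sha256 hex digest standing in for an IPFS address;
    # real IPFS wraps the digest in a multihash/CID encoding.
    return hashlib.sha256(data).hexdigest()

paper_v1 = b"Introduction. We consider ..."
mirror_copy = b"Introduction. We consider ..."
paper_v2 = b"Introduction. We instead consider ..."

# Two mirrors holding identical bytes agree on the address with no
# coordination; any edit yields a new address (a new "publication").
assert content_address(paper_v1) == content_address(mirror_copy)
assert content_address(paper_v1) != content_address(paper_v2)
```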

aeschylus commented 8 years ago

Just noticing that Mek has already mentioned IPFS. I'll just +1 it. What would be a next step pilot for the redundancy goal? This would let us evaluate if IPFS/IPLD/Mediachain are a good choice.

mekarpeles commented 8 years ago

So, something fairly monumental is in the works. I just spoke with @jjjake and @wumpus at the Archive about running a Pilot Program to distribute + decentralize Open Access publications across all OpenJournal partners using IPFS.

@jbenet, @aeschylus, @MikeTaylor, @mwojnars, @davidar, @gdamdam, @pietsch, @cleegiles, @wetneb -- the plan is to start w/ a source like BASE (or CORE, DOAJ, paperity.org). I will use the Internet Archive's infrastructure to upload the first 10,000 papers in the collection as items into Archive.org and then take these 10,000 items and put them in the Internet Archive Labs' IPFS node. We'd like to encourage BASE, DOAJ, CORE, PLOS, and all our other able partners to do the same thing -- contributing a pilot IPFS node w/ the next 10,000 contiguous blocks of papers.

In order for this to work, the Internet Archive will need a "registry" database for itself which will map Archive.org-specific item identifiers (sha256 hashes) to their corresponding IPFS hashes. We imagine other institutions will need something similar (a mapping between their ID space and the IPFS hash space). Please let me know if your institution needs help.
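A minimal sketch of what such a registry could look like, using sqlite with illustrative (not the Archive's actual) table and column names:

```python
import sqlite3

# In-memory registry mapping an institution's item identifier
# (here, a sha256 of the item) to the IPFS hash of the same content.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE registry (
    item_sha256 TEXT PRIMARY KEY,
    ipfs_hash   TEXT NOT NULL
)""")

def register(item_sha256, ipfs_hash):
    db.execute("INSERT OR REPLACE INTO registry VALUES (?, ?)",
               (item_sha256, ipfs_hash))

def lookup(item_sha256):
    row = db.execute("SELECT ipfs_hash FROM registry WHERE item_sha256 = ?",
                     (item_sha256,)).fetchone()
    return row[0] if row else None

# Both values below are invented placeholders, not real hashes.
register("9f86d081deadbeef", "QmExampleHash")
```

Each partner would keep its own such mapping, keyed by whatever identifier scheme it already uses internally.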

¡Viva la Revolución!

davidar commented 8 years ago

@wetneb I'd say leading by example would be a good first step, and others can adopt it if we can show it works well

@aeschylus yeah, IPLD for metadata, with format based on something like citeproc-json (already being used by crossref et al), or maybe one of the schema.org types, was what I had in mind
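For a sense of what that could look like: a citeproc-JSON-style record, canonically serialised and content-addressed so that independent nodes converge on the same metadata object. Plain sha256 stands in for an IPLD link here, and the field values are made up:

```python
import hashlib
import json

# A citeproc-JSON-style record (CSL-JSON uses fields like these);
# all values are invented for illustration.
record = {
    "type": "article-journal",
    "title": "An Example Paper",
    "author": [{"family": "Doe", "given": "Jane"}],
    "DOI": "10.0000/example.1234",
    "issued": {"date-parts": [[2016, 3]]},
}

def metadata_address(obj) -> str:
    # Canonical serialisation (sorted keys, no stray whitespace) so
    # that every node hashes the same record to the same address.
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

addr = metadata_address(record)
```

The canonicalisation step is the important part: without it, two nodes serialising the same record with different key order or whitespace would disagree on its address.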

@mekarpeles awesome :D