Repository deposit information

richard-jones commented 8 years ago

Some concrete requirements from Monitor, which it would be good to provide:

Repository Name (this is from a controlled list)
Location URL (URL of the item in the repo)
Version Deposited (using the same terminology as RIOXX)
Date deposited

We should review what our options are on getting this information.

richard-jones commented 8 years ago

I have done a review of CORE, and believe our next best options for extending the deposit information are as follows:

Collect the Repository URL if it is available. In the CORE data this is in repositories.uri (see below for what we might do if this information is not available)
Collect the OAI identifier of the item, which will be in the field "oai"
Collect the fulltext URL of the item as it appears in the original repositories (see below for more details)

Repository base URL:

If there is no repository base URL, we can look in OpenDOAR (http://opendoar.org) or ROAR (http://roar.eprints.org) to map from the name of the repo to the URL. Probably the best thing to do is download and keep a local copy of their datasets (I haven't been able to locate the download option for OpenDOAR just now, but I'm sure there is one).

Fulltext URL:

To get the fulltext URL of the item from the original repositories, you can do the following with the CORE API:

https://core.ac.uk:443/api-v2/articles/get?metadata=true&fulltext=false&citations=false&similar=false&duplicate=false&urls=true&faithfulMetadata=false&apiKey=:yourapikey:

The important thing to note is "urls=true" in this request, which causes CORE to give you all the known fulltext URLs for the item. This includes the CORE ones, so we then need to filter those out and capture the fulltext URLs only for the external repositories.
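A minimal sketch of that filtering step, assuming the urls=true response has already been reduced to a plain list of URL strings. The function name and the sample CORE URL are made up for illustration; the exact shape of the CORE response is not confirmed here.

```python
from urllib.parse import urlparse


def external_fulltext_urls(urls):
    """Return only the URLs that are not hosted on core.ac.uk (hypothetical helper)."""
    external = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        # Drop core.ac.uk itself and any of its subdomains.
        if host == "core.ac.uk" or host.endswith(".core.ac.uk"):
            continue
        external.append(url)
    return external


urls = [
    "https://core.ac.uk/download/pdf/0000.pdf",  # invented CORE URL, illustration only
    "http://hdl.handle.net/2164/3837",
]
print(external_fulltext_urls(urls))
```

Matching on the parsed hostname rather than on a substring avoids discarding an external URL that merely mentions core.ac.uk somewhere in its path.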

In terms of how we handle this data, I suggest that internally we keep all this information in separate fields, such as:

repository
- name
- url
fulltext
- urls
- oais

And then in the download CSV compress this into two or three columns (not sure which is better):

Repository: Name (URL)
Repository Fulltext URLs: URLs
Repository OAI IDs: OAIs
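As a rough sketch of the three-column option, assuming the internal record is a plain dict with the repository and fulltext fields listed above. The function name, the comma separator inside a cell and the sample values are all assumptions for illustration.

```python
import csv
import io


def csv_columns(record):
    """Flatten one internal record into three CSV columns (layout is an assumption)."""
    repo = record.get("repository", {})
    fulltext = record.get("fulltext", {})
    name = repo.get("name", "")
    url = repo.get("url", "")
    return {
        "Repository": f"{name} ({url})" if url else name,
        "Repository Fulltext URLs": ", ".join(fulltext.get("urls", [])),
        "Repository OAI IDs": ", ".join(fulltext.get("oais", [])),
    }


# Invented sample record, not real data.
record = {
    "repository": {"name": "Example Repository", "url": "http://repo.example.org"},
    "fulltext": {
        "urls": ["http://repo.example.org/1234"],
        "oais": ["oai:repo.example.org:1234"],
    },
}

row = csv_columns(record)
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
print(out.getvalue())
```

The two-column variant would simply merge the last two columns; keeping the fields separate internally means either layout can be produced at download time.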

richard-jones commented 8 years ago

I think probably once we've achieved this, we may want to push on a little further. There are 2 things I'm considering:

There is a field called "repositoryDocument" in CORE, but I haven't yet seen a record which populates it - would be good to find out what's in it
Go direct to the repository for more metadata, and especially for deposit dates

markmacgillivray commented 8 years ago

Given issue #67 should this still be done now? I do not know if @emanuil-tolev did any more investigating into that issue or not, so @richard-jones let me know if you want this completed anyway.

richard-jones commented 8 years ago

We do need to obtain more information on the repositories, so the short answer is "yes".

The long answer is that we need to mitigate these problems with CORE - whether that's by cross-referencing with CrossRef to make sure we've identified the right record in CORE, or something else.

I have a bit more information we can use to test this now, so I'll get that into the issue tracker this evening.

richard-jones commented 8 years ago

Co-assigning myself, as I need to track how these changes affect the API; any changes to the API can now break the Monitor integration.

markmacgillivray commented 8 years ago

Updated on dev. Will push to live soon - @richard-jones read this and let me know if suitable for live first.

This searches CORE and stores a repositories key pointing to a list of objects, each of which can have the keys name, oai, url and fulltexts. If the CORE uri is present it becomes url; if not, OpenDOAR is searched and the rUrl it returns is used. fulltexts is all the URLs returned by the CORE search except those in the core.ac.uk domain.
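For anyone reading along, the logic described above can be sketched roughly as follows. This is not the code on dev: the function names are invented, and lookup_url is a hypothetical callable standing in for the OpenDOAR search.

```python
from urllib.parse import urlparse


def is_core_url(url):
    """True if the URL is hosted on core.ac.uk or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == "core.ac.uk" or host.endswith(".core.ac.uk")


def build_repository_entry(name, oai, core_uri, core_urls, lookup_url):
    """Assemble one repository object with the keys name, oai, url, fulltexts.

    lookup_url is a hypothetical stand-in for the OpenDOAR search: it takes a
    repository name and returns a URL, or None if nothing was found.
    """
    # Prefer the uri from CORE; only fall back to the lookup when it is missing.
    url = core_uri or lookup_url(name)
    entry = {
        "name": name,
        "oai": oai,
        "fulltexts": [u for u in core_urls if not is_core_url(u)],
    }
    if url:
        entry["url"] = url
    return entry


entry = build_repository_entry(
    name="Aberdeen University Research Archive",
    oai="oai:example:1234",  # invented identifier
    core_uri=None,
    core_urls=["https://core.ac.uk/display/0000", "http://hdl.handle.net/2164/3837"],
    lookup_url=lambda repo_name: "http://repository.example.ac.uk",  # invented result
)
print(entry)
```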

Formatting for Wellcome throws this info away, as repo info does not feature in their CSV. For Lantern, it gets changed to lists in keys called "Repositories", "Repository URLs", "Repository fulltext URLs", "Repository OAI IDs"

There is no longer an "Archived repositories" key in the results data.

Here is a request to CORE for an article in a repo that CORE does not return the URL for:

https://core.ac.uk/api-v2/articles/search/doi:%2210.1186/1471-2458-6-309%22?&urls=true&apiKey=

Here is an example call to OpenDOAR for the repo name of the above article:

http://opendoar.org/api13.php?fields=rname&kwd=Aberdeen%20University%20Research%20Archive

And the API "use" endpoint that queries it and converts the result to json:

https://dev.api.cottagelabs.com/use/opendoar/search/Aberdeen%2520University%2520Research%2520Archive

In which we can see the rUrl value, which is now used to update the repositories information.
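Note that the repo name is percent-encoded once in the OpenDOAR call and twice in the "use" endpoint URL. A small sketch that reproduces both URLs exactly as they appear above; whether the double encoding is required by the endpoint is an assumption, the thread does not say.

```python
from urllib.parse import quote

name = "Aberdeen University Research Archive"

# One round of percent-encoding: spaces become %20.
encoded_once = quote(name)
opendoar_url = "http://opendoar.org/api13.php?fields=rname&kwd=" + encoded_once

# Encoding the already-encoded value again turns each "%" into "%25".
encoded_twice = quote(encoded_once)
use_url = "https://dev.api.cottagelabs.com/use/opendoar/search/" + encoded_twice

print(opendoar_url)
print(use_url)
```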

Here is the internal result of a job submitted to the dev API, where you can see the repositories key and the repo info, including the fulltext URLs and the repo URL, and provenance statements to show that this took place:

https://dev.api.cottagelabs.com/service/lantern/w62Ky4S3hymaxpDgL/results

Here is the result formatted in csv for lantern:

https://dev.api.cottagelabs.com/service/lantern/w62Ky4S3hymaxpDgL/results?format=csv&wellcome=false

richard-jones commented 8 years ago

This is the ideal functionality, yes, thanks.

There's an issue with the fulltext urls, in that only 1 of them is the repository fulltext url:

fulltexts: [
    "http://dx.doi.org/10.1186/1471-2458-6-309",
    "http://hdl.handle.net/2164/3837",
    "http://creativecommons.org/licenses/by/2.0),",
    "http://www.biomedcentral.com/1471-2458/6/309"
],

The first is the original doi, the second is the repository copy, the third might not even resolve to anything, but is at best the licence, and the fourth I guess is the original.

Is this an issue with data coming from CORE?

markmacgillivray commented 8 years ago

Yep that's the data from CORE minus the URLs that were in the core.ac.uk domain. The creative commons one is broken, but again that's how it was in CORE.

richard-jones commented 8 years ago

Ok, we'll need to think a bit about how to clean that up, otherwise people will think that it's our fault. A couple of thoughts:

1/ Only take urls from the institution's domain
2/ Resolve all the urls and see if they resolve to the institution's domain (which would catch the handle server one, for example)

I'd then keep all the other urls somewhere else in the record, as that's useful information, but it isn't telling us where the repository copy is.
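A sketch of the first option, keeping the non-matching urls in a separate list rather than discarding them. The function name and the repository hostname are invented; this compares hostnames only and does no resolving, so a handle server url lands in the "other" list unless it has been resolved first.

```python
from urllib.parse import urlparse


def split_by_domain(urls, repo_url):
    """Split urls into (repository copies, everything else) by hostname."""
    repo_host = (urlparse(repo_url).hostname or "").lower()
    repository, other = [], []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if repo_host and (host == repo_host or host.endswith("." + repo_host)):
            repository.append(url)
        else:
            other.append(url)
    return repository, other


repo_copies, other_urls = split_by_domain(
    [
        "http://dx.doi.org/10.1186/1471-2458-6-309",
        "http://repo.example.ac.uk/handle/2164/3837",  # invented repository url
        "http://www.biomedcentral.com/1471-2458/6/309",
    ],
    repo_url="http://repo.example.ac.uk",  # invented repository base url
)
print(repo_copies)
print(other_urls)
```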

richard-jones commented 8 years ago

What would also be very useful would be if we can tell if these are fulltexts or just metadata-only records. Can we distinguish between the repository splash page and the actual fulltexts on that page?

The objective will be to distinguish between a repository which only has the metadata, and a repository which has both the metadata and the fulltexts. Is that possible using the data from CORE? e.g. do they say whether there is a PDF that they've mined?

markmacgillivray commented 8 years ago

Even if they did, how would we know which URL it is at, short of looking to see if it ends with "PDF"? Another alternative is guessing based on page body length.

richard-jones commented 8 years ago

It's probably even more difficult than that, because for example that handle URL will be a link to the repository splash page, and so you would actually have to know that there were fulltext links /on/ that splash page before you made a call either way.

I know that CORE do fulltext indexing, which I presume they do by mining the files they discover via OAI, so I was wondering if it was possible to tell whether they'd done any fulltext indexing for a given record, which would tell us if it was fulltext.

markmacgillivray commented 8 years ago

We'd probably have to look at fields other than the fulltext URLs field then, and just ignore the fulltext URLs. Splash pages certainly can't be reliably trawled for further links to fulltext. The best I've tried in the past is this: HTML fulltext is usually in the page itself, so you can check by length; alternatively, look for a PDF link on the page that can actually be retrieved. But the only way to identify such a link is to assume that having the string "PDF" in it means it is a PDF - then it can be tried. Casual observation on contentmine work suggests that very often such links do have the string "PDF" in them (not necessarily capitalised) - not always at the end as a file type, but either at the end or somewhere in the address after the main domain part.
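The "PDF" string heuristic could look something like this. It is only a guess about the link, not a check of what is actually served, and the function name and sample URLs are invented.

```python
from urllib.parse import urlparse


def might_be_pdf_link(url):
    """Guess whether a link points at a PDF: "pdf" (any case) after the domain part."""
    parts = urlparse(url)
    # Look only at the path and query, so a "pdf" in the hostname does not count.
    after_domain = (parts.path + "?" + parts.query).lower()
    return "pdf" in after_domain


print(might_be_pdf_link("http://repo.example.org/bitstream/1234/paper.PDF"))
print(might_be_pdf_link("http://repo.example.org/download?format=pdf&id=1234"))
print(might_be_pdf_link("http://pdf.example.org/handle/1234"))
```

A link that passes this test would still need to be retrieved to confirm it really is a PDF.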

markmacgillivray commented 8 years ago

Have added a check that the URL is retrievable and, if it is, it resolves the URL. If the repo URL is known, then the resolved URL is only saved if the domain of the repo URL is in the resolved URL; if the repo URL is not known, then the resolved URL is kept anyway. We still won't know if we have pointed to a metadata record, a fulltext, or a page with a link to a fulltext, but at least now it throws away duplicates and failures, and resolves down to an actual repo URL. Given that CORE is supposed to be an index of stuff in repos, I think we could expect this to usually find something suitable, if there are any fulltext URLs in the data returned from CORE.
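Roughly, the check described above amounts to the following. This is a sketch, not the deployed code: the function names are invented, and the resolver is passed in so the example can run against a made-up redirect table instead of the network.

```python
import http.client
import urllib.request
from urllib.parse import urlparse


def resolve(url, timeout=10):
    """Follow redirects and return the final URL, or None if retrieval fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.geturl()
    except (OSError, ValueError, http.client.HTTPException):
        return None


def clean_fulltexts(urls, repo_url=None, resolver=resolve):
    """Resolve each URL, drop failures and duplicates, filter by repo domain if known."""
    repo_host = (urlparse(repo_url).hostname or "") if repo_url else ""
    kept = []
    for url in urls:
        final = resolver(url)
        if final is None:
            continue  # not retrievable
        if repo_host and repo_host not in final:
            continue  # repo is known and this URL does not resolve into it
        if final not in kept:
            kept.append(final)
    return kept


# Invented redirect table for illustration; these are not real resolution results.
fake_redirects = {
    "http://dx.doi.org/10.1186/1471-2458-6-309": "http://www.biomedcentral.com/1471-2458/6/309",
    "http://hdl.handle.net/2164/3837": "http://repo.example.ac.uk/handle/2164/3837",
    "http://www.biomedcentral.com/1471-2458/6/309": "http://www.biomedcentral.com/1471-2458/6/309",
}
fulltexts = [
    "http://dx.doi.org/10.1186/1471-2458-6-309",
    "http://hdl.handle.net/2164/3837",
    "http://creativecommons.org/licenses/by/2.0),",
    "http://www.biomedcentral.com/1471-2458/6/309",
]
print(clean_fulltexts(fulltexts, "http://repo.example.ac.uk", resolver=fake_redirects.get))
print(clean_fulltexts(fulltexts, resolver=fake_redirects.get))
```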

markmacgillivray commented 8 years ago

Have added to live. @richard-jones I will close this issue. Finding other info in other parts of the CORE response seems beyond what we have done here, but could be worth looking into later, so make a new issue for that if you want.

CottageLabs / LanternPM

Repository deposit information #43