ContentMine / getpapers

Get metadata, fulltexts or fulltext URLs of papers matching a search query
MIT License

Improve CrossRef #112

Closed tarrow closed 8 years ago

tarrow commented 8 years ago

This patch:

blahah commented 8 years ago

@tarrow OK just taking a look through.

I'm not keen on polluting the interface with lots of new options, especially if they only apply to one API. I think if we want to have API-specific options we should have a layered help interface.

If a user runs a filter option with the wrong API (anything that's not Crossref), we should error out with a message explaining that filters aren't implemented for the other APIs.
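A minimal sketch of that check, written in Python purely for illustration; it is not getpapers code. The function name `check_filter_support`, the set `FILTER_CAPABLE_APIS`, and the API identifiers are all hypothetical.

```python
# Hypothetical names throughout; this is an illustration, not getpapers code.
FILTER_CAPABLE_APIS = {"crossref"}  # assumption: only Crossref implements filters


def check_filter_support(api, filter_value):
    """Raise a descriptive error if a filter is passed to an API without filter support."""
    if filter_value is not None and api not in FILTER_CAPABLE_APIS:
        supported = ", ".join(sorted(FILTER_CAPABLE_APIS))
        raise ValueError(
            f"filters are not implemented for the '{api}' API; "
            f"they are only supported for: {supported}"
        )
```

Running the check before any request is sent means the user sees the explanation immediately, rather than getting unfiltered results back from an API that silently ignores the option.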

But perhaps a bigger thing here is that all the different filter options so far ...

    --filter-from-index-date <date>     filter only papers indexed after date (inclusive)
    --filter-until-index-date <date>    filter only papers indexed before date (inclusive)
    --filter-from-pub-date <date>       filter only papers published after date (inclusive)
    --filter-until-pub-date <date>      filter only papers published before date (inclusive)
    --filter-until-created-date <date>  filter only papers created before date (inclusive)
    --filter-from-created-date <date>   filter only papers created from date (inclusive)

... are moving towards having our own syntax for querying. This has been on the to-do list for a while, and I think that if we're going to do it we should just do it fully and apply it to all the APIs.

That being said the filters seem to work nicely and the bug fixes in this PR are needed. I'd like to separate them out.

If the --filter* options are useful for the contentmine daily run, we should keep them in a branch that can be used there while the full syntax and interface is being implemented.

blahah commented 8 years ago

So, please can you split out the bug fixes from the crossref filtering stuff? Then we can merge the bug fixes and cut a release.

tarrow commented 8 years ago

Sure; what would you think about just retaining the --filter option, making it explicitly for the Crossref API, and warning if it is used with another API?

What I had hoped to do was find some common ground between the APIs and add similar filter options in good time. Gradually I realised (as you can see from all the options) that it's actually really hard to find a useful "published on" date.

petermr commented 8 years ago

On Thu, Jul 7, 2016 at 9:46 AM, tarrow notifications@github.com wrote:

Sure; what would you think about just retaining the --filter option, make it explicitly for the crossref api and warn if used with another api?

It's technically possible just to use the --filter API to achieve these options. Something like:

--filter from-index-date:2016-06-22,until-index-date:2016-06-30,publisher-name:"Elsevier BV"

but the quoting gets pretty messy.
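To illustrate where the quoting trouble comes from, here is a small Python sketch (for illustration only, not getpapers code). It assumes the filter value is a single comma-separated string of `name:value` pairs, as in the example above; the helper name `build_filter` is hypothetical. Values without spaces pass through the shell untouched, but a value such as a publisher name with a space forces the whole argument to be quoted.

```python
import shlex


def build_filter(filters):
    """Join name/value pairs into one comma-separated 'name:value' string."""
    return ",".join(f"{name}:{value}" for name, value in filters.items())


# Filter names and values copied from the example in this thread.
filter_arg = build_filter({
    "from-index-date": "2016-06-22",
    "until-index-date": "2016-06-30",
    "publisher-name": "Elsevier BV",
})

# shlex.quote adds shell quoting only when the string needs it;
# the space in "Elsevier BV" is what triggers it here.
shell_safe = shlex.quote(filter_arg)
```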

The main reason is that the Crossref api is rich and includes at least 3 independent syntaxes (and excuse any errors!)

api.crossref.org/works?filter=publisher-name:BMJ&rows=200

The first field (works) is the type of information queried, the second (filter=...) is the filter query, and the third (rows=200) limits the output.
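As a rough illustration of those three parts, the Python sketch below splits the example URL using only the standard library. The `https://` scheme is an assumption added so the URL parses (the example above omits it), and the function name `split_crossref_url` is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def split_crossref_url(url):
    """Return (resource type, filter string, row limit) from a Crossref-style URL."""
    parts = urlsplit(url)
    resource = parts.path.strip("/")        # type of information queried
    params = parse_qs(parts.query)
    filter_query = params["filter"][0]      # the filter expression
    rows = int(params["rows"][0])           # limit on the number of results
    return resource, filter_query, rows


# Scheme added as an assumption; the rest is the URL from the example above.
example = "https://api.crossref.org/works?filter=publisher-name:BMJ&rows=200"
resource, filter_query, rows = split_crossref_url(example)
```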

It's difficult to see how this can be fully supported in the current approach unless we pass queries verbatim or abstract the concepts. This abstraction may well be required in the medium term. Note that we are also having to abstract the metadata, which is a large job. Until that's done we may have to pass API-specific stuff.

My guess is that over the next few weeks we may wish to normalize some of the commoner metadata both for queries and for output.


tarrow commented 8 years ago

I'm closing this pull request for the time being. I have made an alternative, #115, which doesn't include the barrage of filter options (but does include --filter, so no actual functionality is lost).

It also includes the same bug fixes as this patch.