adsabs / adsabs-dev-api

Developer API service description and example client code

Metrics API #14

Closed andycasey closed 8 years ago

andycasey commented 8 years ago

Hola,

Is there any (unofficial or otherwise) documentation about the capabilities of the metrics API? Specifically if one were interested in sorting astronomers by some field (e.g., normalised citations), would something like that be available through the metrics API?

At the moment, the only way I can see to do something like this would be to find highly cited papers (with a wildcard search, sorted by citations), get the names of the authors from the first X papers, then start searching for papers by those names in order to (reasonably) rank the top people publishing by their normalised citations. However, if one wanted to know, for example, the top 1000 astronomers as ranked by normalised citations, this becomes an expensive exercise.

So, I'm just wondering if the metrics API will have any kind of capabilities like this, or doing something like I propose is the best way forward for the immediate future.

ehenneken commented 8 years ago

Hi Andy

Basically, the metrics API returns the same results as the old API; only the format has been changed a bit. I will update the README for the metrics API soon to document the format.

cheers --Edwin

Edwin Henneken, ehenneken@cfa.harvard.edu
NASA Astrophysics Data System, IT Specialist
Harvard - Smithsonian Center for Astrophysics
http://adslabs.org, http://ads.harvard.edu
60 Garden St. MS 83, Cambridge, MA 02138, Room P-129

ORCID 0000-0003-4264-2450

andycasey commented 8 years ago

Hey Edwin,

Thanks for that! From reading through https://github.com/adsabs/metrics_service (to refresh myself on the API) it seems that it is easy to retrieve detailed metrics for given bibcodes. However it seems to me that it might be more difficult to aggregate these by authors in order to rank astronomers in the way I described.

For the given example (top ranked N authors by normalised citations) would you say that searching for highly-cited papers, then getting citation metrics for papers published by the authors of those papers, would currently be the most efficient way of compiling such a list?

romanchyla commented 8 years ago

the most efficient way would be to use functional queries - give me until later, i'll try to come up with an example...

do you want to run it against a list of bibcodes/authors? and normalized by the highest citation count?

ehenneken commented 8 years ago

Hi Andy

Any practical way to do what you propose will be expensive. Essentially, it really only makes sense if you either have curated publication lists for those astronomers, or if searching by ORCID has been implemented and ORCIDs have been assigned. So, if you want to have lists sorted by a certain statistic or indicator, you first need to do a query to get all the papers for a given author and then generate the metrics overview for those records. We are looking into making metrics generation more scalable and flexible, but that's still under development.

Note that different disciplines/fields have different citation rates/practices. Percentile based indicators usually are better, and there is also the Tori index, which removes discipline-dependent rates by means of its double normalization.

Whatever way you pick to generate metrics for a given author, you always will have the potential name ambiguity problem (and with authors publishing in both astronomy and physics, chemistry or biology journals, this gets even worse).

ehenneken commented 8 years ago

Andy

With "normalized citation count", I assume you mean the sum of the citations to the papers by a given author, divided by the number of authors of the paper that was cited, correct? So, Edward Witten has a very big normalized citation count, while most people in big collaborations don't.
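Written out, that reading of "normalized citation count" would look like the following. The symbols are introduced here for illustration only: $P_a$ is the set of papers by author $a$, $c_p$ is the number of citations to paper $p$, and $n_p$ is the number of authors on paper $p$.

```latex
N(a) = \sum_{p \in P_a} \frac{c_p}{n_p}
```

Under this definition a single-author paper contributes all of its citations to its author, while a paper with hundreds of authors contributes only a small fraction to each of them.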

romanchyla commented 8 years ago

ok, i failed - results of functional queries cannot be faceted, i tried facet.pivot but it is too slow (and for api users it will time out); we run solr 4.8 and there is potentially a solution in solr 5.0 - to compute stats for individual facets. However, these solutions are unrealistic because they are too slow - it would have to:

  1. search for astrophysics papers (i.e. topn(1000, database:astronomy, citation_count desc))
  2. facet the set by authors
  3. for every author do facet.pivot on citations
  4. compute the stat for result of 3
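For comparison, the same four steps can be done client-side once the records have been downloaded. The sketch below is only an illustration under stated assumptions: it assumes each record is a dict carrying the `author` and `citation_count` fields, it uses citations divided by the number of authors as the statistic, and the function name `rank_authors`, the bibcodes, and the author names are all made up. It does not resolve name ambiguity.

```python
from collections import defaultdict


def rank_authors(papers, top_n=10):
    """Rank author names by the sum of their per-paper citation shares.

    Each paper contributes citation_count / number_of_authors to every
    author listed on it. Records without authors are skipped.
    """
    totals = defaultdict(float)
    for paper in papers:
        authors = paper.get("author") or []
        if not authors:
            continue
        share = paper.get("citation_count", 0) / len(authors)
        for name in authors:
            totals[name] += share
    # Highest total first; ties are broken alphabetically by name.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Made-up records standing in for downloaded search results.
papers = [
    {"bibcode": "A", "author": ["Solo, A."], "citation_count": 90},
    {"bibcode": "B", "author": ["Solo, A.", "Pair, B."], "citation_count": 20},
    {
        "bibcode": "C",
        "author": ["Pair, B."] + [f"Member, {i}" for i in range(99)],
        "citation_count": 500,
    },
]

ranking = rank_authors(papers, top_n=2)
print(ranking)
```

Note that the 500-citation paper with 100 authors adds only 5 to each author's total, which is why the sole author of a 90-citation paper ends up ranked first here.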

multivalued fields are very slow

however, the following could get you started - it returns authors of top cited papers in astronomy in year 2015, when you grab the facet (authors), you can then quickly collect metrics for these names (btw: metrics accepts a query, you don't need to search by bibcode only)

q=topn(1000%2C+database%3Aastronomy+AND+year%3A2015%2C+citation_count+desc)&sort=citation_count+desc&fl=bibcode%2Ccitation_count&wt=json&indent=true&facet=true&facet.field=author&facet.mincount=1
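As a rough sketch, the same query string can be built from readable parameters in Python, and the author facet can be unpacked into (name, count) pairs. Assumptions here: the base URL `SEARCH_URL` is not given in this thread and is only a guess at the search endpoint, the helper name `facet_pairs` is made up, the sample response is invented, and the facet is assumed to arrive in Solr's default flat `[name, count, name, count, ...]` JSON layout. No request is actually sent.

```python
from urllib.parse import urlencode

# Assumed endpoint; check the API README for the real one.
SEARCH_URL = "https://api.adsabs.harvard.edu/v1/search/query"

# The decoded form of the query string above.
params = {
    "q": "topn(1000, database:astronomy AND year:2015, citation_count desc)",
    "sort": "citation_count desc",
    "fl": "bibcode,citation_count",
    "wt": "json",
    "indent": "true",
    "facet": "true",
    "facet.field": "author",
    "facet.mincount": "1",
}

query_string = urlencode(params)
request_url = f"{SEARCH_URL}?{query_string}"


def facet_pairs(flat):
    """Turn a flat [name, count, name, count, ...] list into (name, count) tuples."""
    return list(zip(flat[0::2], flat[1::2]))


# Invented response fragment, shaped like a Solr facet result.
sample_response = {
    "facet_counts": {"facet_fields": {"author": ["Doe, J", 3, "Roe, R", 1]}}
}
authors = facet_pairs(sample_response["facet_counts"]["facet_fields"]["author"])
print(request_url)
print(authors)
```

`urlencode` percent-encodes the parentheses as well, so the generated string is not byte-identical to the one above, but it decodes to the same parameters. The resulting author names can then be fed, one at a time, into a metrics request.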

however, the usual caveats apply: name could belong to multiple people; the starting criteria are arbitrary (first 1000 papers)

it is not easy to compile the "scientific hitparade"

andycasey commented 8 years ago

Thanks for the ideas on how to deal with this problem! I came up with some code to be able to do this kind of query, or at least approximate what the distribution looks like at the top end. I used @romanchyla 's idea by going from the top cited papers and then searching by authors. My code is not the most efficient query, but it certainly got the job done (and faster than what I expected).

Happy to close this issue if you are. Thanks again!