Exceeded Scopus Search API quota

mpbraendle commented 7 years ago

For large repositories, there may be cases where the Scopus Search API quota is exceeded (HTTP Status code 429). It is 20'000 queries/7 days.

Suggested options:

Distribute the queries on 2 months instead of 1 month.
Rewrite the code to use the Abstract Citations Count API instead (Endpoint http://api.elsevier.com/content/abstract/citation-count, Documentation on http://api.elsevier.com/documentation/AbstractCitationCountAPI.wadl). Requires different API key. Or use a combination of Search API (title, creators, ISBN etc.) and Abstract Citation Count API (Scopus eid, DOI or PMID known). The Abstract Citations Count API has a quota of 50'000 queries/7 days.

phluid61 commented 7 years ago

In the past I believe we did use (or intended to use) one endpoint to find the Scopus eid, and another to look up the actual Citation Count. If that's a way to reduce the quota usage I think it's very worth re-investigating.

I'll have to dig into the history some more, and also look at exactly what changes happened recently when the APIs were upgraded (I'm pretty sure I made those changes, so it shouldn't be that hard to rediscover), and try to reconstruct the business requirements and decisions that lead to them.

We have around 76,000 live records in our repository, so it's quite likely we were about to butt into the 20k/week limit soon, too.

mpbraendle commented 7 years ago

Thank you. I'm currently testing. For the bi-monthly option, also the perldoc documentation must be updated.

phluid61 commented 7 years ago

It is (https://github.com/QUTlib/citation-import/commit/dd5f8d75327e15c0dc5549e6a9897cf66ba0489c). There have been a few commits in the development branch; I only tagged the issue on one of them.

mpbraendle commented 7 years ago

Your code gave several compilation errors: (two lines removed around Findbin, but instead perl header line modified)

Parentheses missing around "my" list at ./update_citationdata line 462. Parentheses missing around "my" list at ./update_citationdata line 509. Operator or semicolon missing before %ids_update at ./update_citationdata line 696. Ambiguous use of % resolved as operator % at ./update_citationdata line 696. Operator or semicolon missing before %ids_update at ./update_citationdata line 699. Ambiguous use of % resolved as operator % at ./update_citationdata line 699. Operator or semicolon missing before %ids_update at ./update_citationdata line 709. Ambiguous use of % resolved as operator % at ./update_citationdata line 709. Operator or semicolon missing before %ids_update at ./update_citationdata line 712. Ambiguous use of % resolved as operator % at ./update_citationdata line 712. Operator or semicolon missing before %ids_update at ./update_citationdata line 720. Ambiguous use of % resolved as operator % at ./update_citationdata line 720. Operator or semicolon missing before %ids_update at ./update_citationdata line 724. Ambiguous use of % resolved as operator % at ./update_citationdata line 724. Operator or semicolon missing before %ids_update at ./update_citationdata line 735. Ambiguous use of % resolved as operator % at ./update_citationdata line 735. Operator or semicolon missing before %ids_update at ./update_citationdata line 738. Ambiguous use of % resolved as operator % at ./update_citationdata line 738. Operator or semicolon missing before %ids_update at ./update_citationdata line 752. Ambiguous use of % resolved as operator % at ./update_citationdata line 752. Operator or semicolon missing before %ids_update at ./update_citationdata line 755. Ambiguous use of % resolved as operator % at ./update_citationdata line 755. Global symbol "$d2" requires explicit package name at ./update_citationdata line 462. Global symbol "$d2" requires explicit package name at ./update_citationdata line 469. Global symbol "$d2" requires explicit package name at ./update_citationdata line 476. Global symbol "$d2" requires explicit package name at ./update_citationdata line 490. Global symbol "$d2" requires explicit package name at ./update_citationdata line 509. Global symbol "$d2" requires explicit package name at ./update_citationdata line 516. Global symbol "$d2" requires explicit package name at ./update_citationdata line 523. Global symbol "$d2" requires explicit package name at ./update_citationdata line 537. syntax error at ./update_citationdata line 804, near "%h{ " Execution of ./update_citationdata aborted due to compilation errors.

I transferred your changes to my version of update_citationdata and removed the errors.

Furthermore, perlcritic gave the following warnings: I/O layer ":utf8" used at line 286, column 10. Use ":encoding(UTF-8)" to get strict validation. (Severity: 5) Two-argument "open" used at line 297, column 5. See page 207 of PBP. (Severity: 5) Subroutine prototypes used at line 804, column 1. See page 194 of PBP. (Severity: 5)

I made some adjustments:

instead of hpush %var, $var2 used hpush(%var, $var2)

new hpush subroutine

sub hpush { my (%h, $k) = @_; $h{$k} = 1; }

Code runs now with sudo -u apache ./update_citationdata archivename scopus --bi-monthly --test

but yields the following warnings/errors: Use of uninitialized value $k in hash element at ./update_citationdata line 815, line 55. Odd number of elements in hash assignment at ./update_citationdata line 814, line 55. . .

What was the idea behind hpush?

With get_eprintids_for_day etc. you get back an array. You pass that as reference to hpush, which is used as key in the hash?

mpbraendle commented 7 years ago

I assume that you want to do like this:

sub test_eprint_selection_algorithm { my ( $session, $dataset, $test_year, $testmonth, $period ) = @;

# get a list of all eprint IDs in the dataset
my $searchexp = EPrints::Search->new(
    session => $session,
    dataset => $dataset,
    custom_order => "eprintid",
);
my $list = $searchexp->perform_search;
my @ids_all = @{$list->ids( 0, $list->count )};
$list->dispose;

if ($bimonthly && $test_month % 2 == 0)
{
    # Feb Apr Jun Aug Oct Dec -- start a month earlier
    $test_month--;
}

# build a list of the eprints to be updated, and sort into ascending order  
my $ids_update;
my $d = Date::Calc::Days_in_Month( $test_year, $test_month );
if ( $period eq "day" )
{
    # get the eprint IDs for every day
    for ( my $test_day = 1; $test_day <= $d; $test_day++ )
    {
        hpush( $ids_update, get_eprintids_for_day( $session, $dataset, $test_year, $test_month, $test_day ));
        if ( $test_day > 1 )
        {
            hpush( $ids_update, get_eprintids_for_yesterday( $session, $dataset, $test_year, $test_month, $test_day ));
        }
    }

    if ($bimonthly)
    {
            $test_month++;
            $d = Date::Calc::Days_in_Month( $test_year, $test_month );

            for ( my $test_day = 1 ; $test_day <= $d ; $test_day++ )
            {
            hpush( $ids_update, get_eprintids_for_day( $session, $dataset, $test_year, $test_month, $test_day));
            if ($test_day > 1)
            {
                hpush( $ids_update, get_eprintids_for_yesterday( $session, $dataset, $test_year, $test_month, $test_day ));
            }
        }
    }

    # add the eprints that were added on the last day of the month
    if ($test_month == 12)
    {
        hpush( $ids_update, get_eprintids_for_yesterday( $session, $dataset, $test_year + 1, 1, 1 ));
    }
    else
    {
        hpush( $ids_update, get_eprintids_for_yesterday( $session, $dataset, $test_year, $test_month + 1, 1 ));
    }

}
else
{
    # get the eprint IDs for every hour
    for ( my $test_day = 1; $test_day <= $d; $test_day++ )
    {
        for ( my $test_hour = 0; $test_hour < 24; $test_hour++ )
        {
            hpush( $ids_update, get_eprintids_for_hour( $session, $dataset, $test_year, $test_month, $test_day, $test_hour ));
            if ( $test_day > 1 || $test_hour > 0 )
            {
                hpush( $ids_update, get_eprintids_for_last_hour( $session, $dataset, $test_year, $test_month, $test_day, $test_hour ));
            }
        }
    }

    if ($bimonthly)
    {
        $test_month++;
        $d = Date::Calc::Days_in_Month( $test_year, $test_month );

            for ( my $test_day = 1 ; $test_day <= $d ; $test_day++ )
            {
            for ( my $test_hour = 0 ; $test_hour < 24 ; $test_hour++ )
            {
                hpush( $ids_update, get_eprintids_for_hour( $session, $dataset, $test_year, $test_month, $test_day, $test_hour ));
                if ( $test_day > 1 || $test_hour > 0 )
                {
                    hpush( $ids_update, 
                    get_eprintids_for_last_hour( $session, $dataset, $test_year, $test_month, $test_day, $test_hour ));
                }
            }
        }
    }
}

my @ids_sorted = sort { $a <=> $b } keys %{$ids_update};

# compare the two lists
my $n_all = scalar @ids_all;
my $n_update = scalar @ids_sorted;
my $i = 0;
print "Update\tAll\n" if $session->{noise} > 1;
while ( $i < $n_all && $i < $n_update && $ids_sorted[$i] == $ids_all[$i] )
{
    print $ids_sorted[$i] . "\t" . $ids_all[$i] ."\n" if $session->{noise} > 1;
    $i++;
}
if ( $i == $n_all && $n_all == $n_update )
{
    print "Test successful.\n" if $session->{noise} > 0;
    return 0;
}
elsif ( $i == $n_all && $n_all != $n_update )
{
    print "The update list ($n_update elements) continues beyond the end of the live IDs ($n_all elements)." if $session->{noise} > 0;
    return 1;
}
elsif ( $i == $n_update && $n_all != $n_update )
{
    print "The update list ($n_update elements) does not contain all of the live IDs ($n_all elements)." if $session->{noise} > 0;
    return 1;
}
else
{
    print $ids_sorted[$i] . "\t" . $ids_all[$i] . "\n" if $session->{noise} > 1;
    print "Test failed at position $i.\n" if $session->{noise} > 0;
    return 1;
}

}

sub hpush { my ( $h, @ids ) = @_;

foreach my $id (@ids)
{
    $h->{$id} = 1;
}

return;

}

phluid61 commented 7 years ago

Thanks for running through this. I'm afraid my branch discipline has been lax, and I've mangled two in-progress changes into the development branch. On one hand is the attempt to realign with the eprintsug fork, and on the other is this ongoing development. All not helped by the fact that our running version of this code is in an unrelated repository, with unrelated (and currently incompatible) changes, so I've not actually had a chance to invoke it yet. :pensive:

The idea of hpush was just a hacky way to turn ids_update from a list to a set -- changing to bimonthly added some duplicates to the list, and this was the easiest/laziest way to solve that.

@a = ();
push @a, 1, 2, 3, 1;
@a; #=> (1, 2, 3, 1)

%h = ();
hpush %h, 1, 2, 3, 1;
keys %h; #=> (1, 2, 3)

I'll fix up the things you've pointed out, and try to get perlcritic running on my new development machine.

phluid61 commented 7 years ago

This has bubbled to the surface for us now, because we've been getting a lot of GENERAL_SYSTEM_ERROR responses from Scopus, so we've been retrying a lot of queries, so we've run out of quota.

I'll start working on this with a much higher priority, including looking into the second API. I'm also considering reworking the whole approach, distinguishing records-with-a-scopus-eid and those without, and running them in different ways on different schedules.

QUTlib / citation-import

Exceeded Scopus Search API quota #27