Open mpbraendle opened 7 years ago
In the past I believe we did use (or intended to use) one endpoint to find the Scopus eid, and another to look up the actual Citation Count. If that's a way to reduce the quota usage I think it's very worth re-investigating.
I'll have to dig into the history some more, and also look at exactly what changes happened recently when the APIs were upgraded (I'm pretty sure I made those changes, so it shouldn't be that hard to rediscover), and try to reconstruct the business requirements and decisions that lead to them.
We have around 76,000 live records in our repository, so it's quite likely we were about to butt into the 20k/week limit soon, too.
Thank you. I'm currently testing. For the bi-monthly option, also the perldoc documentation must be updated.
It is (https://github.com/QUTlib/citation-import/commit/dd5f8d75327e15c0dc5549e6a9897cf66ba0489c). There have been a few commits in the development branch; I only tagged the issue on one of them.
Your code gave several compilation errors: (two lines removed around Findbin, but instead perl header line modified)
Parentheses missing around "my" list at ./update_citationdata line 462. Parentheses missing around "my" list at ./update_citationdata line 509. Operator or semicolon missing before %ids_update at ./update_citationdata line 696. Ambiguous use of % resolved as operator % at ./update_citationdata line 696. Operator or semicolon missing before %ids_update at ./update_citationdata line 699. Ambiguous use of % resolved as operator % at ./update_citationdata line 699. Operator or semicolon missing before %ids_update at ./update_citationdata line 709. Ambiguous use of % resolved as operator % at ./update_citationdata line 709. Operator or semicolon missing before %ids_update at ./update_citationdata line 712. Ambiguous use of % resolved as operator % at ./update_citationdata line 712. Operator or semicolon missing before %ids_update at ./update_citationdata line 720. Ambiguous use of % resolved as operator % at ./update_citationdata line 720. Operator or semicolon missing before %ids_update at ./update_citationdata line 724. Ambiguous use of % resolved as operator % at ./update_citationdata line 724. Operator or semicolon missing before %ids_update at ./update_citationdata line 735. Ambiguous use of % resolved as operator % at ./update_citationdata line 735. Operator or semicolon missing before %ids_update at ./update_citationdata line 738. Ambiguous use of % resolved as operator % at ./update_citationdata line 738. Operator or semicolon missing before %ids_update at ./update_citationdata line 752. Ambiguous use of % resolved as operator % at ./update_citationdata line 752. Operator or semicolon missing before %ids_update at ./update_citationdata line 755. Ambiguous use of % resolved as operator % at ./update_citationdata line 755. Global symbol "$d2" requires explicit package name at ./update_citationdata line 462. Global symbol "$d2" requires explicit package name at ./update_citationdata line 469. Global symbol "$d2" requires explicit package name at ./update_citationdata line 476. Global symbol "$d2" requires explicit package name at ./update_citationdata line 490. Global symbol "$d2" requires explicit package name at ./update_citationdata line 509. Global symbol "$d2" requires explicit package name at ./update_citationdata line 516. Global symbol "$d2" requires explicit package name at ./update_citationdata line 523. Global symbol "$d2" requires explicit package name at ./update_citationdata line 537. syntax error at ./update_citationdata line 804, near "%h{ " Execution of ./update_citationdata aborted due to compilation errors.
I transferred your changes to my version of update_citationdata and removed the errors.
Furthermore, perlcritic gave the following warnings: I/O layer ":utf8" used at line 286, column 10. Use ":encoding(UTF-8)" to get strict validation. (Severity: 5) Two-argument "open" used at line 297, column 5. See page 207 of PBP. (Severity: 5) Subroutine prototypes used at line 804, column 1. See page 194 of PBP. (Severity: 5)
I made some adjustments:
instead of hpush %var, $var2 used hpush(%var, $var2)
new hpush subroutine
sub hpush { my (%h, $k) = @_; $h{$k} = 1; }
Code runs now with sudo -u apache ./update_citationdata archivename scopus --bi-monthly --test
but yields the following warnings/errors: Use of uninitialized value $k in hash element at ./update_citationdata line 815, line 55. Odd number of elements in hash assignment at ./update_citationdata line 814, line 55. . .
What was the idea behind hpush?
With get_eprintids_for_day etc. you get back an array. You pass that as reference to hpush, which is used as key in the hash?
I assume that you want to do like this:
sub test_eprint_selection_algorithm { my ( $session, $dataset, $test_year, $testmonth, $period ) = @;
# get a list of all eprint IDs in the dataset
my $searchexp = EPrints::Search->new(
session => $session,
dataset => $dataset,
custom_order => "eprintid",
);
my $list = $searchexp->perform_search;
my @ids_all = @{$list->ids( 0, $list->count )};
$list->dispose;
if ($bimonthly && $test_month % 2 == 0)
{
# Feb Apr Jun Aug Oct Dec -- start a month earlier
$test_month--;
}
# build a list of the eprints to be updated, and sort into ascending order
my $ids_update;
my $d = Date::Calc::Days_in_Month( $test_year, $test_month );
if ( $period eq "day" )
{
# get the eprint IDs for every day
for ( my $test_day = 1; $test_day <= $d; $test_day++ )
{
hpush( $ids_update, get_eprintids_for_day( $session, $dataset, $test_year, $test_month, $test_day ));
if ( $test_day > 1 )
{
hpush( $ids_update, get_eprintids_for_yesterday( $session, $dataset, $test_year, $test_month, $test_day ));
}
}
if ($bimonthly)
{
$test_month++;
$d = Date::Calc::Days_in_Month( $test_year, $test_month );
for ( my $test_day = 1 ; $test_day <= $d ; $test_day++ )
{
hpush( $ids_update, get_eprintids_for_day( $session, $dataset, $test_year, $test_month, $test_day));
if ($test_day > 1)
{
hpush( $ids_update, get_eprintids_for_yesterday( $session, $dataset, $test_year, $test_month, $test_day ));
}
}
}
# add the eprints that were added on the last day of the month
if ($test_month == 12)
{
hpush( $ids_update, get_eprintids_for_yesterday( $session, $dataset, $test_year + 1, 1, 1 ));
}
else
{
hpush( $ids_update, get_eprintids_for_yesterday( $session, $dataset, $test_year, $test_month + 1, 1 ));
}
}
else
{
# get the eprint IDs for every hour
for ( my $test_day = 1; $test_day <= $d; $test_day++ )
{
for ( my $test_hour = 0; $test_hour < 24; $test_hour++ )
{
hpush( $ids_update, get_eprintids_for_hour( $session, $dataset, $test_year, $test_month, $test_day, $test_hour ));
if ( $test_day > 1 || $test_hour > 0 )
{
hpush( $ids_update, get_eprintids_for_last_hour( $session, $dataset, $test_year, $test_month, $test_day, $test_hour ));
}
}
}
if ($bimonthly)
{
$test_month++;
$d = Date::Calc::Days_in_Month( $test_year, $test_month );
for ( my $test_day = 1 ; $test_day <= $d ; $test_day++ )
{
for ( my $test_hour = 0 ; $test_hour < 24 ; $test_hour++ )
{
hpush( $ids_update, get_eprintids_for_hour( $session, $dataset, $test_year, $test_month, $test_day, $test_hour ));
if ( $test_day > 1 || $test_hour > 0 )
{
hpush( $ids_update,
get_eprintids_for_last_hour( $session, $dataset, $test_year, $test_month, $test_day, $test_hour ));
}
}
}
}
}
my @ids_sorted = sort { $a <=> $b } keys %{$ids_update};
# compare the two lists
my $n_all = scalar @ids_all;
my $n_update = scalar @ids_sorted;
my $i = 0;
print "Update\tAll\n" if $session->{noise} > 1;
while ( $i < $n_all && $i < $n_update && $ids_sorted[$i] == $ids_all[$i] )
{
print $ids_sorted[$i] . "\t" . $ids_all[$i] ."\n" if $session->{noise} > 1;
$i++;
}
if ( $i == $n_all && $n_all == $n_update )
{
print "Test successful.\n" if $session->{noise} > 0;
return 0;
}
elsif ( $i == $n_all && $n_all != $n_update )
{
print "The update list ($n_update elements) continues beyond the end of the live IDs ($n_all elements)." if $session->{noise} > 0;
return 1;
}
elsif ( $i == $n_update && $n_all != $n_update )
{
print "The update list ($n_update elements) does not contain all of the live IDs ($n_all elements)." if $session->{noise} > 0;
return 1;
}
else
{
print $ids_sorted[$i] . "\t" . $ids_all[$i] . "\n" if $session->{noise} > 1;
print "Test failed at position $i.\n" if $session->{noise} > 0;
return 1;
}
}
sub hpush { my ( $h, @ids ) = @_;
foreach my $id (@ids)
{
$h->{$id} = 1;
}
return;
}
Thanks for running through this. I'm afraid my branch discipline has been lax, and I've mangled two in-progress changes into the development branch. On one hand is the attempt to realign with the eprintsug fork, and on the other is this ongoing development. All not helped by the fact that our running version of this code is in an unrelated repository, with unrelated (and currently incompatible) changes, so I've not actually had a chance to invoke it yet. :pensive:
The idea of hpush was just a hacky way to turn ids_update
from a list to a set -- changing to bimonthly added some duplicates to the list, and this was the easiest/laziest way to solve that.
@a = ();
push @a, 1, 2, 3, 1;
@a; #=> (1, 2, 3, 1)
%h = ();
hpush %h, 1, 2, 3, 1;
keys %h; #=> (1, 2, 3)
I'll fix up the things you've pointed out, and try to get perlcritic running on my new development machine.
This has bubbled to the surface for us now, because we've been getting a lot of GENERAL_SYSTEM_ERROR responses from Scopus, so we've been retrying a lot of queries, so we've run out of quota.
I'll start working on this with a much higher priority, including looking into the second API. I'm also considering reworking the whole approach, distinguishing records-with-a-scopus-eid and those without, and running them in different ways on different schedules.
For large repositories, there may be cases where the Scopus Search API quota is exceeded (HTTP Status code 429). It is 20'000 queries/7 days.
Suggested options: