calvez / xcoaitoolkit

Automatically exported from code.google.com/p/xcoaitoolkit

Harvest slow-down #3

Closed by GoogleCodeExporter 9 years ago

GoogleCodeExporter commented 9 years ago
This has likely been a bug since the first version of the OAI Toolkit, as
Eric experienced a similar problem with that version. See Peter's e-mail
below for a description:

From: Peter Kiraly [mailto:pkiraly@tesuji.eu] 
Sent: Wednesday, December 17, 2008 12:16 PM
To: Dibelius, Steven; Cook, Randall; Lindahl, David
Cc: Osisek, Eric
Subject: Re: checking on how things are going?

Hi all,

I have a dilemma. Today I started looking at the server side. I began
harvesting all the data and measured the response time from the log
information. I set up the server to serve at most 5000 records or 10 MB in
one response, and I did not set up gzipped responses in Tomcat.
I used the OCLC harvester2 for the harvesting.

From the results I created an Excel table and a chart, which I attached.
(Sorry, I don't know how to export the chart as an image, so this is a
screen capture, but I hope you can read the labels.)

The chart shows that almost every component of the response time is constant
(creating Java objects, transforming with XSLT, building the DOM), except the
read from Lucene. That value kept increasing up to a given point, which means
the response time grew longer and longer, and fewer and fewer records could
be harvested as time went by (up to that point).
Since then the response time has held steady at about 26 seconds.
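This is the classic cost profile of offset-style paging in Lucene. A minimal
sketch of the pattern, written against a modern Lucene API (class and method
names here are illustrative, not the Toolkit's actual code): to serve the
page starting at a given offset, the searcher must first collect and discard
every hit before it, so each successive page costs more than the last.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.TopDocs;

public class OffsetPaging {
    // Offset-style paging: to serve the page starting at `offset`, Lucene
    // must collect the `offset` hits that precede it and throw them away,
    // so each successive page is slower than the last; this produces the
    // rising "read from Lucene" curve described above.
    public static List<Document> readPage(IndexSearcher searcher,
                                          int offset, int pageSize) throws Exception {
        // Collect the top (offset + pageSize) hits, then skip the first `offset`.
        TopDocs top = searcher.search(new MatchAllDocsQuery(), offset + pageSize);
        List<Document> page = new ArrayList<>();
        for (int i = offset; i < top.scoreDocs.length; i++) {
            page.add(searcher.doc(top.scoreDocs[i].doc));
        }
        return page;
    }
}
```

With a fixed page size, the work per request grows linearly with the offset,
which matches the hour-by-hour slowdown in the numbers below.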

The number of harvested records:
in 1st hour: 1 326 000
in 2nd hour: 1 084 000
in 3rd hour:   706 000
in 4th hour:   614 000
in 5th hour:   586 000

From what I saw in the source, it is a Lucene issue: there are no offset and
limit clauses in the Lucene queries, so we have to skip to the Nth record
whenever we want to read it. I have some ideas for tricks to prevent this
behaviour, but I have not tested them, and if they fail I will have to read
listservs and documentation to find a solution.
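One standard trick for this, sketched here as an assumption rather than as
the Toolkit's actual fix: page with a cursor instead of an offset, so each
request resumes from the last hit of the previous one. Modern Lucene exposes
this directly as IndexSearcher.searchAfter (not available in the Lucene of
2008, where the equivalent would be a range query on a sequential record-id
field):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

public class CursorPaging {
    // Cursor-style paging: resume from the last hit of the previous page
    // instead of skipping everything before it, so every page costs about
    // the same no matter how deep into the result set the harvest is.
    public static void harvestAll(IndexSearcher searcher, int pageSize) throws Exception {
        ScoreDoc after = null;  // null means "start from the first hit"
        while (true) {
            TopDocs page = searcher.searchAfter(after, new MatchAllDocsQuery(), pageSize);
            if (page.scoreDocs.length == 0) {
                break;  // no more records to harvest
            }
            for (ScoreDoc hit : page.scoreDocs) {
                Document record = searcher.doc(hit.doc);
                // ... write the record into the OAI-PMH response ...
            }
            // Remember where this page ended; the next searchAfter resumes here.
            after = page.scoreDocs[page.scoreDocs.length - 1];
        }
    }
}
```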

The point is that if I start down this path, I can't estimate how long it
will take to find the solution, and there is a possibility that we would
have to postpone the release.

I don't think it is too critical, since the harvesting is still speedy:
about 10 000 records per minute, 600 000 per hour. But I am probably not
the best person to decide whether that is speedy enough.

So the question: do you think I should concentrate on the release, or should
I spend some hours looking for a quick fix for this behaviour? If I find
something, I would incorporate it; if I fail, we release as is, with a note
under "known bugs". (With this option I think we still have the opportunity
to release 0.4 this year.)

Péter

Original issue reported on code.google.com by shreyans...@gmail.com on 12 Mar 2009 at 7:22


GoogleCodeExporter commented 9 years ago
The bug has been fixed and released; the fix is incorporated in version 0.6
of the OAI Toolkit.

Original comment by sva...@library.rochester.edu on 26 May 2009 at 7:25