curationexperts / goldenseal

WUSTL hydra repo
http://www.curationexperts.com

Improve importer performance #196

Closed mark-dce closed 8 years ago

mark-dce commented 9 years ago

ISSUE The time to attach a file to a work appears to be proportional to the number of nodes in the repo.

STEP 1 Ran an import for one of the Lewald novels. Initial page attach time was 4-6 seconds, growing to 12-15 seconds by the time 70+ pages were loaded. https://gist.github.com/mark-dce/1502670590a1f0cd8b07

STEP 2 Deleted the work created by that import. This left about 200 ACL nodes in the repo.

STEP 3 Re-imported the same work. Initial page times were 6-8 seconds, growing to 22-25 seconds by the time around 200 pages were loaded. https://gist.github.com/mark-dce/bdfbffaa8996d233f363

jcoyne commented 9 years ago

Using Fedora 4.4 should fix this. See https://github.com/fcrepo4/fcrepo4/issues/884

mark-dce commented 9 years ago

Import finishes: https://gist.github.com/mark-dce/00e71cdfa44797008df5

jcoyne commented 9 years ago

Blocked by the work Trey Pendragon is doing.

jcoyne commented 9 years ago

After working on this for two days, I believe I'm just about back to where we started (19s per record after 100 records in).

https://gist.github.com/jcoyne/d6cb696123d6cc19de3f

The only remaining thing to do would be to install the raptor library on the server and we could use librdfraptor.

jcoyne commented 9 years ago

Work I did related to this ticket:

https://github.com/projecthydra-labs/hydra-pcdm/pull/194
https://github.com/projecthydra/active_fedora/pull/925
https://github.com/projecthydra/active_fedora/pull/926
https://github.com/projecthydra/active_fedora/pull/930
https://github.com/projecthydra-labs/activefedora-aggregation/pull/96
https://github.com/projecthydra-labs/activefedora-aggregation/pull/98
https://github.com/projecthydra-labs/activefedora-aggregation/pull/100
https://github.com/projecthydra-labs/activefedora-aggregation/pull/101
0f1149c6d405073630abbb64cc27fe0da1f9fb20

jcoyne commented 9 years ago

Now we're at 8-9s/item after 70 items imported, 32s/item after 200. Is this workable?

mark-dce commented 9 years ago

I think we need to be under 60, and ideally under 30 after 1500 items.

Let's set a goal of ingesting the 1663 pages for Von Geschlecht zu Geschlecht (lew1871.0001.002.xml) in under 12 hours, but definitely no more than 24.
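For reference, a quick back-of-the-envelope sketch of what those budgets mean per page. It only uses the page count and the 12/24 hour limits quoted above; the function name is made up for illustration.

```python
PAGES = 1663  # page count for lew1871.0001.002.xml, as quoted above


def average_seconds_per_page(hours: float, pages: int = PAGES) -> float:
    """Average per-page time that exactly uses up a wall-clock budget."""
    return hours * 3600 / pages


for hours in (12, 24):
    avg = average_seconds_per_page(hours)
    print(f"{hours} h budget -> {avg:.1f} s/page on average")
```

Since the per-page time grows as the import proceeds, the last pages can take longer than this average as long as the early pages are faster.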

jcoyne commented 9 years ago

Not going to happen at the current growth rate (exponential). We'd need to make changes to Fedora and probably use a different graph parser.

mark-dce commented 9 years ago

Ok, what is the minimum amount of time that will be required? The SOW says we will be able to import this collection, so we need to set expectations about how long that will take, especially in this case.

(As an aside, have we turned off ordering the file attachments?)

jcoyne commented 9 years ago

If you turn off the ordering, I don't think it will display correctly because curation_concerns expects ordered file_sets.

mark-dce commented 9 years ago

Can we turn off ordering for the import and then assert (build) the ordering at the end of the import - instead of doing it each item one at a time?
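To illustrate why that might help, here is a toy model, not the actual hydra-pcdm or ActiveFedora code. It assumes (and this is only an assumption) that saving the order after each attach costs work proportional to the number of items already in the list, and compares that with saving the order once at the end. The function names are hypothetical.

```python
def incremental_cost(n_items: int) -> int:
    """Entries written if the whole ordering is re-saved after every attach."""
    order = []
    written = 0
    for item in range(n_items):
        order.append(item)
        written += len(order)  # assumption: each save rewrites the full list
    return written


def deferred_cost(n_items: int) -> int:
    """Entries written if items are attached unordered and the order is saved once."""
    order = list(range(n_items))
    return len(order)  # a single save of the complete list


n = 1663  # page count of the large work discussed in this thread
print("per-item ordering:", incremental_cost(n))
print("deferred ordering:", deferred_cost(n))
```

Under that assumption the per-item approach does n(n+1)/2 units of work while the deferred approach does n, which is the kind of gap that would show up as per-page times growing with the size of the work.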

jcoyne commented 9 years ago

I'll give it a try.

mark-dce commented 9 years ago

Looking at the log from last week, it looks like we got the whole thing imported in about 11 hours (~66 seconds per page for the final pages). Is it unfair of me to hope that we can at least continue to achieve that?

from 2015-10-20-194400-import.log

20:10:00 Parsing lew1871.0001.002.xml
    20:10:01 attaching file: lew1871.0001.002.xml
    20:10:03 attaching file: gesc_01_0001_unm.tif
    20:10:05 attaching file: gesc_01_0002_ttl.tif
    20:10:08 attaching file: gesc_01_0003_unm.tif
    20:10:11 attaching file: gesc_01_0004_unm.tif

    07:06:19 attaching file: gesc_04_1660_364.tif
    07:07:24 attaching file: gesc_04_1661_365.tif
    07:08:30 attaching file: gesc_04_1662_366.tif
    07:09:35 attaching file: gesc_04_1663_bln.tif
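
A minimal sketch of how the per-page times can be pulled out of a log in this format. It assumes every relevant line looks like `HH:MM:SS attaching file: NAME`, that the timestamps carry no date (so a clock that goes backwards means midnight was crossed), and that the gap between two consecutive lines approximates the time spent attaching the earlier file. The function names are made up for illustration.

```python
import re

LINE = re.compile(r"^\s*(\d{2}):(\d{2}):(\d{2}) attaching file: (\S+)")


def parse(lines):
    """Return (elapsed_seconds, filename) pairs, adding a day when the clock wraps."""
    entries, day_offset, previous = [], 0, None
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # skip lines that are not "attaching file" entries
        hours, minutes, seconds, name = match.groups()
        clock = int(hours) * 3600 + int(minutes) * 60 + int(seconds)
        if previous is not None and clock < previous:
            day_offset += 86400  # timestamps have no date: assume midnight rollover
        previous = clock
        entries.append((clock + day_offset, name))
    return entries


def gaps(entries):
    """Seconds between consecutive entries."""
    return [later[0] - earlier[0] for earlier, later in zip(entries, entries[1:])]


HEAD = [
    "20:10:01 attaching file: lew1871.0001.002.xml",
    "20:10:03 attaching file: gesc_01_0001_unm.tif",
    "20:10:05 attaching file: gesc_01_0002_ttl.tif",
    "20:10:08 attaching file: gesc_01_0003_unm.tif",
    "20:10:11 attaching file: gesc_01_0004_unm.tif",
]
TAIL = [
    "07:06:19 attaching file: gesc_04_1660_364.tif",
    "07:07:24 attaching file: gesc_04_1661_365.tif",
    "07:08:30 attaching file: gesc_04_1662_366.tif",
    "07:09:35 attaching file: gesc_04_1663_bln.tif",
]

print("first pages:", gaps(parse(HEAD)))
print("final pages:", gaps(parse(TAIL)))
entries = parse(HEAD + TAIL)
print("total hours:", round((entries[-1][0] - entries[0][0]) / 3600, 2))
```

The head and tail are parsed separately for the gaps because the excerpt skips the middle of the log; the total only relies on the first and last timestamps.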

jcoyne commented 9 years ago

We ingested all the gesc_*.tif yesterday. Saving all the files is pretty fast. Saving the order takes 6hrs.

mark-dce commented 9 years ago

YAY! That's well under my goal of 12 hours - I could live with that easily - it is a huge book.

ASIDE Though long-run it would obviously be cool for ingesting the actual tiffs to be the expensive thing, not setting their order :)

If all of this is about giving curation_concerns an order for the file list, is there any way to refactor curation_concerns to just use the create timestamp sort order? There's no way in the UI for me to change the order right now, so isn't that the same thing?

Can a PCDM file be attached to multiple PCDM works? If not, you really could just store the order on the file (or just serialize an order array on the parent...)

jcoyne commented 9 years ago

More improvements: https://github.com/ActiveTriples/ActiveTriples/pull/169 https://github.com/ruby-rdf/rdf/pull/229 https://github.com/ruby-rdf/rdf/pull/230

mark-dce commented 8 years ago

Seems like the importer performance is vastly improved at this point and we can import text and video works in a workable amount of time. Closing this issue and will open a new ticket under a new SOW if there are specific import targets that need to be met.