mark-dce closed this issue 8 years ago.
Using Fedora 4.4 should fix this. See https://github.com/fcrepo4/fcrepo4/issues/884
Import finishes: https://gist.github.com/mark-dce/00e71cdfa44797008df5
Blocked by the work Trey Pendragon is doing.
After working on this for two days, I believe I'm just about back to where we started (19s after 100 records in).
https://gist.github.com/jcoyne/d6cb696123d6cc19de3f
The only remaining thing to do would be to install the raptor library on the server so we could use librdfraptor.
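For reference, a minimal sketch of that setup, assuming the rdf-raptor gem is the binding we'd use (it needs the raptor C library, e.g. libraptor2, installed on the server):

```ruby
# Gemfile -- sketch only; assumes rdf-raptor is the intended binding for
# "librdfraptor". The gem depends on the raptor C library being present.
gem 'rdf-raptor'
```

After that, a `require 'rdf/raptor'` in the app should register the Raptor-backed readers/writers with RDF.rb.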
Work I did related to this ticket:
https://github.com/projecthydra-labs/hydra-pcdm/pull/194
https://github.com/projecthydra/active_fedora/pull/925
https://github.com/projecthydra/active_fedora/pull/926
https://github.com/projecthydra/active_fedora/pull/930
https://github.com/projecthydra-labs/activefedora-aggregation/pull/96
https://github.com/projecthydra-labs/activefedora-aggregation/pull/98
https://github.com/projecthydra-labs/activefedora-aggregation/pull/100
https://github.com/projecthydra-labs/activefedora-aggregation/pull/101
0f1149c6d405073630abbb64cc27fe0da1f9fb20
Now we're at 8-9s/item after 70 items imported, 32s/item after 200. Is this workable?
I think we need to be under 60 seconds per item, and ideally under 30, after 1500 items.
Let's set a goal of ingesting the 1663 pages of Von Geschlecht zu Geschlecht (lew1871.0001.002.xml) in under 12 hours (roughly 26 seconds per page), but definitely no more than 24.
That's not going to happen at the current (exponential) growth rate. We'd need to make changes to Fedora and probably use a different graph parser.
OK, what is the minimum amount of time that will be required? The SOW says we will be able to import this collection, so we need to set expectations about how long that will take, especially in this case.
(As an aside, have we turned off ordering the file attachments?)
If you turn off the ordering, I don't think it will display correctly because curation_concerns expects ordered file_sets.
Can we turn off ordering for the import and then assert (build) the ordering at the end of the import, instead of doing it one item at a time?
I'll give it a try.
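Roughly the shape I'm going to try, as a sketch only (attach_file_set is an illustrative helper, not a real API; this assumes Hydra::Works-style #ordered_members):

```ruby
# During the import, create and attach the file sets without touching the order.
file_sets = pages.map { |page| attach_file_set(work, page) }

# Then build the ordered list once: append all the ordering proxies in memory
# and persist them with a single save, instead of saving the proxy chain after
# every page.
file_sets.each { |fs| work.ordered_members << fs }
work.save
```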
Looking at the log from last week, it looks like we got the whole thing imported in about 11 hours (~66 seconds per page for the final pages). Is it unfair of me to hope that we can at least continue to achieve that?
From 2015-10-20-194400-import.log:
20:10:00 Parsing lew1871.0001.002.xml
20:10:01 attaching file: lew1871.0001.002.xml
20:10:03 attaching file: gesc_01_0001_unm.tif
20:10:05 attaching file: gesc_01_0002_ttl.tif
20:10:08 attaching file: gesc_01_0003_unm.tif
20:10:11 attaching file: gesc_01_0004_unm.tif
[...]
07:06:19 attaching file: gesc_04_1660_364.tif
07:07:24 attaching file: gesc_04_1661_365.tif
07:08:30 attaching file: gesc_04_1662_366.tif
07:09:35 attaching file: gesc_04_1663_bln.tif
We ingested all of the gesc_*.tif files yesterday. Saving all the files is pretty fast; saving the order takes 6 hours.
YAY! That's well under my goal of 12 hours. I could live with that easily; it is a huge book.
ASIDE: Though in the long run it would obviously be cool for ingesting the actual TIFFs to be the expensive thing, not setting their order :)
If all of this is about giving curation_concerns an order for the file list, is there any way to refactor curation_concerns to just use the create-timestamp sort order? There's no way in the UI for me to change the order right now, so isn't that effectively the same thing?
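Something along these lines is what I'm picturing, purely as a sketch (assuming each member FileSet exposes ActiveFedora's create_date):

```ruby
# Illustrative only: present members in creation order instead of reading the
# ordered proxy list. members_in_creation_order is a hypothetical helper.
def members_in_creation_order(work)
  work.members.sort_by(&:create_date)
end
```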
Can a PCDM file be attached to multiple PCDM works? If not, you really could just store the order on the file (or just serialize an order array on the parent...)
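A rough sketch of the "serialize an order array on the parent" idea; the member_order property and its predicate are hypothetical, chosen just for illustration:

```ruby
require 'json'

# Serializing the order to a single JSON string sidesteps the fact that
# repeated RDF property values are unordered.
class Work < ActiveFedora::Base
  property :member_order, predicate: ::RDF::URI('http://example.org/ns#memberOrder'),
           multiple: false
end

# At the end of the import, record the order once:
work.member_order = file_sets.map(&:id).to_json
work.save

# At display time, sort the members by the stored order:
order = JSON.parse(work.member_order || '[]')
ordered = work.members.sort_by { |m| order.index(m.id) || order.length }
```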
It seems like the importer performance is vastly improved at this point, and we can import text and video works in a workable amount of time. I'm closing this issue and will open a new ticket under a new SOW if there are specific import targets that need to be met.
ISSUE: The time to attach a file to a work appears to be proportional to the number of nodes in the repo.
STEP 1: Ran an import for one of the Lewald novels. The initial page attach time was 4-6 seconds, growing to 12-15 seconds by the time 70+ pages were loaded. https://gist.github.com/mark-dce/1502670590a1f0cd8b07
STEP 2: Deleted the work created by that import. This left about 200 ACL nodes in the repo.
STEP 3: Re-imported the same work. Initial page times were 6-8 seconds, growing to 22-25 seconds by the time around 200 pages were loaded. https://gist.github.com/mark-dce/bdfbffaa8996d233f363