gbhl / bhl-europe

Biodiversity Heritage Library Europe
http://www.bhl-europe.eu/

Ingest Content for Content Provider - DILIBRI #326

Open lobajuluwa opened 12 years ago

lobajuluwa commented 12 years ago

Task description: Align (DILIBRI) upload data/structure with ingest tool needs

Ingest (DILIBRI) data

Actions to take:

  1. Script to harmonise metadata (current mods is not valid)

chris-sleep commented 12 years ago

Assess vs FSG: for content in de-dilibri/333569

Metadata: oai_dc

Directory Structure: Monograph structure, ok

Filenaming: files named with [incremental sequence]_[second incremental sequence].tif

The first sequence looks to be the image sequence; the second is the same plus the directory no. No page numbering or type data is present.

02_333572.tif 
03_333573.tif  
05_333575.tif  
06_333576.tif  
08_333578.tif
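
The naming observation above can be checked mechanically. This is a minimal sketch, assuming every file follows the `NN_NNNNNN.tif` shape seen in the five samples; `parse_name` and `constant_offset` are hypothetical helper names, not part of any BHL-Europe tool.

```python
import re

# Pattern inferred from the five sample names only (an assumption).
NAME_RE = re.compile(r"^(\d+)_(\d+)\.tif$")


def parse_name(filename):
    """Split '02_333572.tif' into its two integer sequences."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    return int(match.group(1)), int(match.group(2))


def constant_offset(filenames):
    """Return second-minus-first if it is the same for every file, else None."""
    offsets = {second - first for first, second in map(parse_name, filenames)}
    return offsets.pop() if len(offsets) == 1 else None


samples = [
    "02_333572.tif",
    "03_333573.tif",
    "05_333575.tif",
    "06_333576.tif",
    "08_333578.tif",
]
print(constant_offset(samples))  # 333570 for the five samples above
```

For these samples the offset is constant, so the second number carries no information beyond the first; neither encodes a page number or page type.
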

chris-sleep commented 12 years ago

No reference to this CP in content management wiki page:

@wkollernhm is the metadata good to work with? (Mostly: which parameters are needed for the oai_dc SMT?)

@melitabirthaelmer as only 2 books have been uploaded so far, is there potential for filename updates at source?

wkollernhm commented 12 years ago

oai_dc is not good. Pure DC works perfectly fine but not wrapped into an OAI-PMH response.
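
If the CP cannot re-export, the OAI-PMH envelope could in principle be stripped on our side. The sketch below pulls each `oai_dc:dc` element out of a response and serialises it as a standalone record. The two namespace URIs are the standard ones; `extract_dc_records` is a hypothetical helper, and whether the SMT accepts the `oai_dc:dc` container as a root element is an assumption not confirmed in this thread.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def extract_dc_records(oai_xml):
    """Return every <oai_dc:dc> element of an OAI-PMH response as its own XML string."""
    # Keep the conventional prefixes in the serialised output.
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.fromstring(oai_xml)
    return [
        ET.tostring(record, encoding="unicode")
        for record in root.iter(f"{{{OAI_DC_NS}}}dc")
    ]
```

Each returned string declares its own namespaces, so it no longer depends on declarations made in the outer wrapper.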

Please ask CP to upload "pure" DC records!

chris-sleep commented 12 years ago

dilibri website has various metadata formats listed in archive, and page type/number data associated with served content. Have emailed dilibri technical contact to investigate how this can be uploaded to us.

chris-sleep commented 12 years ago

we can have mods or marc xml metadata.

for page level metadata we can also have mets, with structure blocks holding page number and type data
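
To give an idea of what reading such structure blocks could look like: the METS schema allows `ORDER`, `ORDERLABEL` and `TYPE` attributes on `div` elements inside a `structMap`. The sketch below lists them per page. That dilibri uses a `structMap` with `TYPE="PHYSICAL"` and fills these attributes is an assumption until a sample arrives; `page_divs` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

METS_NS = "http://www.loc.gov/METS/"


def page_divs(mets_xml, struct_type="PHYSICAL"):
    """Collect order, label and type for each ordered <mets:div> in the chosen structMap."""
    root = ET.fromstring(mets_xml)
    pages = []
    for struct_map in root.iter(f"{{{METS_NS}}}structMap"):
        if struct_map.get("TYPE") != struct_type:
            continue
        for div in struct_map.iter(f"{{{METS_NS}}}div"):
            order = div.get("ORDER")
            if order is None:
                continue  # container divs without a position are skipped
            pages.append(
                {
                    "order": int(order),
                    "label": div.get("ORDERLABEL"),
                    "type": div.get("TYPE"),
                }
            )
    return sorted(pages, key=lambda page: page["order"])
```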

wkollernhm commented 12 years ago

We need a single format for all levels. Maybe they can provide a sample for the METS data?

Whether the bibliographic information comes as mods or marc-xml does not matter - both are fine.

chris-sleep commented 12 years ago

metadata uploaded

mods.xml for each monograph

mets.xml for each monograph

testing PI to validate the mods on de-dilibri/333569

chris-sleep commented 12 years ago

@wkollernhm can you take a quick look please? When I invoke the PI, the output reports:

Executing Schema Mapper...
/mnt/nfs/dev/jdk1.6.0_24/bin/java -jar /var/www/schema-mapping-tool/cli/dist/smt-cli.jar -m c -cm 6 -if "/mnt/nfs/upload/providers/de-dilibri/333569/333569_mets.xml" -of "/mnt/nfs/upload/providers/de-dilibri/333569/.aip/olef.xml"

Return Code: 1 (Starting conversion of MODS to OLEF...)

however - the xml source should be 333569_mods.xml - is there any way to pass a name pattern to the schema mapping to match a specific .xml source? (using -m c -cm 6 -if -of )
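
Until that is answered, one workaround would be to pick the input file outside the PI and call the SMT directly. The sketch below only rebuilds the command line from the log above (same java path, jar path and `-m c -cm 6 -if -of` flags) with the `*_mods.xml` file as input. It does not assume the SMT or the PI has any pattern option of its own; `build_smt_command` is a hypothetical helper.

```python
from pathlib import Path

# Paths copied from the PI log output quoted above.
JAVA = "/mnt/nfs/dev/jdk1.6.0_24/bin/java"
SMT_JAR = "/var/www/schema-mapping-tool/cli/dist/smt-cli.jar"


def build_smt_command(item_dir, pattern="*_mods.xml", java=JAVA, jar=SMT_JAR):
    """Build the SMT call for the single file in item_dir that matches pattern."""
    item_dir = Path(item_dir)
    matches = sorted(item_dir.glob(pattern))
    if len(matches) != 1:
        raise ValueError(f"expected exactly one {pattern} in {item_dir}, found {len(matches)}")
    output = item_dir / ".aip" / "olef.xml"
    return [java, "-jar", jar, "-m", "c", "-cm", "6",
            "-if", str(matches[0]), "-of", str(output)]
```

The returned list can be passed to `subprocess.run(cmd)`; failing on zero or several matches avoids silently converting the wrong file, as happened with the mets.xml here.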

chris-sleep commented 12 years ago

uploaded mods and mets are not valid:

- missing <?xml .....> declaration and encoding data
- missing namespace prefixes for mods/mets, xsi and xlink
- tag mismatch (for sample 333569) - not closed

(note that smt will write olef from manually corrected mods.xml ok)

this is likely an artifact of the xml coming from the oai_dc wrapped source; that xml holds the missing declarations in the outer wrapper. There are 118 xml files, so the simplest option will be to rewrite the first couple of lines by script
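
A minimal sketch of such a script, working on the text level because the files cannot be parsed before the fix. It prepends an XML declaration if none is present and adds `xmlns:` declarations to the root tag for any of the four prefixes that are used but not declared. The namespace URIs are the standard ones; UTF-8 as the file encoding and the simple root-tag regex are assumptions, and `repair_header` is a hypothetical name. Unclosed tags, as in 333569, still need fixing by hand.

```python
import re

XML_DECL = '<?xml version="1.0" encoding="UTF-8"?>\n'  # UTF-8 is an assumption

NAMESPACES = {
    "mods": "http://www.loc.gov/mods/v3",
    "mets": "http://www.loc.gov/METS/",
    "xsi": "http://www.w3.org/2001/XMLSchema-instance",
    "xlink": "http://www.w3.org/1999/xlink",
}

# First start tag; skips '<?xml ...?>' and '<!-- -->' because they do not
# begin with a name character. Assumes no '>' inside root attribute values.
ROOT_TAG = re.compile(r"<([A-Za-z_][\w.-]*:)?[A-Za-z_][\w.-]*[^<>]*>")


def repair_header(text):
    """Add a missing XML declaration and missing xmlns:* attributes on the root tag."""
    body = text.lstrip("\ufeff \t\r\n")
    match = ROOT_TAG.search(body)
    if match is None:
        return text  # nothing that looks like a root element
    tag = match.group(0)
    missing = [
        f'xmlns:{prefix}="{uri}"'
        for prefix, uri in NAMESPACES.items()
        if f"{prefix}:" in body and f"xmlns:{prefix}=" not in tag
    ]
    if missing:
        end = "/>" if tag.endswith("/>") else ">"
        new_tag = tag[: -len(end)].rstrip() + " " + " ".join(missing) + end
        body = body[: match.start()] + new_tag + body[match.end():]
    if not body.startswith("<?xml"):
        body = XML_DECL + body
    return body
```

Applied per file, this would be `path.write_text(repair_header(path.read_text(encoding="utf-8")), encoding="utf-8")` in a loop over the uploaded xml files. Running it twice leaves an already repaired file unchanged.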

chris-sleep commented 12 years ago

@ZhengLIAtos - I've generated a .aip for de-dilibri/333569 for test purposes (handcrafted the mods for the SMT step), can you please run a test ingest on this and verify if the package would be good? @wkollernhm

zhengl commented 12 years ago

ingested. (not yet transformed by Access)

chris-sleep commented 12 years ago

@ZhengLIAtos Thanks - can I ask a question in ignorance: what does "not yet transformed by Access" mean?

(I'm guessing - does this mean that the base document has reached fedora, but the transformation is needed to push it to the portal?)

chris-sleep commented 12 years ago

@ZhengLIAtos I've rerun the ingest test (hopefully olef is now good) on bhl-test for the content de-dilibri//333569-test

this now looks to be in integration fedora as bhle:10706-a0wwpzkp but not yet indexed; can you please take a look and see if all looks good to you? (the ingest.log says status=completed, so is it just a matter of giving it time before checking the portal? fedora status is also marked active but I don't yet see the data in solr)