gbhl / bhl-europe

Biodiversity Heritage Library Europe
http://www.bhl-europe.eu/

Ingest Content for Content Provider - BNF #336

Open lobajuluwa opened 12 years ago

lobajuluwa commented 12 years ago

Task description: Align (BNF) upload data/structure with ingest tool needs

subtask: Ingest (BNF) data

/mnt/nfs-demeter/upload/providers/fr-bnf/

522 folders in a folder called Consultation. This is the content to be ingested, since it is more complete and newer than Main!

/mnt/nfs-demeter/upload/providers/fr-bnf/Consultation/

522 folders in Main with identical PERIODIQUE (journal) items, but older than those in Consultation

/mnt/nfs-demeter/upload/providers/fr-bnf/Main/

670 folders in second-batch/BHL with MONOGRAPHIEs and PERIODIQUEs

/mnt/nfs-demeter/upload/providers/fr-bnf/second-batch/BHL/
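
The folder counts above can be re-checked before ingest with a short script. This is only a sketch: it assumes the three directories are reachable under the paths listed in this issue and that every immediate subdirectory is one item folder. The function names `count_item_folders` and `compare_folder_names` are made up for illustration and are not part of any ingest tool.

```python
from __future__ import annotations

from pathlib import Path

# Paths as listed in this issue; adjust if the mount point differs.
BNF_ROOT = Path("/mnt/nfs-demeter/upload/providers/fr-bnf")
BATCHES = ["Consultation", "Main", "second-batch/BHL"]


def count_item_folders(directory: Path) -> int:
    """Return the number of immediate subdirectories of `directory`."""
    return sum(1 for entry in directory.iterdir() if entry.is_dir())


def compare_folder_names(a: Path, b: Path) -> tuple[set[str], set[str]]:
    """Return the folder names found only in `a` and only in `b`."""
    names_a = {entry.name for entry in a.iterdir() if entry.is_dir()}
    names_b = {entry.name for entry in b.iterdir() if entry.is_dir()}
    return names_a - names_b, names_b - names_a


if __name__ == "__main__":
    for batch in BATCHES:
        path = BNF_ROOT / batch
        if path.is_dir():
            print(f"{batch}: {count_item_folders(path)} folders")
        else:
            print(f"{batch}: not found")
```

Calling `compare_folder_names(BNF_ROOT / "Consultation", BNF_ROOT / "Main")` would show whether the two sets of 522 folders really carry the same names. It compares names only, not content or age.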

Actions to take:

  1. ask CP: "which is the folder to be ingested: fr-bnf/Consultation/ or fr-bnf/Main/?"
  2. accumulate journals and monographs under the series they belong to if possible
    • X${IDENTIFIER}.XML for periodicals contains the series title in the field , volume-specific data in and
  3. copy relevant files and directories into a new root folder which will contain the harmonized structure
  4. Summary:

    • folder names & folder structure: not compliant with the FSG; currently has the following structure:
      • current structure
        • ./A - thumbnail jpgs
        • ./C - medium jpg
        • ./D - full res jpg
        • ./T - full res tiff !!!!
        • ./X - OCR result: text with positional information in xml (http://bibnum.bnf.fr/ns/alto_prod)
        • ./D${IDENTIFIER}.tif - title page an b/w duplex tiff
        • ./I${IDENTIFIER}.IDX - index file ??
        • ./O${IDENTIFIER}.OFF - binary file
        • ./X${IDENTIFIER}.XML - Metadata on bibliographic item and on OCR process
        • ./T${IDENTIFIER}.XML - table of contents
      • target structure: different structures required for monographs and journals
        • series level: TODO
        • volume level: TODO
        • item level: TODO
    • file names: TODO
      • InternalIdentifier: TODO
      • FileSequenceNumber: TODO
      • PageType: TODO
      • PrintedPageNumber: TODO
    • metadata available: OK (title, author, type, editor, publishing date, page level information like pagetype [I, N, T ???])
    • metadata in accepted format: X${IDENTIFIER}.XML - Metadata on bibliographic item and on OCR process OK
    • Bibliographic level: Journals & Monographs
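
As a starting point for the harmonization step, the current structure listed above could be mapped to roles with a small lookup. This is a rough sketch, not the ingest tool's logic. It assumes the one-letter prefixes and extensions are exactly as listed in the summary, that prefixes are upper case, and that everything after the prefix letter in a top-level file name is the identifier. The function name `classify` and the example identifier `0001234` are hypothetical.

```python
from __future__ import annotations

from pathlib import PurePath

# Roles of the subfolders inside one item folder, as listed in the summary.
SUBFOLDER_ROLES = {
    "A": "thumbnail JPEGs",
    "C": "medium JPEGs",
    "D": "full-resolution JPEGs",
    "T": "full-resolution TIFFs",
    "X": "OCR XML with positional information",
}

# Roles of the top-level files, keyed by (prefix letter, lower-cased extension).
FILE_ROLES = {
    ("D", ".tif"): "title page TIFF",
    ("I", ".idx"): "index file",
    ("O", ".off"): "binary file",
    ("X", ".xml"): "bibliographic and OCR metadata",
    ("T", ".xml"): "table of contents",
}


def classify(name: str, is_dir: bool) -> tuple[str, str | None]:
    """Return (role, identifier) for one entry of an item folder.

    The identifier is None for subfolders and for unrecognized entries.
    """
    if is_dir:
        return SUBFOLDER_ROLES.get(name, "unknown"), None
    path = PurePath(name)
    stem, suffix = path.stem, path.suffix.lower()
    if len(stem) < 2:
        return "unknown", None
    role = FILE_ROLES.get((stem[0], suffix))
    if role is None:
        return "unknown", None
    # Assumption: everything after the prefix letter is the identifier.
    return role, stem[1:]
```

For example, `classify("X0001234.XML", False)` yields the metadata role together with the identifier `0001234`, while anything outside the listed structure comes back as `unknown` so it can be reviewed by hand.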

akohlbecker commented 12 years ago

@chris-sleep I am now starting to review these uploads

akohlbecker commented 12 years ago

review done