TheStanfordDaily / archives-web

Helper functions and web app for METS/ALTO archive viewing.
https://archives.stanforddaily.com

Method for searching for article overlays #3

Closed. hesyifei closed this issue 5 years ago

hesyifei commented 5 years ago

In the METS file, first traverse the "Physical Structure" structMap to get all pages (e.g. ALTO00001). Example file: https://s3.amazonaws.com/stanforddailyarchive/data.2013-nov/data/stanford/1999/12/01_01/Stanford_Daily_19991201_0001-METS.xml

<structMap LABEL="Physical Structure" TYPE="PHYSICAL">
        <div ID="DIVP1" LABEL="The Stanford Daily" TYPE="Newspaper" DMDID="MODSMD_PRINT MODSMD_ELEC">
            <div ID="DIVP2" ORDER="1" ORDERLABEL="1A" TYPE="COVER_PAGE">
                <fptr>
                    <par>
                        <area FILEID="IMG00001"/>
                        <area FILEID="ALTO00001" BETYPE="IDREF" BEGIN="P1"/>
                    </par>
                </fptr>
            </div>
...
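
A minimal Python sketch of this first step, for illustration only (the app itself is JavaScript). It assumes that page divs are the ones carrying an ORDER attribute and that ALTO file IDs start with "ALTO"; `local`, `physical_pages`, and `SAMPLE_METS` are hypothetical names, and the sample is a closed-off copy of the excerpt above.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the excerpt above (illustrative only).
SAMPLE_METS = """
<mets>
  <structMap LABEL="Physical Structure" TYPE="PHYSICAL">
    <div ID="DIVP1" LABEL="The Stanford Daily" TYPE="Newspaper">
      <div ID="DIVP2" ORDER="1" ORDERLABEL="1A" TYPE="COVER_PAGE">
        <fptr>
          <par>
            <area FILEID="IMG00001"/>
            <area FILEID="ALTO00001" BETYPE="IDREF" BEGIN="P1"/>
          </par>
        </fptr>
      </div>
    </div>
  </structMap>
</mets>
"""

def local(tag):
    """Strip an XML namespace: '{http://www.loc.gov/METS/}div' -> 'div'."""
    return tag.rsplit("}", 1)[-1]

def physical_pages(mets_root):
    """Return (ORDER, ORDERLABEL, ALTO FILEID) for each page div.

    Assumption: a page is any div inside the PHYSICAL structMap that
    has an ORDER attribute, and page divs are not nested in each other.
    """
    pages = []
    for struct_map in mets_root.iter():
        if local(struct_map.tag) != "structMap" or struct_map.get("TYPE") != "PHYSICAL":
            continue
        for div in struct_map.iter():
            if local(div.tag) != "div" or div.get("ORDER") is None:
                continue
            for area in div.iter():
                file_id = area.get("FILEID", "")
                if local(area.tag) == "area" and file_id.startswith("ALTO"):
                    pages.append((int(div.get("ORDER")), div.get("ORDERLABEL"), file_id))
    return pages

pages = physical_pages(ET.fromstring(SAMPLE_METS))
```

Matching on local tag names keeps the sketch working whether or not the real files declare a METS namespace.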

Then traverse the whole file to find all elements that reference a given page, e.g. [FILEID="ALTO00001"]. For each one, find the ancestor that has TYPE="ARTICLE" and add the element's BEGIN target to the overlays for that article.

e.g.

                    <div ID="DIVL10" TYPE="CONTENT">
                        <div ID="DIVL11" TYPE="ARTICLE" DMDID="MODSMD_ARTICLE1" LABEL="Earthquakes rock Stanford in '06, '89">
                            <div ID="DIVL12" TYPE="HEADING">
                                <div ID="DIVL13" TYPE="TITLE">
                                    <fptr>
                                        <area BETYPE="IDREF" FILEID="ALTO00001" BEGIN="P1_TB00006"/>
                                    </fptr>
                                </div>
                                <div ID="DIVL14" TYPE="AUTHOR">
                                    <fptr>
                                        <area BETYPE="IDREF" FILEID="ALTO00001" BEGIN="P1_TB00007"/>
                                    </fptr>
                                </div>
                            </div>

This adds P1_TB00006 and P1_TB00007 to the overlays for MODSMD_ARTICLE1.
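
The ancestor lookup described above could be sketched roughly as follows in Python (illustration only, not the project's code). ElementTree has no parent pointers, so the sketch builds a child-to-parent map first. Keying the result by DMDID is an assumption taken from the example; `article_overlays` and `SAMPLE_LOGICAL` are hypothetical names, and the sample is a closed-off copy of the excerpt.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Shortened, well-formed copy of the excerpt above (illustrative only).
SAMPLE_LOGICAL = """
<div ID="DIVL10" TYPE="CONTENT">
  <div ID="DIVL11" TYPE="ARTICLE" DMDID="MODSMD_ARTICLE1">
    <div ID="DIVL12" TYPE="HEADING">
      <div ID="DIVL13" TYPE="TITLE">
        <fptr><area BETYPE="IDREF" FILEID="ALTO00001" BEGIN="P1_TB00006"/></fptr>
      </div>
      <div ID="DIVL14" TYPE="AUTHOR">
        <fptr><area BETYPE="IDREF" FILEID="ALTO00001" BEGIN="P1_TB00007"/></fptr>
      </div>
    </div>
  </div>
</div>
"""

def local(tag):
    """Strip an XML namespace: '{ns}area' -> 'area'."""
    return tag.rsplit("}", 1)[-1]

def article_overlays(root, file_id):
    """Map each article's DMDID to the ALTO block IDs it uses on one page."""
    # ElementTree elements do not know their parent, so index them once.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    overlays = defaultdict(list)
    for area in root.iter():
        if local(area.tag) != "area" or area.get("FILEID") != file_id:
            continue
        # Walk upward until we reach a div with TYPE="ARTICLE" (or run out).
        node = parent_of.get(area)
        while node is not None and node.get("TYPE") != "ARTICLE":
            node = parent_of.get(node)
        if node is not None and area.get("BEGIN"):
            overlays[node.get("DMDID")].append(area.get("BEGIN"))
    return dict(overlays)

overlays = article_overlays(ET.fromstring(SAMPLE_LOGICAL), "ALTO00001")
print(overlays)  # {'MODSMD_ARTICLE1': ['P1_TB00006', 'P1_TB00007']}
```

Areas with no ARTICLE ancestor, such as the page-level entry in the physical structMap, are skipped. Supporting other types such as TYPE="ADVERTISEMENT" would only need a different check in the while loop.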

Then find the corresponding positions and sizes in the ALTO file (e.g. https://s3.amazonaws.com/stanforddailyarchive/data.2013-nov/data/stanford/1999/12/01_01/Stanford_Daily-ALTO/Stanford_Daily_19991201_0001_ALTO0001.xml). See https://github.com/TheStanfordDaily/archives-web/blob/8a48f383e4a239d6bec7dd98e77a175c0e2b02fb/src/classes/Page.js#L23-L43
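
For the ALTO lookup, a rough Python sketch (not the linked Page.js code) could read the standard ALTO attributes HPOS, VPOS, WIDTH, and HEIGHT from each element whose ID matches a block ID. `block_boxes` and `SAMPLE_ALTO` are hypothetical names, and the coordinates in the sample are invented for illustration, not taken from the real file.

```python
import xml.etree.ElementTree as ET

# Made-up ALTO fragment; the coordinate values are NOT from the real file.
SAMPLE_ALTO = """
<alto>
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="P1_TB00006" HPOS="100" VPOS="200" WIDTH="300" HEIGHT="40"/>
        <TextBlock ID="P1_TB00007" HPOS="100" VPOS="250" WIDTH="150" HEIGHT="20"/>
        <TextBlock ID="P1_TB00008" HPOS="100" VPOS="280" WIDTH="300" HEIGHT="500"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""

BOX_ATTRS = ("HPOS", "VPOS", "WIDTH", "HEIGHT")

def block_boxes(alto_root, block_ids):
    """Return {block ID: (x, y, width, height)} for the requested blocks.

    Values are in whatever unit the ALTO file declares in its
    MeasurementUnit element; no conversion is attempted here.
    """
    wanted = set(block_ids)
    boxes = {}
    for el in alto_root.iter():
        block_id = el.get("ID")
        values = [el.get(name) for name in BOX_ATTRS]
        # Skip elements we were not asked about or that lack geometry.
        if block_id in wanted and None not in values:
            boxes[block_id] = tuple(float(v) for v in values)
    return boxes

boxes = block_boxes(ET.fromstring(SAMPLE_ALTO), ["P1_TB00006", "P1_TB00007"])
```

Each returned tuple can then be scaled to the displayed image size to draw the overlay rectangle for the article.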

@epicfaace do you think there's an easier way? Also, do we need to highlight anything other than TYPE="ARTICLE" (e.g. TYPE="TITLE_SECTION" and TYPE="ADVERTISEMENT")?