MarcosSueiro / nypr-archives-ingest-scripts

A series of xslt stylesheets that transform xml output from exiftool or BWF metaedit, merge data from several sources, and generate xml and html files for ingest into NYPR Archives' various systems

Grab additional contributors from CMS #117

Open MarcosSueiro opened 1 year ago

MarcosSueiro commented 1 year ago

Not all contributors are listed as "appearances" in the Publisher API (noted by Dylan --thank you!).

A. Articles and episodes seem to list additional contributors and tags in a <PageMap> section of the page's html, which is commented out. Example: https://www.wnyc.org/story/trump-indicted-again/ (view the page source to see the <PageMap> section)

B. Other stories (such as segments) may not present additional contributors at all, and the link between story and contributor can only be seen in internal.wnyc.org (it does not seem to show in csv exports from the CMS either). However, contributors (guests) are often in bold in the body; they may also appear as tags such as john-doe or john_doe. Example: http://www.wnycstudios.org/story/109322-exiled-president-baby-doc-returns-to-haiti/ (see also https://internal.wnyc.org/admin/cms/segment/109322/).

Steps in case (A):

  1. Load an html page using its slug as unparsed-text
  2. Parse out the <PageMap> section (commented out)
  3. Convert this section to xml
  4. Parse out contributors by selecting <DataObject type="person">
  5. Use the slug listed in the child <Attribute name="id">
  6. Use the Publisher API to obtain the additional contributor info
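
Steps 1 to 5 above could be sketched roughly as follows. This is Python rather than the repo's XSLT, purely to illustrate the logic. The function names, the regular expression and the sample html are my own assumptions, and the real page source may differ (for example, if <PageMap> ever carries attributes). Step 6 is left out because the Publisher API call is not described here.

```python
import re
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: the PageMap sits inside a single html comment, as in the example page.
PAGEMAP_COMMENT = re.compile(r"<!--\s*(<PageMap>.*?</PageMap>)\s*-->", re.DOTALL)


def fetch_html(url: str) -> str:
    """Step 1 (hypothetical helper, not called below): load the page as plain text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_person_slugs(html: str) -> list[str]:
    """Steps 2 to 5: return the slugs of all person DataObjects in the PageMap."""
    match = PAGEMAP_COMMENT.search(html)
    if match is None:
        return []  # no commented-out PageMap on this page
    page_map = ET.fromstring(match.group(1))  # step 3: treat the section as xml
    slugs = []
    for data_object in page_map.findall("DataObject[@type='person']"):  # step 4
        for attribute in data_object.findall("Attribute[@name='id']"):  # step 5
            slug = (attribute.text or "").strip()
            if slug:
                slugs.append(slug)
    return slugs


# Trimmed-down sample modelled on the PageMap quoted in this issue.
SAMPLE_HTML = """<html><head>
<!--
    <PageMap>
      <DataObject type="tag">
        <Attribute name="id">politics</Attribute>
      </DataObject>
      <DataObject type="person">
        <Attribute name="id">david-freedlander</Attribute>
      </DataObject>
    </PageMap>
    -->
</head><body></body></html>"""

print(extract_person_slugs(SAMPLE_HTML))
```

The slugs returned here would then be the input for the Publisher API lookup in step 6.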

Potential steps in case (B):

  1. Parse out the names in bold.
  2. Parse out twitter-handle links, e.g. <a href="https://twitter.com/walterolson">Walter Olson</a>
  3. Use something like ParseContributors.xsl to extract potential Firstname Lastname combinations (could be slow).
  4. Check if tags such as "john-doe" or "john_doe" exist as https://www.wnyc.org/people/john-doe (not sure what this might prove)
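
A rough sketch of steps 1, 2 and 4 above, again in Python for illustration only. The class and function names are hypothetical, "Jane Doe" is a made-up guest, and it assumes bold text is marked with <b> or <strong> in the story body. Step 3 (guessing Firstname Lastname combinations) is not covered, and step 4 only builds the candidate URL; it does not check whether the page exists.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ContributorHints(HTMLParser):
    """Collect bold text and twitter.com links from a story body."""

    def __init__(self):
        super().__init__()
        self.bold_names = []      # text found inside <b> or <strong>
        self.twitter_links = []   # (handle, link text) pairs
        self._bold_depth = 0
        self._bold_text = []
        self._handle = None
        self._link_text = []

    def handle_starttag(self, tag, attrs):
        if tag in ("b", "strong"):
            self._bold_depth += 1
        elif tag == "a":
            parsed = urlparse(dict(attrs).get("href") or "")
            handle = parsed.path.strip("/")
            if parsed.netloc.lower() in ("twitter.com", "www.twitter.com") and handle:
                self._handle = handle
                self._link_text = []

    def handle_data(self, data):
        if self._bold_depth:
            self._bold_text.append(data)
        if self._handle is not None:
            self._link_text.append(data)

    def handle_endtag(self, tag):
        if tag in ("b", "strong") and self._bold_depth:
            self._bold_depth -= 1
            if self._bold_depth == 0:
                text = " ".join("".join(self._bold_text).split())
                if text:
                    self.bold_names.append(text)
                self._bold_text = []
        elif tag == "a" and self._handle is not None:
            text = " ".join("".join(self._link_text).split())
            self.twitter_links.append((self._handle, text))
            self._handle = None


def people_url_candidate(tag: str) -> str:
    """Step 4: turn a tag such as john_doe into a candidate people URL."""
    slug = tag.strip("_").replace("_", "-").lower()
    return f"https://www.wnyc.org/people/{slug}"


# Hypothetical body fragment; only the twitter link comes from this issue.
SAMPLE_BODY = (
    "<p><strong>Jane Doe</strong> talks with "
    '<a href="https://twitter.com/walterolson">Walter Olson</a>.</p>'
)
hints = ContributorHints()
hints.feed(SAMPLE_BODY)
hints.close()
print(hints.bold_names)
print(hints.twitter_links)
print(people_url_candidate("john_doe"))
```

Bold text will also catch things that are not names (show titles, emphasis), so the output is best treated as a list of candidates to review, not as confirmed contributors.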

This is in the source code of https://www.wnyc.org/story/trump-indicted-again/:

<!--
    <PageMap>
      <DataObject type="date">
        <Attribute name="display" value="Jun 09, 2023"/>
        <Attribute name="sort" value="20230609"/>
      </DataObject>
        <DataObject type="tag">
          <Attribute name="id">gop</Attribute>
        </DataObject>
        <DataObject type="tag">
          <Attribute name="id">maga</Attribute>
        </DataObject>
        <DataObject type="tag">
          <Attribute name="id">new_york_republican_party</Attribute>
        </DataObject>
        <DataObject type="tag">
          <Attribute name="id">politics</Attribute>
        </DataObject>
        <DataObject type="tag">
          <Attribute name="id">trump</Attribute>
        </DataObject>
        <DataObject type="person">
          <Attribute name="id">david-freedlander</Attribute>
        </DataObject>
    </PageMap>
    -->