PerseusDL / canonical

This will be the base repo for all text and annotation data published in the PDL

2006.05.0178.xml Richmond Dispatch #36

Open lcerrato opened 10 years ago

lcerrato commented 10 years ago

0000982: missing section of Richmond Dispatch

I'm not sure why this is happening, but sections of the Richmond Dispatch (at least in the one edition I am viewing) are not appearing online. A user searched on the content and Google found it, but you can't actually see it outside of the XML file. This may be intentional, but it seems strange.

The user searched on "Robert A. S. Pittman , of the ship James Guthrie and Miss Ada V. Saunders", which Google returns as the Perseus XML document: http://www.perseus.tufts.edu/hopper/dltext?doc=Perseus%3Atext%3A2006.05.0178

In the XML, this reference appears in , but it does not appear in the online version. (Searches in this doc for Pittman, Saunders, etc. turn up empty.)

Everything from sentence 368 to sentence 395 is not visible online.

I would guess this happens with other editions of the Richmond Dispatch, but I'm not sure how to pinpoint the problem.

This is happening because the text is chunked by article, and the text in question is not included in an enclosing article chunk.

I'm not sure how widely this applies to the Richmond collection, but in this particular file the article chunks are at the div3 level, alongside ad-blank elements. Anything that sits directly in the higher div2 or div1 elements is omitted from the display. The div1 level includes types such as "page-image", "subscription", "notices", and "news". The div2 elements include various subcategories under each of those.
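As a rough way to see what falls outside the div3 chunks, a sketch like the following could list child elements of div1 and div2 that carry text but are not themselves nested divs. It assumes un-namespaced element names as in this file; the function name and the sample markup (including the "article" type value) are made up for illustration and are not taken from the Perseus code or data.

```python
import xml.etree.ElementTree as ET


def text_outside_div3(root):
    """List text-bearing children of div1/div2 that are not nested divs.

    Hypothetical helper: it only inspects child elements, not bare text
    placed directly inside a div1 or div2.
    """
    found = []
    parents = list(root.iter("div1")) + list(root.iter("div2"))
    for parent in parents:
        for child in parent:
            if child.tag in ("div2", "div3"):
                continue  # nested divs are handled at their own level
            text = "".join(child.itertext()).strip()
            if text:
                found.append((parent.tag, parent.get("type"), child.tag, text))
    return found


# Invented sample; real files are far larger and may differ in detail.
OUTSIDE_SAMPLE = """<text><body>
<div1 type="notices"><head>Notices</head>
  <div2 type="wants"><p>Loose paragraph.</p>
    <div3 type="article"><p>Chunked text.</p></div3>
  </div2>
</div1>
</body></text>"""

for row in text_outside_div3(ET.fromstring(OUTSIDE_SAMPLE)):
    print(row)
```

Running something like this over the whole collection would give a first estimate of how much text sits outside any div3.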

Articles are in any of the following paths: /div1[@type='news' or @type='notices']/div2[@type='morning' or @type='evening' or @type='local' or @type='negroe' or @type='wants' or @type='servants' or @type='announcements' or @type='telegraphic' or @type='negro' or @type='slaves']

The undisplayed text above is in the following path: /div1[@type='notices']/div2[@type='advertisements']/div3[@type='ad-blank']
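To pinpoint other affected files, one option is to flag every div3 whose enclosing div1/div2 types are not in the article paths listed above. This is only a sketch: the type lists are copied from the XPath in this comment, while the function name and the sample markup (including the "article" type value) are assumptions for illustration, not the actual chunking code.

```python
import xml.etree.ElementTree as ET

# Type values copied from the article paths quoted above.
ARTICLE_DIV1_TYPES = {"news", "notices"}
ARTICLE_DIV2_TYPES = {
    "morning", "evening", "local", "negroe", "wants", "servants",
    "announcements", "telegraphic", "negro", "slaves",
}


def find_unchunked_div3(root):
    """Return (div1 type, div2 type, div3 type) for div3s outside the article paths."""
    missed = []
    for div1 in root.iter("div1"):
        for div2 in div1.findall("div2"):
            in_article_path = (
                div1.get("type") in ARTICLE_DIV1_TYPES
                and div2.get("type") in ARTICLE_DIV2_TYPES
            )
            if in_article_path:
                continue
            for div3 in div2.findall("div3"):
                missed.append((div1.get("type"), div2.get("type"), div3.get("type")))
    return missed


# Invented sample that mirrors the path of the undisplayed text.
CHUNK_SAMPLE = """<text><body>
<div1 type="news"><div2 type="local">
  <div3 type="article"><p>Shown.</p></div3>
</div2></div1>
<div1 type="notices"><div2 type="advertisements">
  <div3 type="ad-blank"><p>Hidden.</p></div3>
</div2></div1>
</body></text>"""

print(find_unchunked_div3(ET.fromstring(CHUNK_SAMPLE)))
```

For a real file, ET.parse(path).getroot() would replace the inline sample, assuming the file parses without its DTD.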

If the entire text is meant to be displayed, then we would need to fix the chunking scheme and possibly clean up the data as well.