RJP43 / CitySlaveGirls

The Restoration of Nell Nelson
http://nelson.newtfire.org

Data for *Possible* Mapping #60

Open RJP43 opened 8 years ago

RJP43 commented 8 years ago

@ghbondar

Here are the links to the two files that contain all of our place references within 1888-Chicago.

This one gives you companies and their addresses: http://dxcvm05.psc.edu:8080/exist/rest/db/Nelson/NellCompaniesAddresses.tsv

This one is just the places that we call "local references": http://dxcvm05.psc.edu:8080/exist/rest/db/Nelson/NellLocalPlaces.text

This issue would probably be the best place to post any results since everyone on the team has access.

Thank you so much!

RJP43 commented 8 years ago

UGH! The data is correct as far as being able to start finding places, and all of the companies are associated correctly. The bug is that some companies are not appearing, and I am not really sure why. It might be a Site Index issue or it could be XML related, but I have spent so much time this weekend on Nelson that I have to walk away and get other work done. If @spadafour or @ebeshero want to look at the queries that produced the two data sets, you can find them at these paths in eXide:

/db/rParker/nelsonLocRefs --- gets the list of all the local references, and I think it is working to get all of them

/db/rParker/nelsonPlacesAndCompanies --- not giving us all of the companies in the Site Index, which should all be linked to <orgName type="exposedCompany"> in the articles by an @ref pointing to the xml:id (for example, the two generically named tailor shops, which have addresses and ids in the site index and articles, aren't showing up). Another issue: the output includes two companies that don't have addresses, and their company names aren't appearing.

ebeshero commented 8 years ago

@RJP43 I'll take a look right now and see if I can debug your nelsonPlacesAndCompanies query...more in a few minutes

ebeshero commented 8 years ago

@RJP43 I've concluded that the two entries coming up as "No Address" must be non-hits in your site index. That is, I think you have two values of $i in your for loop running over the distinct values of @ref attributes of orgNames in your articles that are not correctly matching @xml:id values in your site index. Here's why I'm about 99% sure this is so:

1) I ran a separate XQuery over just your site index to make sure I could pull up orgNames with descendant elements and get their full text content all the way down the tree with string(). I certainly could. I also tested to see if I could do that when those same orgNames had nothing encoded in <placeName>. I certainly could.

2) I expanded your $compName variable to cover every eventuality in your site index that I could think of, like so:

for $i in $distCompsArt
let $expCompSI := $allOrgsSI[@xml:id = $i](:orgs in site index that match exposed companies in articles:)
let $compName := 
if ($expCompSI[descendant::tei:orgName[1]/text()]) then $expCompSI//tei:orgName[1]/string()
else if ($expCompSI//*[./text()]) then $expCompSI//*[./text()][1]/string()
else $expCompSI/@xml:id/string()

The last else is the key to my conclusion: it should pull just the @xml:id if there is a match. There apparently isn't a match, so in these cases the value of $expCompSI must be empty!

I tested that one more time, by just setting this:

let $compName := $expCompSI/@xml:id/string()

and I returned much the same output: just the xml:ids in place of the company names, and the last two entries missing with "No Address".

I conclude that the problem is in the markup of the articles. Do you have a Schematron checking each of the articles to make sure their @ref attributes are pointing to your xml:ids? I thought you did, but perhaps that Schema line is missing or you didn't notice it was firing project validation errors. You can load any well-formed XML document into eXist, so you might have loaded an article or two with something breaking your project schema rules and eXist wouldn't have a problem with it. Go check. @spadafour @RJP43
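
One way to check this from inside eXist, as a rough sketch rather than a tested query: list every distinct @ref on an exposed company in the articles that finds no @xml:id anywhere in the site index. The collection path and the site index path below are placeholders I made up, and I am assuming the @ref values may carry a leading #, so swap in the real locations before running it.

```xquery
xquery version "3.0";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Placeholder paths (assumptions): substitute the real locations in your database. :)
let $articles := collection('/db/Nelson/articles')
let $siteIndex := doc('/db/Nelson/siteIndex.xml')

(: Every xml:id defined anywhere in the site index. :)
let $ids := $siteIndex//@xml:id/string()

(: Distinct @ref tokens on exposed companies, with any leading '#' removed. :)
let $refs := distinct-values(
    for $r in $articles//tei:orgName[@type = 'exposedCompany']/@ref
    for $token in tokenize(normalize-space($r), '\s+')
    return replace($token, '^#', '')
)

(: Report only the refs that have no matching site index entry. :)
for $ref in $refs
where not($ref = $ids)
order by $ref
return concat($ref, ' : no matching @xml:id in the site index')
```

If this returns nothing, every @ref resolves to some id, and the non-hits would have to come from somewhere else, for instance from how the values are compared in the for loop.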

RJP43 commented 8 years ago

We have several things in our Schematron to check for this kind of messy data, and all of our XML files are green in oXygen. These are the rules we have for these things:

 <let name="si" value="doc('siteIndex.xml')//@xml:id"/>
    <pattern>
        <rule context="@ref|@resp|@corresp|@who|tei:w[@type='noun']/@ana|tei:w[@type='poss']/@ana|tei:rdg/@wit">
            <let name="tokens" value="for $i in tokenize(., '\s+') return substring-after($i,'#')"/>
            <assert test="every $token in $tokens satisfies $token = $si">The attribute (after the hashtag, #) must match a defined @xml:id in the Site Index file!</assert>
        </rule>
    </pattern>

This rule checks against our site index, so that each of those attributes is verified to have a corresponding site index entry. The rule fires correctly; I tested it when I added the new archetypes and company ids, before I got them into the Site Index.

    <pattern>
        <rule context="tei:text//tei:placeName">
            <report test="not(@type)">Element 'placeName' must contain @type.</report>
        </rule>
    </pattern>

    <pattern>
        <rule context="tei:text//tei:placeName">
            <assert
                test="@type = ('address','locRef','country','state','city')">@type may only be: address, locRef (location reference), city, state, or country.</assert>
        </rule>
    </pattern>

    <pattern>
        <rule context="tei:text//tei:placeName">
            <report test="@type='address' and not(@ref)">Addresses must have a corresponding @ref.</report>
        </rule>
    </pattern>

    <pattern>
        <rule context="tei:text//tei:placeName">
            <report test="not(@type='address') and @ref">Only addresses have a corresponding @ref.</report>
        </rule>
    </pattern>

    <pattern>
        <rule context="tei:orgName">
            <report test="not(@ref)">Element 'orgName' must contain @ref.</report>
        </rule>
    </pattern>

These are all the rules we have controlling addresses and organizations.

Do you see any errors with these or a way @spadafour and I can rework or add rules to grab the issues you are suggesting are in the markup?
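
One thing the first rule leaves open is that it accepts an @ref as long as the part after the # matches any @xml:id in the site index, not necessarily the id of an organization. A query along these lines might show whether any exposed-company @ref resolves to a different kind of entry. It is only a sketch: the two paths are placeholders, and it assumes organization entries in the site index are org or orgName elements, which may not match the real file.

```xquery
xquery version "3.0";
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Placeholder paths (assumptions): substitute the real locations in your database. :)
let $articles := collection('/db/Nelson/articles')
let $siteIndex := doc('/db/Nelson/siteIndex.xml')

(: Same token handling as the Schematron rule: split on whitespace, keep the part after '#'. :)
for $ref in distinct-values(
    for $r in $articles//tei:orgName[@type = 'exposedCompany']/@ref
    for $token in tokenize(normalize-space($r), '\s+')
    return substring-after($token, '#')
)
(: Whatever element in the site index carries this id, if any. :)
let $target := $siteIndex//*[@xml:id = $ref]
(: Keep only refs that resolve, but not to an organization entry (assumed: org or orgName). :)
where exists($target) and not($target/self::tei:org or $target/self::tei:orgName)
order by $ref
return concat($ref, ' resolves to <', string-join($target/name(), ', '), '> rather than an organization entry')
```

An empty result would suggest the refs point where they should, in which case it may be worth confirming that the files loaded into eXist are the same ones that validate in oXygen.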

ghbondar commented 8 years ago

@RJP43 @spadafour Here is a link to an 1898 panoramic map of central Chicago... streets are mostly still the same as today, so you can find a location using Google Maps to get a general idea where it is, and then locate it on this map, or other maps that you can Google using "Chicago 1888 maps" or some such... https://www.loc.gov/resource/g4104c.pm001530/

ghbondar commented 8 years ago

@RJP43 1875 street map of Chicago: https://upload.wikimedia.org/wikipedia/commons/3/31/Chicago-warner-beers-1875.jpg

ghbondar commented 8 years ago

@RJP43 This looks good: https://www.newberry.org/chicago-neighborhood-guide

RJP43 commented 8 years ago

@spadafour and I fixed the issue we were having with blank orgNames, and it was indeed an XML issue! Thanks @ebeshero for checking this and leading us in that direction!

Here is the link to the new data: http://dxcvm05.psc.edu:8080/exist/rest/db/Nelson/NellCompaniesAddresses.tsv

ebeshero commented 8 years ago

@RJP43 Congrats on a successful debugging! @spadafour Also, try importing your street addresses into the GPS Visualizer geocoder to see if it spits back latitude and longitude coordinates for each. There are a variety of ways to output: try KML as well as plain text:

http://www.gpsvisualizer.com/geocoder/