fcla / xmlresolution

A web service that recursively finds schemas associated with an XML file
daitss/xmlresoution
GNU General Public License v3.0

bad status #30

Open childree opened 12 years ago

childree commented 12 years ago

When looking at the XMLresolution log we can see the number of times these errors have occurred:

$ grep " 500" xmlresolution.log

2012 Oct 14 10:22:11 fclnx30 XmlResolution[9179]:  INFO xmlresolution.fda.fcla.edu: Rack:     128.***.***.*** - - [14/Oct/2012 10:22:11] "GET /ieids/E20110314_AAABJG/ " 500 24 0.0682
2012 Oct 14 14:27:02 fclnx30 XmlResolution[9177]:  INFO xmlresolution.fda.fcla.edu: Rack:     128.***.***.*** - - [14/Oct/2012 14:27:02] "GET /ieids/E20110315_AAAAMP/ " 500 24 ...etc...

Thrown Errors:

Number of Cases: 112
Action(s): GET, POST

I'm currently working on more statistical information to include in this ticket that may help troubleshoot the issue.

Thank you, Jen
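A tally like the one above could be produced with a short script instead of grep. The following is only a sketch: the regular expression is an assumption based on the two sample log lines in this ticket, and `count_500s` is a hypothetical helper, not part of xmlresolution.

```ruby
# Tally HTTP 500 responses per request method from Rack-style log lines.
# REQUEST_RE is an assumption derived from the sample lines in this ticket;
# note the space that precedes the closing quote of the request.
REQUEST_RE = /"(?<verb>[A-Z]+) (?<path>\S+)\s*" (?<status>\d{3}) /

# Hypothetical helper: returns a Hash of method name => number of 500s.
def count_500s(lines)
  counts = Hash.new(0)
  lines.each do |line|
    m = REQUEST_RE.match(line)
    next unless m && m[:status] == "500"
    counts[m[:verb]] += 1
  end
  counts
end

sample = [
  '2012 Oct 14 10:22:11 fclnx30 XmlResolution[9179]:  INFO xmlresolution.fda.fcla.edu: Rack:     128.***.***.*** - - [14/Oct/2012 10:22:11] "GET /ieids/E20110314_AAABJG/ " 500 24 0.0682'
]
p count_500s(sample)
```

For a whole log file, `count_500s(File.foreach("xmlresolution.log"))` would process it line by line without loading it into memory.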

iterman commented 12 years ago

tar_writer.rb is trying to access a deleted schema. Code will be developed to prevent this situation.

lydiam commented 12 years ago

This is a new bug, according to Ira, that happens when a schema gets deleted and the next package needs that schema and tries to read it. It happens with GETs. The solution is to retain the schemas directory and only delete/recreate it when xmlresolution is restarted.

Jen sees the logs reflecting the problem on both GETs and POSTs. Ira will look for POSTs in the log.
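One way to guard a reader against a schema file that has been deleted from the cache is to re-fetch it on demand. This is a minimal sketch under stated assumptions, not the fix that was implemented: `read_schema`, the cache layout (one file per md5 under `cache_dir`), and the `fetch` callable are all hypothetical stand-ins for whatever tar_writer.rb actually does.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: return the cached schema text, re-fetching it when the
# cached file is missing. `fetch` is any callable that takes the schema
# location and returns its content; it stands in for the real HTTP retrieval.
def read_schema(cache_dir, md5, location, fetch)
  path = File.join(cache_dir, md5)
  File.read(path)
rescue Errno::ENOENT
  # The file was deleted (or never cached): fetch it again and restore it.
  text = fetch.call(location)
  FileUtils.mkdir_p(cache_dir)
  File.write(path, text)
  text
end

Dir.mktmpdir do |dir|
  fetch = ->(loc) { "<!-- schema fetched from #{loc} -->" }
  puts read_schema(dir, "d90774b02fa694f3b358b4ed828295be",
                   "http://dublincore.org/schemas/xmls/simpledc20021212.xsd", fetch)
end
```

Rescuing `Errno::ENOENT` rather than checking `File.exist?` first avoids a window in which the file could be deleted between the check and the read.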

lydiam commented 12 years ago

Carol suggests doing an xmlresolution code review in the near future.

childree commented 12 years ago

I'm adding further analysis of this issue to this ticket because of the number of errors that occurred today alone.

XMLresolution is attempting to GET/POST ONLY the following schemas when this error is thrown:

01490ebdea13c1bc82a17e4783daeeaa
1fadeaf88d4b93ab263f7c59917c26bc
2b2f6040cfc603d5873d7fa0bf976274
42519c72a741cc30e256b99369f1d735
447039d87705b9734e4fad11295eaa0b
534d7d1e9b53ece0bf0f5874444d8bcb
5e0bd6f94ec78a3a88fca2275ab05f9e
712fc5a7750e69f904f61086a997713c
7f0fd51a2a1490bbd68e5a68e7fc1738
99a72e44689e334ea8b851260347cf8e
d90774b02fa694f3b358b4ed828295be

Of this list, I still need to identify what these schema are:

447039d87705b9734e4fad11295eaa0b
534d7d1e9b53ece0bf0f5874444d8bcb
7f0fd51a2a1490bbd68e5a68e7fc1738
99a72e44689e334ea8b851260347cf8e

The following schema have been identified as follows:

01490ebdea13c1bc82a17e4783daeeaa = xlink.xsd 
1fadeaf88d4b93ab263f7c59917c26bc = XMLSchema.xsd
2b2f6040cfc603d5873d7fa0bf976274 = daitss.xsd
42519c72a741cc30e256b99369f1d735 = mets.xsd
5e0bd6f94ec78a3a88fca2275ab05f9e = xml.xsd (2001)
712fc5a7750e69f904f61086a997713c = xml.xsd (2001/03)
d90774b02fa694f3b358b4ed828295be = simpledc20021212.xsd

Analysis:

Key:
FAILURE = The schema worked on at time of failure
SUCCESS = The schema that was successful
TEST = My manual curl of the schema and calculation of md5sum to validate the SUCCESS

01490ebdea13c1bc82a17e4783daeeaa
FAILURE: <schema md5="01490ebdea13c1bc82a17e4783daeeaa" last_modified="2007-08-23T15:02:01-04:00" namespace="http://www.w3.org/1999/xlink" location="http://www.loc.gov/standards/xlink/xlink.xsd" status="success"/>
SUCCESS: <schema md5="6bdc7f9459a502964f889d70a335cece" last_modified="2007-08-23T15:02:01-04:00" namespace="http://www.w3.org/1999/xlink" location="http://www.loc.gov/standards/xlink/xlink.xsd" status="success"/>
TEST:
$ curl http://www.loc.gov/standards/xlink/xlink.xsd > xlink.xsd
$ md5sum xlink.xsd
6bdc7f9459a502964f889d70a335cece xlink.xsd

1fadeaf88d4b93ab263f7c59917c26bc
FAILURE: <schema md5="1fadeaf88d4b93ab263f7c59917c26bc" last_modified="2004-03-20T07:53:09-05:00" namespace="http://www.w3.org/2001/XMLSchema" location="http://www.w3.org/2001/XMLSchema.xsd" status="success"/>
SUCCESS: <schema md5="94ed1a93ce3147d01bcb2fc1126255ed" last_modified="2004-03-20T07:53:09-05:00" namespace="http://www.w3.org/2001/XMLSchema" location="http://www.w3.org/2001/XMLSchema.xsd" status="success"/>
TEST:
$ curl http://www.w3.org/2001/XMLSchema.xsd > XMLSchema.xsd
$ md5sum XMLSchema.xsd
94ed1a93ce3147d01bcb2fc1126255ed XMLSchema.xsd

2b2f6040cfc603d5873d7fa0bf976274
FAILURE: <schema md5="2b2f6040cfc603d5873d7fa0bf976274" last_modified="2012-05-30T14:05:46-04:00" namespace="http://www.fcla.edu/dls/md/daitss/" location="http://www.fcla.edu/dls/md/daitss/daitss.xsd" status="success"/>
SUCCESS: <schema md5="a2aa0a4a13503457317d2a94a4e8b038" last_modified="2012-05-30T14:05:46-04:00" namespace="http://www.fcla.edu/dls/md/daitss/" location="http://www.fcla.edu/dls/md/daitss/daitss.xsd" status="success"/>
TEST:
$ curl http://www.fcla.edu/dls/md/daitss/daitss.xsd > daitss.xsd
$ md5sum daitss.xsd
a2aa0a4a13503457317d2a94a4e8b038 daitss.xsd

42519c72a741cc30e256b99369f1d735
FAILURE: <schema md5="42519c72a741cc30e256b99369f1d735" last_modified="2012-03-05T12:02:18-05:00" namespace="http://www.loc.gov/METS/" location="http://www.loc.gov/standards/mets/mets.xsd" status="success"/>
SUCCESS: <schema md5="b8a3efa3d4a9ae8918f4abb1f53bc08f" last_modified="2012-03-05T12:02:18-05:00" namespace="http://www.loc.gov/METS/" location="http://www.loc.gov/standards/mets/mets.xsd" status="success"/>
TEST:
$ curl http://www.loc.gov/standards/mets/mets.xsd > mets.xsd
$ md5sum mets.xsd
b8a3efa3d4a9ae8918f4abb1f53bc08f mets.xsd

5e0bd6f94ec78a3a88fca2275ab05f9e
FAILURE: <schema md5="5e0bd6f94ec78a3a88fca2275ab05f9e" last_modified="2009-01-21T17:06:40-05:00" namespace="http://www.w3.org/XML/1998/namespace" location="http://www.w3.org/2001/xml.xsd" status="success"/>
SUCCESS: <schema md5="bf97e27bdd02f7031a8a71ea4d229daf" last_modified="2009-01-21T17:06:40-05:00" namespace="http://www.w3.org/XML/1998/namespace" location="http://www.w3.org/2001/xml.xsd" status="success"/>
TEST:
$ curl http://www.w3.org/2001/xml.xsd > xml.xsd
$ md5sum xml.xsd
bf97e27bdd02f7031a8a71ea4d229daf xml.xsd

712fc5a7750e69f904f61086a997713c
FAILURE: <schema md5="712fc5a7750e69f904f61086a997713c" last_modified="2004-03-31T12:57:18-05:00" namespace="http://www.w3.org/XML/1998/namespace" location="http://www.w3.org/2001/03/xml.xsd" status="success"/>
SUCCESS: <schema md5="2e2cf9072dc058dcda41b7ee77a5cb54" last_modified="2004-03-31T12:57:18-05:00" namespace="http://www.w3.org/XML/1998/namespace" location="http://www.w3.org/2001/03/xml.xsd" status="success"/>
TEST:
$ curl http://www.w3.org/2001/03/xml.xsd > xml.xsd
$ md5sum xml.xsd
2e2cf9072dc058dcda41b7ee77a5cb54 xml.xsd

d90774b02fa694f3b358b4ed828295be
FAILURE: <schema md5="d90774b02fa694f3b358b4ed828295be" last_modified="2012-08-21T17:14:26-04:00" namespace="http://purl.org/dc/elements/1.1/" location="http://dublincore.org/schemas/xmls/simpledc20021212.xsd" status="success"/>
SUCCESS: <schema md5="afd985136a7e721cfafa062287a27f45" last_modified="2012-08-23T15:33:48-04:00" namespace="http://purl.org/dc/elements/1.1/" location="http://dublincore.org/schemas/xmls/simpledc20021212.xsd" status="success"/>
TEST:
$ curl http://dublincore.org/schemas/xmls/simpledc20021212.xsd > simpledc20021212.xsd
$ md5sum simpledc20021212.xsd
afd985136a7e721cfafa062287a27f45 simpledc20021212.xsd

Conclusion:

It would appear that XMLresolution is either obtaining an outdated copy of each failing schema or corrupting it in some way. However, since the identical FAILURE md5sum appears multiple times throughout the xmlresolution.log files, I would say that XMLresolution is intermittently not getting the most recent schema. We need to look at our cache and squid to determine how old schemas are being pulled down.

lydiam commented 12 years ago

The FAILURE checksums appear to be the ones computed on the schema name, and the SUCCESS checksums are the ones computed on the contents.

childree commented 11 years ago

Yes, so perhaps, as we speculated Thursday evening, two versions of the code may be running and causing this issue to crop up, rather than the schemas being corrupted or incorrect. I think the next step will be to identify the exact code that computes this checksum and correct it to use the hash value of the filename and content combined, as described in #14.

childree commented 11 years ago

Actually, I'm not quite sure this is true. Let's use simpledc20021212.xsd as an example. Here is the original information from above about this schema:

d90774b02fa694f3b358b4ed828295be
FAILURE: <schema md5="d90774b02fa694f3b358b4ed828295be" last_modified="2012-08-21T17:14:26-04:00" namespace="http://purl.org/dc/elements/1.1/" location="http://dublincore.org/schemas/xmls/simpledc20021212.xsd" status="success"/>
SUCCESS: <schema md5="afd985136a7e721cfafa062287a27f45" last_modified="2012-08-23T15:33:48-04:00" namespace="http://purl.org/dc/elements/1.1/" location="http://dublincore.org/schemas/xmls/simpledc20021212.xsd" status="success"/>
TEST:
$ curl http://dublincore.org/schemas/xmls/simpledc20021212.xsd > simpledc20021212.xsd
$ md5sum simpledc20021212.xsd
afd985136a7e721cfafa062287a27f45 simpledc20021212.xsd

The md5sum of the content is:
$ md5sum simpledc20021212.xsd
afd985136a7e721cfafa062287a27f45 simpledc20021212.xsd

The md5sum of the schema name string is:
$ echo -n simpledc20021212.xsd | md5sum
049eb3434affd13dd872b188588ec7af -

Even if we include the newline, the md5sum of the schema name string is:
$ echo simpledc20021212.xsd | md5sum
21cdc0bb3bfcc584db89e639d53411f1 -

We're still left with not knowing how d90774b02fa694f3b358b4ed828295be was derived. As it turns out, this md5sum is the entire string of the location of the schema:
$ echo -n http://dublincore.org/schemas/xmls/simpledc20021212.xsd | md5sum
d90774b02fa694f3b358b4ed828295be -

Perhaps this was known but I find it interesting that the entire location of the schema is being used.
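The same comparison can be scripted. This is a minimal sketch using Ruby's standard Digest library; the variable names are illustrative, and the `reported` value is simply the FAILURE checksum quoted above.

```ruby
require "digest/md5"

location = "http://dublincore.org/schemas/xmls/simpledc20021212.xsd"
reported = "d90774b02fa694f3b358b4ed828295be" # FAILURE value quoted above

# Digest of the bare file name versus digest of the full location string.
name_digest = Digest::MD5.hexdigest(File.basename(location))
url_digest  = Digest::MD5.hexdigest(location)

puts "file name: #{name_digest}"
puts "full URL:  #{url_digest}"
puts "URL digest matches reported value: #{url_digest == reported}"
```

`Digest::MD5.hexdigest` hashes exactly the bytes it is given, so it corresponds to the `echo -n` form of the shell commands (no trailing newline).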

childree commented 11 years ago

It appears I've misunderstood what should be happening. In production, the md5sum of the schema should be calculated on the URL string and is stored in the manifest.xml file within the xmlres-* directory of the AIP. Apologies for the confusion.
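Given that rule, manifest entries could be sanity-checked by recomputing the checksum from each location. The sketch below assumes that every `<schema .../>` element carries `md5` and `location` attributes in the order shown in the records quoted in this ticket; `mismatched_schemas` is a hypothetical helper, and a regular expression is used only for brevity (an XML parser would be more robust).

```ruby
require "digest/md5"

# Assumption: md5 comes first and location later within each <schema .../>
# element, as in the records quoted in this ticket.
SCHEMA_RE = /<schema\s+md5="(?<md5>[0-9a-f]{32})"[^>]*?\slocation="(?<location>[^"]+)"/

# Hypothetical helper: returns [md5, location] pairs whose md5 attribute is
# NOT the MD5 of the location string.
def mismatched_schemas(manifest_text)
  manifest_text.scan(SCHEMA_RE).reject do |md5, location|
    Digest::MD5.hexdigest(location) == md5
  end
end

# Usage, with a hypothetical path to a manifest inside an AIP's xmlres-* directory:
#   puts mismatched_schemas(File.read("manifest.xml")).inspect
```

An empty result would mean every recorded checksum is consistent with being computed on the URL string.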

lydiam commented 11 years ago

Per this morning's meeting: it appears that this problem was caused by code implemented relating to #17, where schemas are deleted when no collection references the schema.