Schema parser struggles with additional namespaces

Themanwithoutaplan commented 9 years ago

The Office OpenXML schemas are spread out across multiple files. parse_schema_file seems to struggle with the various namespaces in use. It also struggles with the encoding declaration of the file which is weird, because lxml doesn't when I read it. I wonder if that's because it's using fromstring(file.read()) rather than parse(file, parser)?

schema = parse_schema_file("openpyxl/tests/schemas/sml.xsd")
Traceback (most recent call last):
  File "/Applications/WingIDE.app/Contents/Resources/src/debug/tserver/_sandbox.py", line 1, in <module>
    # Used internally for debug sandbox under external interpreter
  File "/Users/charlieclark/Projects/openpyxl/lib/python3.4/site-packages/spyne/util/xml.py", line 153, in parse_schema_file
    .parse_schema(elt)
  File "/Users/charlieclark/Projects/openpyxl/lib/python3.4/site-packages/spyne/interface/xml_schema/parser.py", line 545, in parse_schema
    file_name = self.files[imp.namespace]
builtins.KeyError: 'http://schemas.openxmlformats.org/officeDocument/2006/relationships'

Themanwithoutaplan commented 9 years ago

Sorry, I'm too stupid to work out how to add a file. You can get the schemas from the specification; you need Part 1 from http://www.ecma-international.org/publications/standards/Ecma-376.htm

plq commented 9 years ago

So that wasn't a bug after all. You need to pass a namespace-to-filename map to parse_schema_file for it to import schemas.
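To see which entries such a map needs, one option is to list the namespaces a schema pulls in via `<xs:import>` and compare them with the map. The following is a minimal sketch using only the standard library; `imported_namespaces`, `missing_from_map` and the sample schema are made up for illustration and are not part of spyne.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"


def imported_namespaces(xsd_source):
    """Return the namespaces referenced by <xs:import> in an XSD document."""
    root = ET.fromstring(xsd_source)
    return [imp.get("namespace") for imp in root.iter("{%s}import" % XS)]


def missing_from_map(xsd_source, ns_file_map):
    """Return imported namespaces that have no file name in the map."""
    return [ns for ns in imported_namespaces(xsd_source)
            if ns not in ns_file_map]


# Made-up schema with two imports; only one of them is in the map.
SAMPLE = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="urn:example:main">
  <xs:import namespace="urn:example:shared" schemaLocation="shared.xsd"/>
  <xs:import namespace="urn:example:drawing" schemaLocation="drawing.xsd"/>
</xs:schema>"""

print(missing_from_map(SAMPLE, {"urn:example:shared": "shared.xsd"}))
# prints ['urn:example:drawing']
```

Any namespace this reports is a candidate for the kind of KeyError shown in the first traceback.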

Here's what I did to your gist: https://gist.github.com/plq/202269b57bae168d9563

I'd think twice before clicking on it though -- clone it from https://gist.github.com/202269b57bae168d9563.git instead. There's a parse.py in there that shows how it should work.

Currently it chokes on an attribute definition. I'm looking into it.

plq commented 9 years ago

That's fixed as of 32579ee567e32dae7510b0bdfc5531f70080e2ed. Please try again with a full namespace map.

Themanwithoutaplan commented 9 years ago

Thanks, but even after completing the namespace:file map I'm getting some errors:

DEBUG:spyne.interface.xml_schema.parser:%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Traceback (most recent call last):
  File "parse.py", line 24, in <module>
    parsed_schema = parse_schema_file('sml.xsd', files=ns_file_map)['some_ns']
  File "/Users/charlieclark/temp/spyne/spyne/spyne/util/xml.py", line 173, in parse_schema_file
    .parse_schema(elt)
  File "/Users/charlieclark/temp/spyne/spyne/spyne/interface/xml_schema/parser.py", line 593, in parse_schema
    c.print_pending(fail=True)
  File "/Users/charlieclark/temp/spyne/spyne/spyne/interface/xml_schema/parser.py", line 513, in print_pending
    raise Exception("there are still unresolved elements")
Exception: there are still unresolved elements

From what I can see the output looks pretty interesting though I still need to work out how I'd use a different class design for some of the stuff I'm hoping to do. But anything is better than reinventing the wheel!

plq commented 9 years ago

can you commit that so I can have a look at it?

plq commented 9 years ago

or paste here, I dunno

Themanwithoutaplan commented 9 years ago

ns_file_map = {
    'http://purl.oclc.org/ooxml/officeDocument/relationships':
        'shared-relationshipReference.xsd',
    'http://purl.oclc.org/ooxml/officeDocument/sharedTypes':
        'shared-commonSimpleTypes.xsd',
    'http://purl.oclc.org/ooxml/drawingml/spreadsheetDrawing':
        'dml-spreadsheetDrawing.xsd',
    'http://purl.oclc.org/ooxml/drawingml/main':
        'dml-main.xsd',
    'http://purl.oclc.org/ooxml/drawingml/diagram':
        'dml-diagram.xsd',
    'http://purl.oclc.org/ooxml/drawingml/chart':
        'dml-chartDrawing.xsd',
    'http://purl.oclc.org/ooxml/drawingml/picture':
        'dml-picture.xsd',
    'http://purl.oclc.org/ooxml/drawingml/lockedCanvas':
        'dml-lockedCanvas.xsd',
}
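A map like this does not have to be typed by hand: each schema file declares its own `targetNamespace`, so the map can be derived from a directory of `.xsd` files. This is a sketch under that assumption; `build_ns_file_map` is a hypothetical helper, not a spyne function, and the demo files are made up.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def build_ns_file_map(directory):
    """Map each schema's targetNamespace to its file name."""
    ns_file_map = {}
    for name in sorted(os.listdir(directory)):
        if not name.endswith(".xsd"):
            continue
        root = ET.parse(os.path.join(directory, name)).getroot()
        target_ns = root.get("targetNamespace")
        if target_ns is not None:  # schemas without a namespace are skipped
            ns_file_map[target_ns] = name
    return ns_file_map


# Demo with two made-up schema files in a temporary directory.
TEMPLATE = ('<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" '
            'targetNamespace="%s"/>')
with tempfile.TemporaryDirectory() as tmp:
    for file_name, ns in [("a.xsd", "urn:example:a"), ("b.xsd", "urn:example:b")]:
        with open(os.path.join(tmp, file_name), "w") as handle:
            handle.write(TEMPLATE % ns)
    print(build_ns_file_map(tmp))
```

The values are bare file names, matching the hand-written map; whether the parser expects them relative to the main schema or to the working directory is not stated in this thread.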

plq commented 9 years ago

if you check the logs, you'll see that that's because not all of simpleType is implemented. I've just implemented <xs:list>. <xs:union> remains. I'm not sure when I can do that.

Just parsing it is not enough, you must implement serialization and deserialization in protocol/xml.py as well, because otherwise defaults can't be read.

see: 4e37d9fc7c49795e6134246caec1b88bc2551b89

plq commented 9 years ago

Please have a look at the parser.py and the code itself. The error you're getting is not fatal, it just informs you that spyne had to drop some types. So if you pass force_full_parse=False to parse_schema_* you will have access to what's already parsed.

plq commented 9 years ago

I filed #422 and #423 as next steps to this issue. Patches are welcome.

Themanwithoutaplan commented 9 years ago

Thanks very much for the information and the tips. Will look at the code when I've some time and may even submit some patches, if someone will hold my hand while I use git!

I'm currently interested in the "shape" of the Python classes generated in terms of the API they provide for developers. It would be great if I could dump my own hand-rolled code in favour of your more extensive and reliable code, but I'll still want to change what gets generated. I'll provide more information on the follow-up issues, but thank you again for your help so far. It's very much appreciated.

plq commented 9 years ago

Glad to be of help.

If the schemas are set in stone, I see no harm in modifying the generated code. The generator is also at a very nascent stage (just 100 lines at this point) so it's OK to shape it to your needs.

Themanwithoutaplan commented 9 years ago

Sorry, this is probably down to my lack of knowledge about git but I did a pull (you don't need an update like hg, right?) and then I got

from spyne import BODY_STYLE_BARE, BODY_STYLE_WRAPPED, BODY_STYLE_EMPTY
ImportError: cannot import name BODY_STYLE_BARE

plq commented 9 years ago

wot. cd ..; rm -rf spyne; git clone git://github.com/arskom/spyne

Themanwithoutaplan commented 9 years ago

Weird, I'd cloned it only last night. Seems to be running now with the force_full_parse option (skip_errors might be a better name for this option). I then got a key error due to "some_ns" – presumably I just check the keys of the returned schema? And then I can start looking for individual types.

plq commented 9 years ago

yes, you need to put the targetNamespace of the schema you want there.
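One way to avoid hard-coding the key is to read `targetNamespace` from the schema itself and use that for the lookup. The sketch below assumes the parse result behaves like a mapping keyed by namespace, as the exchange above suggests; `target_namespace`, `select_schema` and the stand-in dict are hypothetical.

```python
import xml.etree.ElementTree as ET


def target_namespace(xsd_source):
    """Return the targetNamespace declared on the schema root element."""
    return ET.fromstring(xsd_source).get("targetNamespace")


def select_schema(parsed, xsd_source):
    """Look up the entry for the schema's own namespace in the parse result."""
    ns = target_namespace(xsd_source)
    if ns not in parsed:
        raise KeyError("%r not found; available namespaces: %s"
                       % (ns, sorted(parsed)))
    return parsed[ns]


# Stand-in for a real parse result: a plain dict keyed by namespace.
SAMPLE = ('<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" '
          'targetNamespace="urn:example:main"/>')
parsed = {"urn:example:main": "main schema", "urn:example:shared": "shared schema"}
print(select_schema(parsed, SAMPLE))
# prints main schema
```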

plq commented 9 years ago

1274a46372a778b8c8812f49cde7dd9290ae9dd5

Themanwithoutaplan commented 9 years ago

Okay, got that far myself. What do I need to do to get the class "definition"? I thought I saw a method for that somewhere.

I'm just playing around at the moment but if I get one of the generated classes it doesn't seem to be enforcing any of the constraints.

To get an idea of what I'm looking to do you might want to look at one of the classes I've created in openpyxl. Starting initially with descriptors to enforce constraints I've added some stuff to the base class and metaclass.

https://bitbucket.org/openpyxl/openpyxl/src/6b884f3f47f66358aa5c86f0e4fb6afabfb70c60/openpyxl/styles/fonts.py?at=2.2#cl-13

This is an example of a terribly designed bit of the spec with unnecessary nesting of elements instead of attributes, downright cryptic names because of abbreviation and general nastiness if you want to interact with it as a programmer.

As I hope you can see from the create and serialise methods we're both working along very similar lines.

plq commented 9 years ago

> Okay, got that far myself. What do I need to do to get the class "definition"? I thought I saw a method for that somewhere.

It's there in the xml example I sent you earlier.

> I'm just playing around at the moment but if I get one of the generated classes it doesn't seem to be enforcing any of the constraints.

lxml enforces the constraints; spyne only generates the schema (and, once validated, deserializes the document). Have a look at the xml protocol's validate_lxml function and the XmlSchema class in spyne.interface.

plq commented 9 years ago

> As I hope you can see from the create and serialise methods we're both working along very similar lines.

I disagree. You're doing everything manually :)

Themanwithoutaplan commented 9 years ago

haha, not any more. I've started working with the first generated code to test some of the ideas – nesting works quite nicely. We can't wait for validation at serialisation time and lxml is also not a hard dependency in the project, so a user needs a TypeError when creating an instance manually – this is what the library is for.

Some of the silliness in the current code (I deliberately showed one example which has weird behaviour) is to keep generated XML similar to what other programs do.

plq commented 9 years ago

Ah, now I understand what you want.

You only need to write your version of quick and dirty genpy.py to generate your type of class definitions from schema data. Then you can forego both lxml and spyne as a dependency.

plq commented 9 years ago

Unless, of course, you want to implement validation-on-assignment for Spyne. It's something I wanted to look at for some time. Then your efforts would be useful to a broader audience outside of openpyxl. Your choice, of course.

Themanwithoutaplan commented 9 years ago

Basically, yes. lxml is a test requirement and I'd have no trouble making Spyne an optional one for development, especially with the way it handles all the relevant schemas. So a command line might be something like python classify.py CT_AreaSer > AreaSer.py which would generate the whole caboodle of relevant imports and classes in order for this particularly nasty bit of a particularly nasty bit of the schema, that itself is peculiarly nasty.
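As a rough idea of what such a generator could do, the sketch below reads one named complexType from an XSD and emits Python source for a plain class. It uses only the standard library; `generate_class` and the `CT_Point` schema are made up for illustration, and it does not follow imports, resolve types, or reflect how spyne's generator works.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"


def generate_class(xsd_source, type_name):
    """Emit Python source for one named complexType (illustrative only)."""
    root = ET.fromstring(xsd_source)
    for ctype in root.iter(XS + "complexType"):
        if ctype.get("name") == type_name:
            break
    else:
        raise LookupError("no complexType named %r" % type_name)

    # Attributes and child elements both become constructor arguments.
    # Names that clash with Python keywords are not handled here.
    fields = [a.get("name") for a in ctype.iter(XS + "attribute")]
    fields += [e.get("name") for e in ctype.iter(XS + "element")]
    fields = [f for f in fields if f]  # skip ref= declarations without a name

    class_name = type_name[3:] if type_name.startswith("CT_") else type_name
    args = "".join(", %s=None" % f for f in fields)
    lines = ["class %s(object):" % class_name, "",
             "    def __init__(self%s):" % args]
    if fields:
        lines.extend("        self.%s = %s" % (f, f) for f in fields)
    else:
        lines.append("        pass")
    return "\n".join(lines) + "\n"


# Made-up schema fragment, far simpler than the real spreadsheet schemas.
SAMPLE = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="CT_Point">
    <xs:sequence>
      <xs:element name="label" type="xs:string"/>
    </xs:sequence>
    <xs:attribute name="x" type="xs:int"/>
    <xs:attribute name="y" type="xs:int"/>
  </xs:complexType>
</xs:schema>"""

print(generate_class(SAMPLE, "CT_Point"))
```

Swapping `object` for a project-specific base class is the obvious place to hook in descriptors or other validation.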

Happy to contribute to the project where I can because I'm sure others will find it useful but I'm still finding my feet in it. Meeting Eric (the other project maintainer) at FOSDEM this weekend and hope to be able to discuss it with him. Got to be an improvement on the largely procedural code we inherited from the initial port from PHP which spreads parser, API and writer code liberally across the project.

Themanwithoutaplan commented 9 years ago

The descriptors we use to implement typing should be reusable in any project (based on a cookbook recipe). See https://bitbucket.org/openpyxl/openpyxl/src/6b884f3f47f66358aa5c86f0e4fb6afabfb70c60/openpyxl/descriptors/base.py?at=2.2

We make types first level objects because of the convenience when coding. From what I've seen of your code you stay closer to the XML but expected_type=… should be usable. Let me know if that would be useful and I can look at integrating it.

When it comes to generating code we have an additional flag for nested elements. These are child elements in the schema that can almost always be better represented as attributes. I guess I can just take gen_py as a base and swap out the base class.
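To illustrate the nested-element idea, here is a minimal sketch that flattens child elements carrying a single `val` attribute into a plain dict and writes them back out. The `val` convention, the unqualified tag names and both helper functions are assumptions made for the example; real documents are namespaced and this is not openpyxl's code.

```python
import xml.etree.ElementTree as ET


def read_nested(node):
    """Collect <child val="..."/> elements into a flat dict of values."""
    return {child.tag: child.get("val")
            for child in node if "val" in child.attrib}


def write_nested(tag, values):
    """Write each value back as a child element with a val attribute."""
    root = ET.Element(tag)
    for name, value in values.items():
        ET.SubElement(root, name, val=str(value))
    return root


# Simplified, namespace-free input for the demo.
SOURCE = '<font><b val="1"/><sz val="11"/><name val="Calibri"/></font>'
values = read_nested(ET.fromstring(SOURCE))
print(values)
# prints {'b': '1', 'sz': '11', 'name': 'Calibri'}

# Round trip: writing and re-reading gives the same flat mapping.
assert read_nested(write_nested("font", values)) == values
```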

plq commented 9 years ago

Spyne's got its own version of expected_type, namely ModelBase.Value.

e.g.: https://github.com/plq/spyne/blob/1ffea2faff07b1219ab4b54f4e059003bce4849f/spyne/model/primitive.py#L303

These are not enforced at all, though. I'd be happy to make them an optional part of Spyne. I guess that'd change validation code dramatically, but I'm not afraid :)

What do you think about Python 3? I no longer accept Python 3-incompatible code, and the number of tests that fail under Python 3 is supposed to be declining. Spyne's xml parts already work in Python 3 and I wouldn't want this to change as we already advertise it.

Themanwithoutaplan commented 9 years ago

We support 2.6, 2.7, 3.3 and 3.4. Python 3 syntax is the standard and the compatibility imports are minimal. Apparently, io.BytesIO is slow on Python 2.6 because it uses StringIO and not cStringIO in the background but apart from that I've not heard of any real problems. Things get hairier if you want to keep support for 2.5 and earlier. 3.2 is just a pain because of the lack of support for the unicode literal which means you will get confusing failures.

Themanwithoutaplan commented 9 years ago

I haven't worked out how your nested classes work but the type hierarchy is basically the same. We use __set__() for validation on assignment where you have staticmethods.
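For readers unfamiliar with the pattern, validation on assignment with `__set__()` can be sketched as below. `Typed` and `Font` are illustrative names for this example rather than a copy of openpyxl's classes, and `__set_name__` needs Python 3.6 or later; on the older versions discussed in this thread a metaclass would have to record the attribute name instead.

```python
class Typed(object):
    """Descriptor that rejects values of the wrong type on assignment."""

    def __init__(self, expected_type, allow_none=False):
        self.expected_type = expected_type
        self.allow_none = allow_none
        self.name = None

    def __set_name__(self, owner, name):
        # Called once at class creation; records the attribute name.
        self.name = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        return instance.__dict__.get(self.name)

    def __set__(self, instance, value):
        is_allowed_none = value is None and self.allow_none
        if not is_allowed_none and not isinstance(value, self.expected_type):
            raise TypeError("%s must be %s, got %r"
                            % (self.name, self.expected_type.__name__, value))
        instance.__dict__[self.name] = value


class Font(object):
    name = Typed(expected_type=str)
    size = Typed(expected_type=float)
    bold = Typed(expected_type=bool, allow_none=True)

    def __init__(self, name, size, bold=None):
        # Assignments in __init__ go through the descriptors as well.
        self.name = name
        self.size = size
        self.bold = bold


font = Font("Calibri", 11.0)
try:
    font.size = "large"
except TypeError as exc:
    print("rejected:", exc)
```

Because the check runs inside `__init__` too, a badly constructed instance fails immediately instead of at serialisation time.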

plq commented 9 years ago

Any nesting is done via ComplexModelBase.

You'll probably have to come up with a ComplexModelValidatedOnAssignmentBase that has the necessary descriptor setup to validate values. From there you'll hook up to the usual spyne machinery and use genpy.py to your heart's content.

plq commented 9 years ago

btw, re: Python 3, I'm also supporting 2.6, 2.7 and 3.3+

Themanwithoutaplan commented 9 years ago

Doing my own version of ComplexModelBase looks like it would be a good topic for a sprint! Are you going to be at PyCon?

plq commented 9 years ago

There's now an initial implementation for validation-on-assignment. You need to pass voa to the type customization line.

Themanwithoutaplan commented 9 years ago

Thanks for the update. With now over 200 classes based on my own metaclass I won't be switching to Spyne for that but I might revisit the generator. Gave mention of Spyne during my talk at PyCon France.

arskom / spyne

Schema parser struggles with additional namespaces #420