Letractively / rdflib

Automatically exported from code.google.com/p/rdflib

RDFa Parser Update #79

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
The RDFa parser has been replaced by integrated pyRdfa code (Ivan Herman has
given us permission to do so). The previous parser was quite old and didn't
conform to the recommendation. (I also got an OK from Elias Torres, who made
the initial implementation.)

I have completed an initial integration of this and committed it to trunk
(rev 1691). These are the changes in short:

 * Removed "rdflib/syntax/parsers/RDFaParser.py".
 * Added new "rdflib/syntax/parsers/rdfa/" *package* (new RDFaParser in "__init__.py").
 * Adapted "rdflib/plugin.py" to use new parser.

The code is mostly copied verbatim from pyRdfa, with some minor cleanup and
documentation adaptation. This may need further work (doc format, code
conventions), but that might be better done in a coordinated effort (to
encompass more parts in need of cleanup)?

Tests have been adapted:

 * Removed old "test/test_rdfa.py" and "test/{ntriples,rdfdiff}.py" (only used by old test).
 * Emptied current "test/rdfa" and added a new test module along with a subdir for a copy of the W3C RDFa testsuite.

Furthermore: the class "IsomorphicTestableGraph" has been moved from
"test_sparql/BisonSPARQLParser/test.py" to a new module "rdflib.graphutils",
and renamed to "IsomorphicGraph". Does anyone have a better name and/or
location for that?
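
As a rough illustration of what a comparison like the one IsomorphicGraph is
named for has to decide, here is a minimal sketch. It is not the
rdflib.graphutils implementation: triples are plain tuples, blank nodes are
assumed to be strings starting with "_:", and the helper names are made up for
this example. The brute-force search over relabelings is only practical for
small test graphs.

```python
from itertools import permutations


def is_bnode(term):
    """Assumption for this sketch: blank nodes are strings like '_:b0'."""
    return isinstance(term, str) and term.startswith("_:")


def bnodes(triples):
    """Return the sorted list of blank node labels used in the triples."""
    return sorted({t for triple in triples for t in triple if is_bnode(t)})


def isomorphic(a, b):
    """True if the two triple sets are equal up to blank node relabeling."""
    a, b = set(a), set(b)
    if len(a) != len(b):
        return False
    a_nodes, b_nodes = bnodes(a), bnodes(b)
    if len(a_nodes) != len(b_nodes):
        return False
    # Try every one-to-one mapping from a's blank nodes onto b's.
    for candidate in permutations(b_nodes):
        mapping = dict(zip(a_nodes, candidate))
        renamed = {tuple(mapping.get(t, t) for t in triple) for triple in a}
        if renamed == b:
            return True
    return False


g1 = [("_:a", "ex:knows", "_:b"), ("_:a", "ex:name", "Alice")]
g2 = [("_:x", "ex:name", "Alice"), ("_:x", "ex:knows", "_:y")]
same = isomorphic(g1, g2)  # the graphs differ only in blank node labels
```

A plain set comparison of g1 and g2 would report a difference, which is why
parser tests that produce blank nodes need this kind of check.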

(.. Note: IsomorphicGraph is currently used to make up for sparql bugs which
cause some tests to fail where they should not. As of this change, all rdfa
tests pass.)

(.. Also note that the "BisonSPARQLParser/test.py" didn't work before and
still doesn't.)

I would like to have this change reviewed.

Original issue reported on code.google.com by lindstr...@gmail.com on 6 Aug 2009 at 11:09

GoogleCodeExporter commented 9 years ago
The changes are in r1691 -- I'll take a look.

Original comment by eik...@gmail.com on 7 Aug 2009 at 6:48

GoogleCodeExporter commented 9 years ago
It looks like the html5lib usage is failing, when the source isn't valid xhtml:

>>> from rdflib.graph import ConjunctiveGraph
>>> g = ConjunctiveGraph()
>>> g.parse(location='http://oreilly.com/catalog/9781565926288/',
...         format='rdfa', lax=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ed/Projects/rdflib/rdflib/graph.py", line 985, in parse
    location=location, file=file, data=data, **args)
  File "/home/ed/Projects/rdflib/rdflib/graph.py", line 785, in parse
    parser.parse(source, self, **args)
  File "/home/ed/Projects/rdflib/rdflib/syntax/parsers/rdfa/__init__.py", line 170, in parse
    dom = _try_process_source(stream, options)
  File "/home/ed/Projects/rdflib/rdflib/syntax/parsers/rdfa/__init__.py", line 245, in _try_process_source
    parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
NameError: global name 'html5lib' is not defined

Incidentally, this URL works OK using the RDFa Distiller, which is a service
based on pyRdfa:

  http://www.w3.org/2007/08/pyRdfa/extract?url=http://oreilly.com/catalog/9781565926288/
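
The cause of the NameError is not established in this thread. One plausible
reading is that the name html5lib was never bound in the scope where the lax
fallback runs, even though the package is installed. A minimal sketch of a
guarded optional import that rules out that failure mode follows. The helper
names load_optional and parse_lax are hypothetical and are not rdflib's code;
only the html5lib calls mirror the line shown in the traceback.

```python
import importlib


def load_optional(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def parse_lax(source, module_name="html5lib"):
    """Hypothetical lax-parsing entry point, for illustration only."""
    html5lib = load_optional(module_name)
    if html5lib is None:
        # Fail with a clear message instead of a NameError at the call site.
        raise ImportError(
            "lax parsing needs the optional %r package" % module_name
        )
    # The module is bound to a local name here, so the lookup cannot fail.
    builder = html5lib.treebuilders.getTreeBuilder("dom")
    parser = html5lib.HTMLParser(tree=builder)
    return parser.parse(source)
```

With this shape, a missing dependency surfaces as an ImportError that names the
package, and an installed one is always visible where it is used.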

Original comment by ed.summers on 18 Dec 2009 at 7:28

GoogleCodeExporter commented 9 years ago
Forgot to mention that I do have html5lib installed ...

Original comment by ed.summers on 18 Dec 2009 at 7:30

GoogleCodeExporter commented 9 years ago
Since the RDFa parser update is in and mostly working, I'd like to close this
ticket. Any reason to keep this ticket open, or any tickets we should create
from this one?

Original comment by eik...@gmail.com on 2 Feb 2010 at 9:15

GoogleCodeExporter commented 9 years ago
I agree. It's stable, tested and both used and improved on by others now.
Anything odd popping up would warrant new, specific tickets. Marked as Fixed.

Original comment by lindstr...@gmail.com on 3 Feb 2010 at 5:41

GoogleCodeExporter commented 9 years ago
So should we create tickets for failing RDFa test suite tests? You can run the
test suite with run_tests.py in trunk. These are the ones that fail, and they
are most of the test failures that remain:

TC #11
TC #92 
TC #94
TC #100
TC #101
TC #102
TC #103
TC #114
TC #117

Original comment by ed.summers on 3 Feb 2010 at 7:28

GoogleCodeExporter commented 9 years ago
Please create an individual ticket for each one. They fail for various
mysterious reasons, and then we have somewhere to discuss them.

Original comment by gromgull on 3 Feb 2010 at 7:37

GoogleCodeExporter commented 9 years ago
And another thing: there is always the N3 test trick, where we just moved the
tests that fail to another folder, i.e. the n3 folder and the
broken_parse_test folder under test.
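
A small sketch of the bookkeeping behind that trick, applied to the failing
RDFa test case numbers listed earlier in this thread. The names KNOWN_FAILING
and partition_tests are made up for this example, and the range of test IDs
passed in is an assumption, not the actual contents of the W3C suite. The
sketch only decides which group each test belongs to and leaves any moving of
files to the caller.

```python
# Test case numbers reported as failing earlier in this thread.
KNOWN_FAILING = {11, 92, 94, 100, 101, 102, 103, 114, 117}


def partition_tests(test_ids, known_failing=KNOWN_FAILING):
    """Split test IDs into (expected to pass, known broken), keeping order."""
    passing = [t for t in test_ids if t not in known_failing]
    broken = [t for t in test_ids if t in known_failing]
    return passing, broken


# Assumed ID range, for illustration only.
passing, broken = partition_tests(range(1, 121))
```

Keeping the broken list in one place means a test can be promoted back to the
main run by deleting a single number once its ticket is resolved.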

Original comment by gromgull on 3 Feb 2010 at 7:39

GoogleCodeExporter commented 9 years ago
I tried to annotate my commits, but didn't get the format right, so I'm
commenting here. In r1766 and r1767 I moved all the notation3 parsing bits into
the notation3 module, ridding us of the non-lowercase module named N3Parser. I
also moved all the rdfxml parsing into a module of that name, removing a couple
of non-lowercase module names.

Original comment by eik...@gmail.com on 3 Feb 2010 at 7:43

GoogleCodeExporter commented 9 years ago
See also r1765 for the notation3 related module shuffling update.

Original comment by eik...@gmail.com on 3 Feb 2010 at 7:45