datamade / probablepeople

:family: a python library for parsing unstructured western names into name components.
http://parserator.datamade.us/probablepeople
MIT License

XML comments cause error in parserator train #72

Open az0 opened 6 years ago

az0 commented 6 years ago

The error message can be seen in Travis build 99:

Traceback (most recent call last):
  File "/home/travis/virtualenv/python2.7.14/bin/parserator", line 11, in <module>
    sys.exit(dispatch())
  File "/home/travis/virtualenv/python2.7.14/lib/python2.7/site-packages/parserator/main.py", line 58, in dispatch
    args.func(args)
  File "/home/travis/virtualenv/python2.7.14/lib/python2.7/site-packages/parserator/main.py", line 85, in train
    training.train(module, train_file_list, modelfile)
  File "/home/travis/virtualenv/python2.7.14/lib/python2.7/site-packages/parserator/training.py", line 83, in train
    training_data = list(readTrainingData(train_file_list, module.GROUP_LABEL))
  File "/home/travis/virtualenv/python2.7.14/lib/python2.7/site-packages/parserator/training.py", line 62, in readTrainingData
    sequence_xml = etree.fromstring(component_string)
  File "src/lxml/etree.pyx", line 3212, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1764, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1126, in lxml.etree._BaseParser._parseDoc
  File "src/lxml/parser.pxi", line 600, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 710, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 639, in lxml.etree._raiseParseError
  File "<string>", line 1
lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1
make: *** [probablepeople/generic_learned_settings.crfsuite] Error 1

Would you please allow XML comments? They could help organize the training and test data sets.

fgregg commented 6 years ago

It should be possible (see https://stackoverflow.com/questions/18313818/how-to-not-load-the-comments-while-parsing-xml-in-lxml), but it will need to be changed in parserator.
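
For reference, here is a minimal sketch of the approach described in that Stack Overflow question: building an lxml parser with `remove_comments=True` so that comment nodes never reach the code that walks the training data. This is not parserator's actual code. The sample XML, the tag names, and the `parse_training_xml` helper are hypothetical and only illustrate the idea.

```python
from lxml import etree

# Hypothetical training data with an XML comment used to organize examples.
TRAINING_XML = b"""<NameCollection>
  <!-- simple person names -->
  <Name><GivenName>Jane</GivenName> <Surname>Doe</Surname></Name>
  <Name><GivenName>John</GivenName> <Surname>Smith</Surname></Name>
</NameCollection>"""


def parse_training_xml(xml_bytes, strip_comments=True):
    """Parse training XML, optionally dropping comment nodes while parsing.

    Hypothetical helper for illustration; not part of parserator.
    """
    parser = etree.XMLParser(remove_comments=strip_comments)
    return etree.fromstring(xml_bytes, parser)


# With lxml's default settings, the comment is kept as a child node of the root.
children_with_comments = list(parse_training_xml(TRAINING_XML, strip_comments=False))

# With remove_comments=True, only the real <Name> elements remain.
children_without_comments = list(parse_training_xml(TRAINING_XML))

# Collect (token, label) pairs from each remaining sequence.
labeled_sequences = [
    [(child.text, child.tag) for child in name]
    for name in children_without_comments
]
```

Any code that iterates over the children of the root element and assumes each one is a labeled sequence would then see only the elements, not the comments.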

az0 commented 9 months ago

Today XML comments are working fine.