TEIC / TEI

The Text Encoding Initiative Guidelines
https://www.tei-c.org
Other
271 stars 88 forks source link

build fails due to `rnv` out of memory #2239

Closed sydb closed 2 years ago

sydb commented 2 years ago

This makes no sense (to me) at all. Both in my local system (where I can test things) and on Jenkins2 (UVic, where I can’t) the build is failing because the rnv command is reporting “error: out of memory” both on p5.xml (which is quite huge) and on testplus.xml (which is < 300 lines long, although to be fair, the schema it is validating against is huge).

So I tried creating a 1-line schema and a 1-line XML instance to validate against it. On my system rnv reported “error: out of memory” even for these tiny files!

I am running version 1.7.8 of rnv on my system, Jenkins2 is running version 1.8.0.

martindholmes commented 2 years ago

This is my experience too, with version 1.7.8 and version 1.8.0 (I'm not sure how I got that one). I don't think rnv itself has changed, so I'm at a loss. Is it possible that something in make itself has changed, causing rnv to be called with less memory available to it?

sydb commented 2 years ago

I don’t think make has anything to do with it, as I tested against a tiny RNC and XML file outside of the Make environment (just directly on the commandline).

hcayless commented 2 years ago

Presumably it's the size of the schema rather than the file being validated that's the problem. It may be time to start throwing some stuff overboard.

sydb commented 2 years ago

I don’t think so, @hcayless. I got the error message when validating

<?xml version="1.0" encoding="UTF-8"?>
<test>13.2</test>

against

datatypes xs = "http://www.w3.org/2001/XMLSchema-datatypes"

element test { xs:double }

The two files total 145 bytes. (Or characters, not sure which.)

sydb commented 2 years ago

I do wonder if a recent update to expat 2.4.4 has anything to do with it. But as far as I can tell, the newest of the dozens of copies of expat on my system is 2.2.9. Still investigating (and not making much progress).

sydb commented 2 years ago

Looks like this is a recently introduced but now known bug. (See below.) I do not want to just ditch the rnv validation, though. I like the redundancy. So my suggestion is to disable rnv validation in the build process by commenting it out for now, and testing it from time to time. (E.g., every time we update p5subset.) This is a devastating bug, so will probably be fixed soon.

sydb commented 2 years ago

below

hartwork commented 2 years ago

@sydb rnv turns out to also use colon (':') for a separator:

# git grep ParserCreate | fgrep -v /Expat/
arx.c:  expat=XML_ParserCreateNS(NULL,':');
tools/rvp.py:parser=xml.parsers.expat.ParserCreate('UTF-8',':') # last colon in the name
xcl.c:  expat=XML_ParserCreateNS(NULL,':');

What is your source for rnv, how do you install it, which version(, where in the CI)?

martindholmes commented 2 years ago

The most recent version I can find that will build is dtolpin's fork of @hartwork's version:

https://github.com/dtolpin/RNV

I can't find a way to build the original on my system, and there are no instructions for doing that. On the dtolpin repo I can use the Makefile.gnu. @hartwork Will your expat 2.4.6 update roll out to Ubuntu automatically, or do we need to install it manually?

sydb commented 2 years ago

@hartwork —

I am not 100% sure, but most likely I downloaded the rnv-1.7.8.zip file from the davidashen FTP site, and then ran make -f Makefile.gnu.

(I do not understand what colons have to do with it, but probably don’t need to. :-)

hartwork commented 2 years ago

The most recent version I can find that will build is dtolpin's fork of @hartwork's version:

https://github.com/dtolpin/RNV

@martindholmes it's actually the other way around: He's the original author, and I helped with some build system fixes, and my "fork" is more recent but also ancient, 2015. His repo points to mine in the description. (Another fork at https://github.com/starseeker/librnv is more active but it's also affected by the colon issue, bundles vulnerable Expat 2.4.3, does not allow reporting issues, and has no e-mail address for contact.)

Potentially the best fix is to auto-convert the RNC syntax to (more widely supported) XML-RNG syntax and then use some other tool to validate against XML-RNG, e.g. xmllint of libxml2 or Trang, see https://relaxng.org/ for more. What do you think?

@hartwork Will your expat 2.4.6 update roll out to Ubuntu automatically, or do we need to install it manually?

That depends on your setup. In any case, 2.4.6 will not help with RNV, because the issue is in RNV.

(I do not understand what colons have to do with it, but probably don’t need to. :-)

Okay :man_shrugging:

martindholmes commented 2 years ago

@hartwork Does this mean that RNV can't actually be fixed to work with the updated libexpat? We do generate RNGs and do validation with other tools, but RNV has been a handy additional checking tool in our test suite. It seems a shame to lose it for the sake of a colon. :-)

hartwork commented 2 years ago

@hartwork Does this mean that RNV can't actually be fixed to work with the updated libexpat? We do generate RNGs and do validation with other tools, but RNV has been a handy additional checking tool in our test suite. It seems a shame to lose it for the sake of a colon. :-)

@martindholmes it can likely be fixed, but the question is in which copy of RNV to adjust and use, what other places in RNV need adjustment to not break it when changing the separator, and if you want to use unmaintained or maintained software. I can fix this issue and add CI to my https://github.com/hartwork/rnv but it won't make it maintained.

sydb commented 2 years ago

So RNV itself will need to be modified? (I had kinda hoped that when Expat is fixed, RNV would magically work again. It (RNV 1.7.8) clearly uses whatever Expat is on the system, as an experiment I tried earlier today demonstrates:

hartwork commented 2 years ago

So RNV itself will need to be modified?

@sydb yes.

sydb commented 2 years ago

Sigh.

hartwork commented 2 years ago

I would like to share the following news:

Have a nice weekend and a good start to the week.

martindholmes commented 2 years ago

Thank you @hartwork, this is very much appreciated! @sydb, as soon as I get a chance I'll build the https://github.com/hartwork/rnv/tree/namespace-separator branch and run our tests with it.

sydb commented 2 years ago

Thank you very much @hartwork! Perhaps as early as tonight I can try to figure out how to build RNV from source your way and test. (I tried by just copying over the Makefile.gnu from 1.7.8, and that worked well enough to test that rnv on this new branch worked on my tiny test file. Which, embarrassingly, turned out to be invalid.)

sydb commented 2 years ago

@martindholmes — Good idea. As mentioned above a I kludged a local rnv, and it worked on one tiny test file. I have to leave the computer for an hour or three now, but will see if I can set my machine to run the TEI build using that new version before I leave … BTW, I don’t really understand the YAML used to describe the build process (from the master branch). But it looks like

$ ./bootstrap
$ mkdir build
$ cd build
$ ../configure
$ cd -
$ make -C build

should do the trick (perhaps adding -j 7 to the make command on my 8-core machine), but I have not actually tried it yet.

martindholmes commented 2 years ago

The final step gives me Error 127, but I was previously able to build 1.7.11 by puzzling through the Github Action, so I'll go back and try that again...

martindholmes commented 2 years ago

Got this, but sudo apt install asciidoc fixed it:

make -C build

martindholmes commented 2 years ago

So I have a build from the branch, but the P5 tests fail the first time rnv is invoked with:

xmllint --noent --dropdtd testplus.xml | rnv testplus.rnc testplus.rnc:4853:20: error: syntax error: ")" expected, "" found testplus.rnc:4858:20: error: syntax error: ")" expected, "" found testplus.rnc:4884:20: error: syntax error: ")" expected, "" found testplus.rnc:4889:20: error: syntax error: ")" expected, "" found testplus.rnc:4966:17: error: syntax error: "}" expected, "{" found testplus.rnc:4968:3: error: syntax error: end of file expected, "}" found testplus.rnc:20776:31: error: missing start testplus.rnc:20776:31: error: undefined reference to 'xsd:float' ...and many more undefined refs to xsd:something...

I'm guessing this is something to do with the colon in the datatype.

sydb commented 2 years ago

Might be due to the unorthodox way I built it, but the good news is that the new pipe-instead version of rnv from @hartwork’s repo, namespace-separator branch seems to run perfectly well without that out of memory error. The bad news is it found all sorts of valid constructs inside some of our RNC files invalid. E.g.:

p5odds.rnc:3547:15: error: syntax error: "{" expected, "*" found
p5odds.rnc:3547:17: error: syntax error: "{" unexpected 
p5odds.rnc:3594:1: error: syntax error: end of file expected, "}" found
p5odds.rnc:15159:1: error: missing start
p5odds.rnc:15159:1: error: undefined reference to 'xsd:NMTOKEN'
p5odds.rnc:15159:1: error: undefined reference to 'xsd:ID'
p5odds.rnc:15159:1: error: undefined reference to 'xsd:IDREF'
p5odds.rnc:15159:1: error: undefined reference to 'xsd:QName'
p5odds.rnc:15159:1: error: undefined reference to 'xsd:anyURI'
p5odds.rnc:15159:1: error: undefined reference to 'xsd:NCName'
p5odds.rnc:15159:1: error: undefined reference to 'xsd:string'
sydb commented 2 years ago

Yes, it looks like complaints are centered on places where there are colons. But there are lots of places that have colons before that a:* on line 3547 that did not cause errors. (It is the first occurrence of “something-colon-asterisk”, though.)

Yes, that may be it. In both cases (testplus.rnc and p5odds.rnc) things blow up on first occurrence of “something-colon-asterisk”.

Gotta go now. Back in a bit.

hartwork commented 2 years ago

I would like to bring this pull request to your attention: https://github.com/libexpat/libexpat/pull/577

hartwork commented 2 years ago

Any news? (Many systems have the CVE-2022-25236 relaxation fix from Expat 2.4.7 applied by now.)

martindholmes commented 2 years ago

Should we be trying a fresh build, expecting that libexpat sources will have been updated and the problem will be fixed?

hartwork commented 2 years ago

Maybe, I don't know your infrastructure.

peterstadler commented 2 years ago

Just triggered a build on our main Jenkins and hit the memory issue:

java -jar ../Utilities/lib/trang.jar testplus.rng testplus.rnc
xmllint --noent --dropdtd  testplus.xml | rnv testplus.rnc
stdin:1:0: error: out of memory

The Jenkins is running jenkins/jenkins:lts image

hartwork commented 2 years ago

@peterstadler thanks for sharing that. Image jenkins/jenkins:lts is Debian GNU/Linux 11 (bullseye) under the hood and bullseye is one of the unpatched places as of today. I know that someone is at it so I think that will be taken care of in the next few days, and then jenkins/jenkins:lts may fix itself, depending on Jenkins' docker image creation automation. Thanks for trying!

hartwork commented 2 years ago

PS: @peterstadler I only now see that first trang is converting .rng to .rnc and then RNV is called above. And you even have xmllint already. xmllint has an argument --relaxng SCHEMA. Was there a conscious decision against using xmllint for Relax NG validation?

peterstadler commented 2 years ago

PS: @peterstadler I only now see that first trang is converting .rng to .rnc and then RNV is called above. And you even have xmllint already. xmllint has an argument --relaxng SCHEMA. Was there a conscious decision against using xmllint for Relax NG validation?

Hmm, no idea … this code is more than 12 years old, see https://github.com/TEIC/TEI/commit/e0f5dd10e09995e3da945ea2229b3802a900f1f0 (and search for xmllint). Maybe @sydb has an idea?

hartwork commented 2 years ago

I know that someone is at it so I think that will be taken care of in the next few days, and then jenkins/jenkins:lts may fix itself, depending on Jenkins' docker image creation automation.

The fix seems to have hit bullseye-security by now with Expat 2.2.10-2+deb11u3: https://tracker.debian.org/news/1310255/accepted-expat-2210-2deb11u3-source-into-stable-security-embargoed-stable-security/ . It is just image jenkins/jenkins:lts of 4 days ago that would need a rebuild and upload now, if I'm not mistaken.

peterstadler commented 2 years ago

Stylesheets group urges for getting the tests back. The current plan is to comment out the rnv part for now (or even put it into a conditional which checks the expat version) and put it back in when the fix is propagated to our Jenkins image.

hartwork commented 2 years ago

@peterstadler Docker image jenkins/jenkins:lts was last built and pushed 5 hours ago but they do not apply all updates appearently and so their Expat is still lacking the needed backport patches. For proof:

# docker pull jenkins/jenkins:lts && docker run --rm -it jenkins/jenkins:lts apt-cache policy libexpat1
lts: Pulling from jenkins/jenkins
Digest: sha256:65cd83179ec68b047ac83875024d1515263ee36362161fe5962f7944071a0f5d
Status: Image is up to date for jenkins/jenkins:lts
docker.io/jenkins/jenkins:lts
libexpat1:
  Installed: 2.2.10-2+deb11u2
  Candidate: 2.2.10-2+deb11u2
  Version table:
 *** 2.2.10-2+deb11u2 100
        100 /var/lib/dpkg/status

I see these ways forward:

hartwork commented 2 years ago

@peterstadler I believe what you just did at https://github.com/TEIC/Jenkins/commit/2cce31196ea427a6b73d2a802abe93394b112d85 worked, this looks promising:

# docker pull teic/jenkins:dev && docker run --rm -it teic/jenkins:dev apt-cache policy libexpat1
dev: Pulling from teic/jenkins
Digest: sha256:eaaeed7cdb0064d39a1a5292f6ce1d559a4686f05038320853b3064f7260488f
Status: Image is up to date for teic/jenkins:dev
docker.io/teic/jenkins:dev
libexpat1:
  Installed: 2.2.10-2+deb11u3
  Candidate: 2.2.10-2+deb11u3
  Version table:
 *** 2.2.10-2+deb11u3 100
        100 /var/lib/dpkg/status
peterstadler commented 2 years ago

Yes, thanks @hartwork for all the support and research!

I updated our Jenkins image in TEIC/Jenkins@2cce311 to explicitly run a apt-get dist-upgrade which brings in the patched libexpat. This updated image is already deployed and the tests seem to work again, see https://jenkins.tei-c.org/view/TEI%20dev/ 🚀