clj-commons / gloss

speaks in bytes, so you don't have to
Eclipse Public License 1.0

New decode-stream race fix fails tests using `with-profile ci` #62

Closed KingMob closed 1 year ago

KingMob commented 1 year ago

@bo-tato Ran into test failures after merging your fix.

The tests seem to succeed consistently when I run lein do clean, test at the shell, and fail pretty consistently when I run lein with-profile ci do clean, test, which is what CircleCI does. You can see an example here. If I do lein with-profile +ci do clean, test, which adds the ci profile to the default list instead of replacing the default list, it also works, at least on my machine.

Intermittent test failures are the worst. Looking at the ci profile in project.clj, it seems the only differences are the addition of malli (which I was playing with), and setting the JVM target class file version to 1.8. Not sure why ci profile overriding fails, but profile addition succeeds...

Anyway, your help would be appreciated. For now, I can't deploy the new version until we straighten it all out.

bo-tato commented 1 year ago

oh that's tricky, I can't get it to fail on my machine. For me lein with-profile ci do clean, test passes:

Ran 23 tests containing 3944 assertions.
0 failures, 0 errors.

I do recognize from your log that those two tests failing in that way is what happens when you just s/connect-via src f dst without an empty vec at the end, so it seems for some reason an empty vec is not getting appended to src. Maybe with different versions of manifold or clojure, the code in append-empty-vec:

(s/concat [(s/->source src)
           (s/->source [[]])])

behaves differently? like maybe it treats (s/->source [[]]) as just an empty seq rather than a stream with a single empty vec in it? could you try println debugging, adding a (println s) before the (s/put-all! dst s) in f to verify if it's seeing an empty vec?
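
A quick REPL check of that suspicion (a sketch, assuming manifold is on the classpath) would be something like:

(require '[manifold.stream :as s])

;; if ->source treated [[]] as an empty source, this would return ()
;; instead of a single message that is the empty vector
(s/stream->seq (s/->source [[]]))
;; expected: ([])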

bo-tato commented 1 year ago

if it matters I'm on versions:

OpenJDK 64-Bit Server VM Temurin-17.0.5+8 (build 17.0.5+8, mixed mode, sharing)
Clojure CLI version 1.11.1.1208

KingMob commented 1 year ago

Pretty sure I'm on similar versions, so I don't think that's it. I'm seeing family this week, but I'll take a more in-depth look later next week.

KingMob commented 1 year ago

@bo-tato I think I have to revert this for now. Unfortunately, I just don't have any time for the immediate future to look at this with the depth it requires.

bo-tato commented 1 year ago

ok no problem, I can try and look into it if I can reproduce the failing tests on my machine. Do you know what versions of the jdk and clojure and dependencies and everything it fails with?

KingMob commented 1 year ago

Thanks for looking into this again.

I thought the info would be in CircleCI (you can see the jobs, right?), but it's not.

However, you can definitely look at the contents of the Docker image used: docker pull cimg/clojure:1.11.1.

It looks like:

circleci@fa46fea04e46:~/project$ java -version
openjdk version "17.0.2" 2022-01-18
OpenJDK Runtime Environment Temurin-17.0.2+8 (build 17.0.2+8)
OpenJDK 64-Bit Server VM Temurin-17.0.2+8 (build 17.0.2+8, mixed mode, sharing)

circleci@fa46fea04e46:~/project$ clojure --version
Clojure CLI version 1.11.1.1105

deps can be seen by running lein deps :tree

bo-tato commented 1 year ago

hm I still can't reproduce the failing test, I did:

docker pull cimg/clojure:1.11.1
docker run -it --rm cimg/clojure:1.11.1
git clone https://github.com/clj-commons/gloss
cd gloss/
git checkout 662ce60a68fc6741eaec4343a21738bc22106bee
lein test
lein with-profile ci do clean, test

I tried a few times and all the tests pass

bo-tato commented 1 year ago

ok, I do get it failing sometimes now. I'm not a professional programmer and don't have any real experience with CI/JVM/clojure, so I think this is going to be beyond me, but I'll try looking at it some

bo-tato commented 1 year ago

this is bizarre, I edited the project.clj to:

:ci {;;:javac-options ["-target" "1.8" "-source" "1.8"]
     :dependencies [[org.clojure/clojure "1.11.1"]
                    [org.clojure/test.check "1.1.1"]
                    [metosin/malli "0.8.9"]]
     ;:dependencies [[org.clojure/clojure "1.11.1"]
     ;               [metosin/malli "0.8.9"]]
     }

so the ci profile is exactly the same as the dev profile. Now I do:

for i in $(seq 10); do lein with-profile dev do clean, test :only gloss.test.core/test-string; done

The dev profile always passes, while the ci profile still has intermittent failures. Maybe at this point we should ask for help on a clojure forum or slack to see if anyone has an idea what could be going on?

KingMob commented 1 year ago

I'm not a professional programmer and don't have any real experience with CI/JVM/clojure, so I think this is going to be beyond me, but I'll try looking at it some

Don't worry, you're doing great. 😄


Part of the value of Docker is reproducible environments. So if you ran within Docker and still saw no issues, that's suggestive. To 100% rule out the environment, though, try the CircleCI command-line tool: https://circleci.com/docs/how-to-use-the-circleci-local-cli/#run-a-job-in-a-container-on-your-machine. It can run Docker locally in the same manner as their CI servers (environment variables, etc.).

If you do that, and still see no (or almost no) problems, the reason isn't any environmental difference, but a difference in hardware capabilities. Our Docker setup is their free tier for open-source, which means it's way less powerful than people's local machines. This is good in the sense that it detects problems that only appear on weak hardware, but bad in the sense that it's harder to reliably reproduce discrepancies.

There are ways to restrict Docker memory and CPU resources. I haven't done it before, but crippling Docker to more closely match the CI server would be the next step.

Can you:

  1. Test locally with the circleci tool? If it's still not failing often enough, can you then:
  2. Reduce Docker's memory and CPU (https://docs.docker.com/config/containers/resource_constraints/) to match CircleCI's. We're using their Medium resource class, which is 4 GB of RAM and 2 vCPUs. (Technically, vCPUs aren't comparable without also knowing clock speed, hypervisor, total cores, etc., but it's as close as we're going to get.) If that doesn't turn up more errors:
  3. Slash the CPU even further (slashing RAM is more likely to make it crash entirely than turn up our bug). If that doesn't work...
  4. Back to the drawing board.

bo-tato commented 1 year ago

interesting, thanks. It is reproducing fairly often (at least half the time) for me on docker. I just got (un)lucky and the tests passed fine the first couple times I tried, when I initially commented that I couldn't reproduce it in docker. That's why I started running lein test in that for i in $(seq 10) loop, so I still see it fail even if it sometimes randomly passes.

I made some progress but am still thoroughly confused. It seems the reason the test was passing fine with the dev profile and failing with the ci profile, even after I modified the profiles to be identical, is that lein adds metadata to the dev profile: if you add the metadata :ci ^{:pom-scope :test} to the ci profile, then the tests always pass. Also, if I comment out malli from the ci dependencies, the test passes. Trying to find the minimal difference between the ci and default profiles, I found that if you leave malli in the ci dependencies but just add "[org.nrepl/incomplete "0.1.0" :exclusions [org.clojure/clojure]]" to them, the test passes.

I added println debugging to that f function in decode-stream:

(doseq [b (flatten [bytes])]
  (println "buffer: " (String. (.array b))))
(println "bytes buf-seq is: " bytes)
(println "remainder is: " remainder)
(println "remainder as buf-seq is: " (to-buf-seq remainder))
(println "s is: " s)

both in the branch with my attempted fix and in the master branch with the fix reverted. As far as I can tell, f is being called with the same bytes as input and then with [] at the end, but when the test is failing, s shows as nil while it's processing the final [], when you'd expect it to contain [[:a abc]]. Another thing I don't understand is that even in the master branch, with the tests passing, remainder is always nil, which doesn't make sense.
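
For reference, the :pom-scope tweak described above would look roughly like this in project.clj (a sketch, using the same dependency vector as above):

:ci ^{:pom-scope :test} {:dependencies [[org.clojure/clojure "1.11.1"]
                                        [org.clojure/test.check "1.1.1"]
                                        [metosin/malli "0.8.9"]]}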

bo-tato commented 1 year ago

phew, I finally got the test failing in my normal setup with emacs connected to a repl, making it so much easier to deal with. I was going to go crazy if I had to keep making some edit and waiting 30 seconds for lein test to run.

It seems the failing test is with the string "abcxyabcxy" and the delimiters "xy" and "xyz": when it reads the end, read-bytes was not returning abc but instead a codec to keep matching, which seems reasonable, as it doesn't know whether to read a "z" before returning the "abc". The reason it worked before is that in f it does (binding [complete? (s/drained? src)] ..., but my fix messed that up by making a new src with [] concat'd at the end: when it processes the final [], src is not drained yet. If we change it to [complete? (= [] bytes)], the test does always pass. Also, if we set orig-src to be the src before append-empty-vec and change it to [complete? (s/drained? orig-src)], it also works. Those seem kind of hacky and I'm not really sure of the cleanest way to set complete? at the end, but at least it seems I finally tracked down what is going on. Still, why it was only misbehaving in some profiles and with some dependencies I have no clue...
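
To make the failing case concrete, it boils down to roughly this at the REPL (a rough sketch, assuming gloss and manifold are on the classpath; not the exact code of the test-string test, and the exact io calls may need adjusting):

(require '[gloss.core :as g]
         '[gloss.io :as io]
         '[manifold.stream :as s])

;; delimiters where "xy" is a prefix of "xyz", the ambiguous ending described above
(def frame (g/string :utf-8 :delimiters ["xy" "xyz"]))

(let [src (s/->source [(io/to-byte-buffer "abcxyabcxy")])]
  (s/stream->seq (io/decode-stream src frame)))
;; should yield ("abc" "abc"); if complete? is derived from the stream with []
;; concat'd on the end, the trailing "abc" can be withheld while the codec
;; waits to see whether a "z" follows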

bo-tato commented 1 year ago

you can try it at https://github.com/bo-tato/gloss/tree/new-decode-stream-fix, it should pass CI

KingMob commented 1 year ago

Good work! Unfortunately, it's going to take some time for me to look at this, mostly because this is some hairy code, and I really want to set aside the right amount of time to dive into it. My next couple weeks are pretty busy, I'm afraid.

That being said, I fetched your branch, ran the tests a bunch, and at least at first glance, it seems it all works.

Can you think of any more tests to add? Ideally, they should test a different aspect of the problem, and switch from pass to fail after your fix. If not, no worries, I just know Gloss doesn't have a very thorough test suite, and could probably use more.

bo-tato commented 1 year ago

cool, no worries. Yea, it'll take some time to really look into it properly, and it's worth taking the time to look into it well so we don't have to fix it again. And yea, it would be good to have tests that consistently fail for this.

I think the most general case is any codec that can match multiple possibilities, where the input ends on one possibility that is a submatch of a larger possibility (in this case it ended on "xy", which is a substring of "xyz"); it had the potential to fail (not return the shorter frame, as it's waiting to try to match the larger possibility even though the input is over). I've been wanting to learn and play with test.check property-based testing; I think it could randomly generate some codecs with that property to test (roughly like the sketch below), and I think it's already included in gloss as a dev dependency as a result of another test using malli.generator. I also don't have as much free time for learning clojure at the moment, so it might be 1-2 weeks before I do it.
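
Roughly what I have in mind (a rough sketch, not the code in the branch; the generator and property names are made up, and the exact gloss/manifold calls may need adjusting):

(require '[clojure.string :as cstr]
         '[clojure.test.check :as tc]
         '[clojure.test.check.generators :as gen]
         '[clojure.test.check.properties :as prop]
         '[gloss.core :as g]
         '[gloss.io :as io]
         '[manifold.stream :as s])

;; bodies drawn from an alphabet that can't collide with the delimiters
(def gen-body
  (gen/fmap cstr/join
            (gen/not-empty (gen/vector (gen/elements ["a" "b" "c"])))))

;; input that ends on "xy", a prefix of the longer delimiter "xyz"
(def ends-on-short-delimiter
  (prop/for-all [body gen-body]
    (let [frame (g/string :utf-8 :delimiters ["xy" "xyz"])
          src   (s/->source [(io/to-byte-buffer (str body "xy"))])]
      (= [body] (map str (s/stream->seq (io/decode-stream src frame)))))))

(tc/quick-check 100 ends-on-short-delimiter)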

KingMob commented 1 year ago

@bo-tato OK, can you move this into a PR? And if you can make a test that fails with high enough probability, we can add it (like one we can run ten times in the hope it fails at least once).

Still, why it was only misbehaving in some profiles and with some dependencies I have no clue

Race conditions caused by different environmental setups, probably. The nature of logging is altered or something, one thread now takes longer, and all of a sudden a race that was always won the "right" way is now being lost, because it was never safe to begin with.

Also, how do you want to be listed in the CHANGELOG? I can add your actual name, your username, or you can remain anonymous, whatever you prefer. If you're unsure, I suggest actual name.

KingMob commented 1 year ago

@bo-tato Any news on this? Can you make a PR?

bo-tato commented 1 year ago

Hey, I haven't forgotten about this, I've just been busy with work and studies. I'll have time for sure to finish this in June, if not earlier

KingMob commented 1 year ago

That's fine. I'm pretty busy myself. Will check back in a couple months.

bo-tato commented 1 year ago

I tried for a while and at the moment I'm unable to reproduce the failing test on my computer, even in the ci profile, for the branch that should fail sometimes. https://github.com/bo-tato/gloss/tree/new-test-attempt adds some randomly generated tests for the case described in https://github.com/clj-commons/gloss/issues/62#issuecomment-1462551568, but I'm not sure they're worthwhile, as I can't seem to trigger the bug on my computer at the moment. Still, when I was testing it a month ago I did have the ci test reliably failing on my computer, and then with the fix from https://github.com/clj-commons/gloss/issues/62#issuecomment-1454823229 I didn't get any more failures, and I think it makes sense as the cause of the bug. So I merged that branch with the latest commits from master and submitted a pull request if you want to test it some more and check it against CI now

KingMob commented 1 year ago

Closed by #65