Closed albertisfu closed 1 year ago
Thanks Alberto.
Two things here:
First, the eyecite report is worrisome, right @flooie? It shouldn't show a bunch of citations being added and removed, right? It also shows a solid performance improvement, which is cool, but the reason we have the eyecite report is to make sure we're not adding/removing things at weird times like this.
Any chance you could try to diagnose that, Alberto?
Second, your dependencies. You say:
I couldn't install it on Mac with ARM
Well, this is expected, and is why hyperscan is an optional dependency that's supposed to just make things faster. If you look at the hyperscan webpage (on Intel.com), you'll see that it says:
Hyperscan is a high performance regular expression matching library from Intel that runs on x86 platforms and offers support for Perl Compatible Regular Expressions (PCRE) syntax
So not working on ARM is no surprise. I am a bit surprised though that it doesn't work on x86 MacOS.
For installing it in Linux, could we do that via Github Actions? They run on Linux, right?
@mlissner - yes - very curious
@mlissner Sure, I'll check why the citations variation happened.
About the dependencies, yeah I can test it via Github actions, the problem I had was generating the poetry.lock file on my machine since it failed. So I'll try it installing an x86 Linux virtual machine and see if it works.
My theory about this, by the way, is that you enabled hyperscan in this PR and it's not enabled on main
, resulting in different tools being used across the branches. Total theory, but I think it makes more sense than an upgrade to hyperscan changing how regexes are interpreted (though that's possible!).
@mlissner that would affect speed only, would it not?
@albertisfu - I've had luck just creating a dockerfile to install and test hyper scan locally. fwiw.
Hm, the underlying hyperscan
C library hadn't upgraded since late 2020 when 5.4.0 came out, but it looks like 0.3.0 of python-hyperscan was released in April of last year and upgraded to that version of hyperscan. I wasn't able to tell which version of hyperscan was used by python-hyperscan before April, 2022 though (some digging here might reveal what the old version of hyperscan was that python-hyperscan used: https://github.com/darvid/python-hyperscan/commit/0527aac060f29413a727e7fbb8db2c456bd25aab.
One thing I saw in the hyperscan change log is:
Introduce chimera hybrid engine of Hyperscan and PCRE, to fully support PCRE syntax as well as to take advantage of the high performance nature of Hyperscan.
That sort of makes it sound like it's changing how hyperscan works, but it's from 2018-07-09, which is a long time ago. OTOH, the changelog of python-hyperscan says that version 0.3.0 (from late last year):
Feature: initial Chimera support (and upgrade to Hyperscan v5.4.0) (0527aac)
So.....maybe that's the root cause?
Maybe python-hyperscan was using a very old version of hyperscan and finally upgraded to a new version that uses Chimera, which brought more regex compatibility/support?
Maybe....
@mlissner that would affect speed only, would it not?
Well, in theory, a regex is a regex, and switching between Python's regex engine and hyperscan's wouldn't make for different results. That seemed to be the result in the past, but maybe it's not anymore. I've seen different regex engines support different parts of the standard. I even remember Firefox and Chrome had different support at one point (probably still do, ugh).
There were 0 gains and 0 losses.
Well, the problem was not hyperscan
I was able to install the latest version on docker using an amd64
image so I removed the fixed version in pyproject.toml.
After multiple tests, the problem seems to be related to reporters-db
, specifically after from 3.2.34
version.
I downgrade reporters-db
here to 3.2.32
; you can see the report seems well.
In this other PR, I set it to 3.2.34
and you can see the report with the citations variation.
I think the problem here was the poetry.lock file was not updated in a while, the last update was on Dec 27, 2022 and reporters-db
version was still 3.2.32 even though 3.2.34 was released on Nov 22.
Since the benchmark for reporters-db
and eyecite
PRs depend on the same workflow (that uses the same poetry.lock
) benchmark reports on reporters-db
PRs didn't detect the problem either.
Maybe including a poetry.lock
update within the benchmark workflow might help to avoid this from happening again.
Let me know what you think.
Nice. Well, that's good that it's just our code that had the bug. Figures, I suppose!
Maybe including a poetry.lock update within the benchmark workflow might help to avoid this from happening again.
Yeah, @flooie do you want to file a bug to take a look at either doing that or just making sure that we upgrade reporters-db when running the eyecite report, so we catch these things more quickly next time? And I guess we need a separate bug for figuring out what happened to the reporter DB that you or Kevin can take care of?
For the record, I think I was wrong about everything on this issue. :)
Ah well...
This PR fixes the issues described in #139 and #2492
fast_diff_match_patch
,exrex
was not related to a dependency problem but a CI cache issue. Seems the cache was broken, to solve it I changed the key removing the date-fixed string since now is possible to remove caches from the Actions tab, it would be easier to just remove old caches instead of changing the cache key.Additionally, I had some problems trying to install
eyecite
on my machine I couldn't install it on Mac with ARM or using a Ubuntu ARM virtual machine. The problem seemshyperscan
, maybe an incompatibility oflibhyperscan-dev
on ARM.So I had to install it on my old x86-64 Mac, the problem that I found here was using the last version of Poetry it tried to install the latest version of hyperscan 0.3.x which failed. So I had to downgrade the version to 0.2.0 in
pyproject.toml
not sure if this is only a problem on mac x86-64 due tohyperscan-dev
version for mac or if it's a widespread problem in other SO as well, I suspect it is, since the 1.2.0rc2 version of poetry (that was fixed in actions) installs by default the 0.2.0 version ofhyperscan
, so after fixed thehyperscan
version I removed the poetry1.2.0rc2
version constraint in CI actions, so now it's using the latest version of poetry.I think it might be worth it if somebody on Linux could try to update to the latest version of
hyperscan
and check if the installation succeeds in case we want to use the latest version ofhyperscan
.