freelawproject / eyecite

Find legal citations in any block of text
https://freelawproject.github.io/eyecite/
BSD 2-Clause "Simplified" License
114 stars 27 forks source link

2492 Fixes eyecite CI #140

Closed albertisfu closed 1 year ago

albertisfu commented 1 year ago

This PR fixes the issues described in #139 and #2492

Additionally, I had some problems trying to install eyecite on my machine I couldn't install it on Mac with ARM or using a Ubuntu ARM virtual machine. The problem seems hyperscan, maybe an incompatibility of libhyperscan-dev on ARM.

So I had to install it on my old x86-64 Mac, the problem that I found here was using the last version of Poetry it tried to install the latest version of hyperscan 0.3.x which failed. So I had to downgrade the version to 0.2.0 in pyproject.toml not sure if this is only a problem on mac x86-64 due to hyperscan-dev version for mac or if it's a widespread problem in other SO as well, I suspect it is, since the 1.2.0rc2 version of poetry (that was fixed in actions) installs by default the 0.2.0 version of hyperscan, so after fixed the hyperscan version I removed the poetry 1.2.0rc2 version constraint in CI actions, so now it's using the latest version of poetry.

I think it might be worth it if somebody on Linux could try to update to the latest version of hyperscan and check if the installation succeeds in case we want to use the latest version of hyperscan.

mlissner commented 1 year ago

Thanks Alberto.

Two things here:

  1. First, the eyecite report is worrisome, right @flooie? It shouldn't show a bunch of citations being added and removed, right? It also shows a solid performance improvement, which is cool, but the reason we have the eyecite report is to make sure we're not adding/removing things at weird times like this.

    Any chance you could try to diagnose that, Alberto?

  2. Second, your dependencies. You say:

    I couldn't install it on Mac with ARM

    Well, this is expected, and is why hyperscan is an optional dependency that's supposed to just make things faster. If you look at the hyperscan webpage (on Intel.com), you'll see that it says:

    Hyperscan is a high performance regular expression matching library from Intel that runs on x86 platforms and offers support for Perl Compatible Regular Expressions (PCRE) syntax

    So not working on ARM is no surprise. I am a bit surprised though that it doesn't work on x86 MacOS.

    For installing it in Linux, could we do that via Github Actions? They run on Linux, right?

flooie commented 1 year ago

@mlissner - yes - very curious

albertisfu commented 1 year ago

@mlissner Sure, I'll check why the citations variation happened.

About the dependencies, yeah I can test it via Github actions, the problem I had was generating the poetry.lock file on my machine since it failed. So I'll try it installing an x86 Linux virtual machine and see if it works.

mlissner commented 1 year ago

My theory about this, by the way, is that you enabled hyperscan in this PR and it's not enabled on main, resulting in different tools being used across the branches. Total theory, but I think it makes more sense than an upgrade to hyperscan changing how regexes are interpreted (though that's possible!).

flooie commented 1 year ago

@mlissner that would affect speed only, would it not?

flooie commented 1 year ago

@albertisfu - I've had luck just creating a dockerfile to install and test hyper scan locally. fwiw.

mlissner commented 1 year ago

Hm, the underlying hyperscan C library hadn't upgraded since late 2020 when 5.4.0 came out, but it looks like 0.3.0 of python-hyperscan was released in April of last year and upgraded to that version of hyperscan. I wasn't able to tell which version of hyperscan was used by python-hyperscan before April, 2022 though (some digging here might reveal what the old version of hyperscan was that python-hyperscan used: https://github.com/darvid/python-hyperscan/commit/0527aac060f29413a727e7fbb8db2c456bd25aab.

One thing I saw in the hyperscan change log is:

Introduce chimera hybrid engine of Hyperscan and PCRE, to fully support PCRE syntax as well as to take advantage of the high performance nature of Hyperscan.

That sort of makes it sound like it's changing how hyperscan works, but it's from 2018-07-09, which is a long time ago. OTOH, the changelog of python-hyperscan says that version 0.3.0 (from late last year):

Feature: initial Chimera support (and upgrade to Hyperscan v5.4.0) (0527aac)

So.....maybe that's the root cause?

Maybe python-hyperscan was using a very old version of hyperscan and finally upgraded to a new version that uses Chimera, which brought more regex compatibility/support?

Maybe....

mlissner commented 1 year ago

@mlissner that would affect speed only, would it not?

Well, in theory, a regex is a regex, and switching between Python's regex engine and hyperscan's wouldn't make for different results. That seemed to be the result in the past, but maybe it's not anymore. I've seen different regex engines support different parts of the standard. I even remember Firefox and Chrome had different support at one point (probably still do, ugh).

github-actions[bot] commented 1 year ago

The Eyecite Report :eye:

Gains and Losses

There were 0 gains and 0 losses.

Click here to see details. | id | Gain | Loss | | ---------- | ------ | ------ |

Time Chart

image

Generated Files

Branch 1 Output Branch 2 Output Full Output CSV

albertisfu commented 1 year ago

Well, the problem was not hyperscan I was able to install the latest version on docker using an amd64 image so I removed the fixed version in pyproject.toml.

After multiple tests, the problem seems to be related to reporters-db, specifically after from 3.2.34 version.
I downgrade reporters-db here to 3.2.32; you can see the report seems well.

In this other PR, I set it to 3.2.34 and you can see the report with the citations variation.

I think the problem here was the poetry.lock file was not updated in a while, the last update was on Dec 27, 2022 and reporters-db version was still 3.2.32 even though 3.2.34 was released on Nov 22.

Since the benchmark for reporters-db and eyecite PRs depend on the same workflow (that uses the same poetry.lock) benchmark reports on reporters-db PRs didn't detect the problem either.

Maybe including a poetry.lock update within the benchmark workflow might help to avoid this from happening again.

Let me know what you think.

mlissner commented 1 year ago

Nice. Well, that's good that it's just our code that had the bug. Figures, I suppose!

Maybe including a poetry.lock update within the benchmark workflow might help to avoid this from happening again.

Yeah, @flooie do you want to file a bug to take a look at either doing that or just making sure that we upgrade reporters-db when running the eyecite report, so we catch these things more quickly next time? And I guess we need a separate bug for figuring out what happened to the reporter DB that you or Kevin can take care of?

mlissner commented 1 year ago

For the record, I think I was wrong about everything on this issue. :)

Ah well...