More tutorial improvements re #95

mattdahl commented 3 years ago

This PR transforms our tutorial into a Jupyter notebook, which we can link to from the README and which should render cleanly within GitHub. Re #95, this change should (1) provide an example of how to use eyecite with a real document containing thousands of real citations and (2) make clear that all of the tutorial code is runnable by the reader.

The pseudocode about passing a custom resolution to resolve_citations() has been removed entirely, as we already explain the same idea here: https://github.com/freelawproject/eyecite#resolving-citations.

mlissner commented 3 years ago

I don't know if you still have energy for this, so just merging since it's another nice improvement, but I had a few questions when reading through this:

1. We mention that the hyperscan regexes are pre-compiled, but never explain how to do that.
2. We frequently say something like, "We need to clean the text, so we run clean(text, [html, whitespace])." These kinds of statements would be better if we explained what the arguments are that we're passing to clean(). Something like, "Clean takes two arguments. The first is the text to be cleaned, and the second is an array of cleaning utilities to run. We have several built-in utilities, such as...."
3. The section around resolving citations is a bit unclear. This bit caught me up: "This opinion contains more than 1000 raw citations, but that doesn't mean that this opinion cites 1000 different cases." Maybe say something like, "This opinion contains more than 1000 citations, but these are not all full citations like '123 XYZ 456'. In addition to these more obvious citations, eyecite will also find short-form citations such as 'id' and 'supra'. So, while there are 1005 citations total, the count of unique opinions cited is far smaller."
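
To make the cleaning and counting points above concrete, here is a minimal standard-library sketch. It is not eyecite's implementation. `CLEANERS`, `clean_sketch`, `count_citations`, and both regexes are hypothetical stand-ins invented for illustration. It only shows the shape of the two ideas: a cleaning function that takes the text plus a list of named steps, and a count in which total citations (full plus short-form) exceeds the number of distinct cases cited.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for built-in cleaning utilities, keyed by name.
CLEANERS: Dict[str, Callable[[str], str]] = {
    "html": lambda s: re.sub(r"<[^>]+>", " ", s),            # replace tags with a space
    "whitespace": lambda s: re.sub(r"\s+", " ", s).strip(),  # collapse whitespace runs
}


def clean_sketch(text: str, steps: List[str]) -> str:
    """Apply the named cleaning steps to `text`, in the order given."""
    for name in steps:
        text = CLEANERS[name](text)
    return text


# Toy patterns (assumptions): a full citation looks like "123 XYZ 456",
# and the only short form recognized here is "Id.".
FULL_CITE = re.compile(r"\b\d+ [A-Z]+ \d+\b")
SHORT_CITE = re.compile(r"\bId\.")


def count_citations(text: str) -> Tuple[int, int]:
    """Return (total citations found, number of distinct full citations)."""
    full = FULL_CITE.findall(text)
    short = SHORT_CITE.findall(text)
    return len(full) + len(short), len(set(full))


raw = "<p>See 123 XYZ 456.  Id. at 460.</p><p>But see 789 ABC 12; 123 XYZ 456.</p>"
text = clean_sketch(raw, ["html", "whitespace"])
total, unique = count_citations(text)
print(text)
print(f"{total} citations, {unique} distinct cases")
```

In this toy input there are four citations (three full, one "Id.") but only two distinct cases, which is the distinction the tutorial wording is trying to get across.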

Anyway, my policy on documentation is to merge any form of progress, so this is merged, but those were the spots I got caught. Is it easy to edit a Python notebook?

mattdahl commented 3 years ago

I still have some energy lol, and I agree that making the second and third change would be good. For the first one, it's my understanding that the pre-compiling happens automatically, right? I.e., the first time the tokenizer is instantiated, it pre-compiles all the regexes and dumps them into the cache folder. Then subsequent calls to get_citations() just use that without having to re-compile anything. I can state this explicitly, but there's nothing the user has to do, right?
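
The build-once-then-reuse behavior described above (which is still unconfirmed at this point in the thread) can be sketched with the standard library alone. This is an illustration of the caching pattern only, not eyecite's or hyperscan's code. `CachedTokenizerSketch`, the `patterns.cache` file name, and the choice to store a combined pattern string are all assumptions made for the example.

```python
import re
import tempfile
from pathlib import Path
from typing import List


class CachedTokenizerSketch:
    """Toy tokenizer that builds its pattern database once and reuses it."""

    def __init__(self, patterns: List[str], cache_dir: str) -> None:
        self.cache_file = Path(cache_dir) / "patterns.cache"  # hypothetical name
        # True only when no cache exists yet, i.e. on the first instantiation.
        self.built_from_scratch = not self.cache_file.exists()
        if self.built_from_scratch:
            # Stand-in for the expensive compile step: combine all patterns
            # into one alternation and dump it into the cache folder.
            combined = "|".join(f"(?:{p})" for p in patterns)
            self.cache_file.write_text(combined)
        # Every instance, first or later, loads from the cache file.
        self.regex = re.compile(self.cache_file.read_text())

    def tokenize(self, text: str) -> List[str]:
        """Return every substring matched by any cached pattern."""
        return self.regex.findall(text)


patterns = [r"\d+ U\.S\. \d+", r"\bId\."]
with tempfile.TemporaryDirectory() as cache_dir:
    first = CachedTokenizerSketch(patterns, cache_dir)   # builds and writes cache
    second = CachedTokenizerSketch(patterns, cache_dir)  # reuses the cache

print(first.built_from_scratch, second.built_from_scratch)
print(second.tokenize("410 U.S. 113. Id. at 120"))
```

The point for the tutorial is that the caller does nothing special: the first instantiation pays the build cost and later ones pick up the cached result.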

mlissner commented 3 years ago

Yeah, I think that's right, but I haven't witnessed it myself yet. If so, we should state it though. Maybe @jcushman can confirm.

More tutorial improvements re #95 #100