adelevie / walverine

Extract case law citations with Node

include match in response object #4

Closed adelevie closed 10 years ago

adelevie commented 10 years ago

Merging @fbennett's pull request https://github.com/adelevie/walverine/pull/3, with a small modification to include the match attribute in the response object from get_citations().

Fixes https://github.com/adelevie/walverine/issues/2

adelevie commented 10 years ago

In my experience so far, matches have been a bit over-inclusive. The excerpt text often runs many sentences past the end of the full citation.

  1. Is there a way to specify excerpt length by number of characters before and after the citation? It would be great then to include this in the output as excerpt.
  2. Is there a way to include just the full citation as written in the input text as part of the output response?
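
To make the two requests concrete, here is a rough sketch of the output I have in mind. It is not walverine code: `excerptAround` is a hypothetical helper, and it assumes the parser can report the character offsets (`start`, `end`) of the full citation in the input text, which it may not do today.

```javascript
// Hypothetical helper, not part of walverine's API.
// Assumes `start` and `end` are character offsets of the full citation in `text`.
function excerptAround(text, start, end, before, after) {
  // Clamp the window so it never runs off either end of the input.
  const from = Math.max(0, start - before);
  const to = Math.min(text.length, end + after);
  return {
    match: text.slice(start, end), // the citation exactly as written
    excerpt: text.slice(from, to)  // the citation plus surrounding context
  };
}

const text =
  "As the Court held in Roe v. Wade, 410 U.S. 113 (1973), the right is not absolute.";
const citation = "Roe v. Wade, 410 U.S. 113 (1973)";
const start = text.indexOf(citation);
const result = excerptAround(text, start, start + citation.length, 10, 5);
console.log(result.match);
console.log(result.excerpt);
```

Here `before` and `after` are character counts, so a caller could ask for, say, 10 characters of leading context and 5 of trailing context instead of whole sentences.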

fbennett commented 10 years ago

The parser works by splitting the document into words, then picking out every reporter element and applying heuristics forward and back from it, snipping out meaningful elements as it goes. The full citation is not grabbed in a single regexp, but you can get any level of precision out of it that you need. The index of the actual start of the plaintiff/defendant string is probably just not being saved off in the citation object. With a bit of tweaking, that can be put right.
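
As a very simplified illustration of that approach (not the actual walverine code), the idea looks roughly like the sketch below. The `REPORTERS` list, the `findCitations` name, and the `wordIndex` field are made up for the example, and the real heuristics look at far more than the volume and page.

```javascript
// Simplified illustration only; the real heuristics are more involved.
// Hypothetical, tiny reporter list used just for this example.
const REPORTERS = new Set(["U.S.", "F.2d", "F.3d"]);

function findCitations(text) {
  // Step 1: split the document into words.
  const words = text.split(/\s+/).filter(Boolean);
  const citations = [];

  words.forEach((word, i) => {
    // Step 2: pick out every reporter element.
    if (!REPORTERS.has(word)) return;

    // Step 3: look backward for the volume and forward for the page,
    // stripping trailing punctuation such as "113," from the page token.
    const volume = words[i - 1] || "";
    const page = (words[i + 1] || "").replace(/[^0-9]+$/, "");

    if (/^\d+$/.test(volume) && /^\d+$/.test(page)) {
      // Saving the word index is what makes it possible to recover
      // where in the input each piece of the citation came from.
      citations.push({ volume, reporter: word, page, wordIndex: i });
    }
  });

  return citations;
}

const found = findCitations("See Roe v. Wade, 410 U.S. 113 (1973).");
console.log(found);
```

Because each element is located by its position in the word list, saving the index where the party names begin would be a matter of recording one more position while scanning backward.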

I'm busy for the next couple of days, and I'll be out of Net contact during the weekend. Should be able to work on fixing it up next week, though.