ICLRandD / Blackstone

A spaCy pipeline and model for NLP on unstructured legal text.
https://research.iclr.co.uk
Apache License 2.0

Additional Entity Types & Models #7

Open ICLRandD opened 5 years ago

ICLRandD commented 5 years ago

The prototype Blackstone model, en_blackstone_proto, was trained to detect six entity types that apply generally across legal texts (in the sense that they're not specific to any legal sub-discipline, such as criminal law, company law, etc.).

If you have any ideas for additional entity types that we should consider adding to future models, this is the place to add them.

Preferred method for setting out your ideas

For the sake of consistency, please add comments to this issue in the following format:

ENTITY TYPE:
ENTITY DESCRIPTION:
LEGAL TOPIC:
EXAMPLE:
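
Because every suggestion uses the same four labels, comments in this format are easy to collect programmatically. The following is a minimal sketch, assuming each field sits on a single line; `FIELDS` and `parse_suggestion` are hypothetical names for illustration and are not part of Blackstone.

```python
# Field labels taken from the format above.
FIELDS = ("ENTITY TYPE", "ENTITY DESCRIPTION", "LEGAL TOPIC", "EXAMPLE")


def parse_suggestion(comment):
    """Parse one four-field suggestion into a dict keyed by field label.

    Assumes each field occupies exactly one line; multi-line examples
    are not handled by this sketch.
    """
    parsed = {}
    for line in comment.splitlines():
        # Split on the first colon only, so colons inside the value survive.
        key, sep, value = line.partition(":")
        key = key.strip().upper()
        if sep and key in FIELDS:
            parsed[key] = value.strip()
    missing = [field for field in FIELDS if field not in parsed]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return parsed
```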

For example, if you're submitting an idea for a new entity type that you think would apply generally across legal text (i.e. something that is not specific to any sub-discipline of law), your comment should look like this:

ENTITY TYPE: Law Commission Report
ENTITY DESCRIPTION: Detects mentions of Law Commission Reports
LEGAL TOPIC: General
EXAMPLE: In addition, she considered the Law Commission Report on Contribution (Law Com No 79) (1977), which led to the enactment of the 1978 Act...

If, on the other hand, you're submitting an idea for a new entity type that you think applies to a particular sub-discipline, your comment should look like this:

ENTITY TYPE: Indictment
ENTITY DESCRIPTION: Detects mentions of indictments
LEGAL TOPIC: Criminal law
EXAMPLE: On the other indictment (T20180081) there were included three counts of having an article with a blade or point, contrary to section 139(1) of the Criminal Justice Act 1988.

ICLRandD commented 5 years ago

The following suggestions come courtesy of Pete Smith:

ENTITY TYPE: Counsel (Lawyer...?)
ENTITY DESCRIPTION: Detects mentions of legal representatives
LEGAL TOPIC: General
EXAMPLE: Rumpole, H

ENTITY TYPE: Command paper
ENTITY DESCRIPTION: Detects mention of policy documents
LEGAL TOPIC: General
EXAMPLE: Students at the heart of the system Cm 8122

ENTITY TYPE: Book
ENTITY DESCRIPTION: Detects mention of legal treatise / academic work
LEGAL TOPIC: General
EXAMPLE: Halsbury's Laws of England (5th edition) Volume 99 Taxation Law (2018)

ENTITY TYPE: Treaty International Organisations
ENTITY DESCRIPTION: Detects mention of inter-state organisations
LEGAL TOPIC: General
EXAMPLE: United Nations

ENTITY TYPE: Private International Organisations
ENTITY DESCRIPTION: Detects mention of international organisations of a private nature, but not businesses
LEGAL TOPIC: General
EXAMPLE: FIFA, IBA

ENTITY TYPE: Government department
ENTITY DESCRIPTION: Detects mention of government department
LEGAL TOPIC: General
EXAMPLE: Ministry of Justice

DeNeutoy commented 5 years ago

Bit of advice:

It seems very unlikely to me that a model you train will be able to tell the difference between Private International Organisations and Treaty International Organisations.

Consider that the model has literally zero knowledge about the world and is essentially operating on features extracted from the text only. As an example, IBA is a Private International Org, the IMF is a Treaty International Org and IBM is neither. Generally speaking, it is very difficult for a statistical model to distinguish these cases.

Similar comments apply to the difference between Lawyer and Judge, although I can imagine that they are often referred to with different titles etc, so maybe it is slightly more possible.

In comparison, Book is a great example of a Named Entity which is likely to work well, because there are things common across mentions of books, such as references to pages, publishing dates, editions, and consistent title capitalisation. I don't know enough about what a Command Paper is to know if it is mentioned in a way separate from a book.

Government department also seems like a reasonable NER label, I think.
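
To make the point about Book concrete, here is a rough sketch of the kind of surface cues (editions, volumes, page references, bracketed years) that show up around book mentions. It uses plain regular expressions rather than a trained model; the patterns and the names `BOOK_CUES` and `book_cues` are illustrative assumptions and not anything from Blackstone or spaCy.

```python
import re

# Hypothetical surface cues for book mentions (illustrative patterns only).
BOOK_CUES = {
    "edition": re.compile(r"\b\d+(?:st|nd|rd|th)\s+ed(?:ition|n)?\b", re.IGNORECASE),
    "volume": re.compile(r"\bvol(?:ume|\.)?\s+\d+\b", re.IGNORECASE),
    "page": re.compile(r"\b(?:p|pp)\.?\s*\d+", re.IGNORECASE),
    "year": re.compile(r"\((?:1[6-9]|20)\d{2}\)"),
}


def book_cues(text):
    """Return the sorted names of the cues whose pattern occurs in the text."""
    return sorted(name for name, pattern in BOOK_CUES.items() if pattern.search(text))


print(book_cues("Halsbury's Laws of England (5th edition) Volume 99 Taxation Law (2018)"))
print(book_cues("United Nations"))
```

The Halsbury's example from earlier in this thread trips several cues at once, whereas a bare organisation name such as "United Nations" trips none, which is roughly why the text alone carries more signal for Book than for the organisation types.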

pommedeterresautee commented 5 years ago

At some point the model is able to memorize things, and even if it has zero world knowledge, seeing enough data points is often good enough. For instance, it can learn that the word "Treaty" points to a text and the word "organization" points to an organization, then learn how to use both of them. Moreover, a pretrained language model (spaCy has its own) is a way to get world knowledge.

akatie commented 3 years ago

I would be pleased to volunteer