accordproject / markdown-transform

Parse and transform markdown text, including TemplateMark markdown templates
Apache License 2.0

Detect Clauses in CommonMark DOM #145

Open dselman opened 4 years ago

dselman commented 4 years ago

Users currently have large volumes of existing legal contracts, containing a wide variety of legal clauses. E.g. the contracts may contain 90 day payment terms, acceptance of delivery clauses etc.

It would be very useful to be able to train a model to detect instances of a specific type of clause (say acceptance of delivery) and then to automatically (or with manual guidance) help the user replace the natural language in their contract with an instance of the acceptance of delivery clause, with the correct values extracted from the input contract.

The template detection logic should operate upon the CommonMark DOM (a JSON representation of Markdown formatted text) - replacing paragraph nodes in the DOM with Clause nodes from the CiceroMark DOM.

It would be useful to be able to fuzzy match a CommonMark DOM against a set of templates, detecting potential instances of the templates (clauses) in the CommonMark text.

  1. prepareTemplates : pre-process a set of templates to prepare them for matching (works on a training set?)
  2. match : fuzzy match of templates to CommonMark DOM, extract values
  3. uniqueMatches : eliminate duplicate matches, eliminate overlapping matches
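
The three steps above could be sketched as follows. This is only a rough illustration, assuming plain character-level similarity from Python's standard `difflib` as the fuzzy matcher and a list of paragraph strings as input; the `Match` type, the `threshold` and `max_span` parameters and the function bodies are hypothetical, not an existing markdown-transform API.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Match:
    template: str  # template name (hypothetical identifier)
    start: int     # index of the first matched paragraph
    end: int       # index one past the last matched paragraph
    score: float   # similarity in [0, 1]


def normalize(text):
    """Lower-case and collapse whitespace so layout differences do not affect the score."""
    return " ".join(text.lower().split())


def prepare_templates(templates):
    """Step 1: pre-process the sample text of each template (here: normalization only)."""
    return {name: normalize(sample) for name, sample in templates.items()}


def match(prepared, paragraphs, threshold=0.6, max_span=3):
    """Step 2: score every window of up to max_span consecutive paragraphs against every template."""
    found = []
    for name, sample in prepared.items():
        for start in range(len(paragraphs)):
            last = min(start + max_span, len(paragraphs))
            for end in range(start + 1, last + 1):
                window = normalize(" ".join(paragraphs[start:end]))
                score = SequenceMatcher(None, sample, window, autojunk=False).ratio()
                if score >= threshold:
                    found.append(Match(name, start, end, score))
    return found


def unique_matches(matches):
    """Step 3: keep the best-scoring matches, dropping any that overlap one already kept."""
    kept = []
    for m in sorted(matches, key=lambda m: m.score, reverse=True):
        if all(m.end <= k.start or m.start >= k.end for k in kept):
            kept.append(m)
    return sorted(kept, key=lambda m: m.start)
```

Calling `unique_matches(match(prepare_templates(templates), paragraphs))` returns non-overlapping candidates in document order. Value extraction is left out here, and a trained model could replace the `SequenceMatcher` score without changing the overall shape.
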
jeromesimeon commented 4 years ago

Should this be flagged as a 1.0 feature?

sunn-e commented 4 years ago

My queries are as follows:

  1. Are we trying to detect what kind of Accord contract is in an arbitrary text file (.txt, .pdf, .md, etc.) submitted by a user? That would mean building an NER model trained on all the Accord contract templates mentioned here. The model would detect which kinds of contract are present; there can be zero or more.
  2. Are we trying to separate the Accord project template from raw text input?
  3. Can you please show some example of input and output that the project expects?
sunn-e commented 4 years ago

Is fuzzy matching the best technique so far? What are your views? The Manhattan LSTM architecture has also been used for similar tasks. Semantic Textual Similarity could be our priority since it covers the cross-lingual aspect.

dselman commented 4 years ago
  1. The input would be a CommonMark DOM (JSON data structure).
  2. For a given AP template, e.g. acceptance-of-delivery (https://templates.accordproject.org/acceptance-of-delivery@0.13.1.html), we would have to assemble a training data set of alternative legal text for this clause (perhaps from https://www.lawinsider.com/clause/delivery-and-acceptance). We would then have to train a model to classify incoming text as "acceptance of delivery" and (the hard part?) identify the specific variables used in the template:
"shipper": "Party A",
"receiver": "Party B",
"deliverable": "Widgets",
"businessDays": 10,
"attachment": "Attachment X"

If the model has classified the input paragraphs as "acceptance of delivery" with high probability, and has identified the variables with high probability, then we can use Cicero draft to convert the JSON values for the variables back into legal text, inserting a CiceroMark clause node into the DOM in place of the input paragraph nodes.
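
To make the variable-identification step concrete, here is a minimal sketch that handles only the easy case, where the contract text follows the template wording exactly. It is plain pattern matching, not a trained model. The `{{name}}` placeholder syntax follows Accord Project template grammars, but the `TEMPLATE` fragment below is a simplified, illustrative excerpt and both function names are hypothetical.

```python
import re

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")

# Simplified, illustrative fragment of an acceptance-of-delivery grammar (assumption).
TEMPLATE = (
    "{{shipper}} will be deemed to have completed its delivery obligations "
    "if in {{receiver}}'s opinion, the {{deliverable}} satisfies the "
    "Acceptance Criteria, and {{receiver}} notifies {{shipper}} in writing "
    "that it is accepting the {{deliverable}}."
)


def template_to_regex(template):
    """Compile literal text with {{name}} placeholders into a regex with named groups."""
    parts, seen, pos = [], set(), 0
    for m in PLACEHOLDER.finditer(template):
        parts.append(re.escape(template[pos:m.start()]))
        name = m.group(1)
        if name in seen:
            # A repeated variable must take the same value as its first occurrence.
            parts.append(f"(?P={name})")
        else:
            seen.add(name)
            parts.append(f"(?P<{name}>.+?)")
        pos = m.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("".join(parts))


def extract_variables(template, text):
    """Return {variable: value} if text matches the template wording, else None."""
    flat = " ".join(text.split())  # soft line breaks become single spaces
    m = template_to_regex(template).fullmatch(flat)
    if m is None:
        return None
    return {name: value.strip('"') for name, value in m.groupdict().items()}
```

For alternative wordings of the same clause this returns `None`, which is exactly where the classifier and a learned extractor would have to take over.
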

aod.md.txt

npm install -g @accordproject/markdown-cli
markus parse --sample ~/Desktop/aod.md

Output (CommonMark):

{
  "$class": "org.accordproject.commonmark.Document",
  "xmlns": "http://commonmark.org/xml/1.0",
  "nodes": [
    {
      "$class": "org.accordproject.commonmark.Paragraph",
      "nodes": [
        {
          "$class": "org.accordproject.commonmark.Text",
          "text": "This is some text."
        }
      ]
    }, 
    {
      "$class": "org.accordproject.commonmark.Paragraph",
      "nodes": [
        {
          "$class": "org.accordproject.commonmark.Text",
          "text": "This is more."
        }
      ]
    }, 
    {
      "$class": "org.accordproject.commonmark.Heading",
      "level": "2",
      "nodes": [
        {
          "$class": "org.accordproject.commonmark.Text",
          "text": "Acceptance of Delivery."
        }
      ]
    }, 
    {
      "$class": "org.accordproject.commonmark.Paragraph",
      "nodes": [
        {
          "$class": "org.accordproject.commonmark.Text",
          "text": "\"Party A\" will be deemed to have completed its delivery obligations"
        },
        {
          "$class": "org.accordproject.commonmark.Softbreak"
        },
        {
          "$class": "org.accordproject.commonmark.Text",
          "text": "if in \"Party B\"'s opinion, the \"Widgets\" satisfies the"
        },
        {
          "$class": "org.accordproject.commonmark.Softbreak"
        },
        {
          "$class": "org.accordproject.commonmark.Text",
          "text": "Acceptance Criteria, and \"Party B\" notifies \"Party A\" in writing"
        },
        {
          "$class": "org.accordproject.commonmark.Softbreak"
        },
        {
          "$class": "org.accordproject.commonmark.Text",
          "text": "that it is accepting the \"Widgets\"."
        }
      ]
    }, 
    {
      "$class": "org.accordproject.commonmark.Heading",
      "level": "2",
      "nodes": [
        {
          "$class": "org.accordproject.commonmark.Text",
          "text": "Inspection and Notice."
        }
      ]
    }, 
    {
      "$class": "org.accordproject.commonmark.Paragraph",
      "nodes": [
        {
          "$class": "org.accordproject.commonmark.Text",
          "text": "\"Party B\" will have 10 Business Days to inspect and"
        },
        {
          "$class": "org.accordproject.commonmark.Softbreak"
        },
        {
          "$class": "org.accordproject.commonmark.Text",
          "text": "evaluate the \"Widgets\" on the delivery date before notifying"
        },
        {
          "$class": "org.accordproject.commonmark.Softbreak"
        },
        {
          "$class": "org.accordproject.commonmark.Text",
          "text": "\"Party A\" that it is either accepting or rejecting the"
        },
        {
          "$class": "org.accordproject.commonmark.Softbreak"
        },
        {
          "$class": "org.accordproject.commonmark.Text",
          "text": "\"Widgets\"."
        }
      ]
    }, 
    {
      "$class": "org.accordproject.commonmark.Heading",
      "level": "2",
      "nodes": [
        {
          "$class": "org.accordproject.commonmark.Text",
          "text": "Acceptance Criteria."
        }
      ]
    }, 
    {
      "$class": "org.accordproject.commonmark.Paragraph",
      "nodes": [
        {
          "$class": "org.accordproject.commonmark.Text",
          "text": "The \"Acceptance Criteria\" are the specifications the \"Widgets\""
        },
        {
          "$class": "org.accordproject.commonmark.Softbreak"
        },
        {
          "$class": "org.accordproject.commonmark.Text",
          "text": "must meet for the \"Party A\" to comply with its requirements and"
        },
        {
          "$class": "org.accordproject.commonmark.Softbreak"
        },
        {
          "$class": "org.accordproject.commonmark.Text",
          "text": "obligations under this agreement, detailed in \"Attachment X\", attached"
        }, 
        {
          "$class": "org.accordproject.commonmark.Softbreak"
        }, 
        {
          "$class": "org.accordproject.commonmark.Text",
          "text": "to this agreement."
        }
      ]
    }, 
    {
      "$class": "org.accordproject.commonmark.Paragraph",
      "nodes": [
        {
          "$class": "org.accordproject.commonmark.Text",
          "text": "This is even more."
        }
      ]
    }
  ]
}
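
Before any matching can happen, the text has to be pulled back out of a DOM like the one above. A minimal sketch, assuming the JSON has been loaded into a Python dict; `node_text` and `block_texts` are hypothetical helper names, and a soft break is treated as a single space:

```python
def node_text(node):
    """Recursively concatenate the text carried by a CommonMark DOM node."""
    cls = node.get("$class", "")
    if cls.endswith(".Text"):
        return node.get("text", "")
    if cls.endswith(".Softbreak"):
        return " "  # a soft line break separates words but carries no meaning
    return "".join(node_text(child) for child in node.get("nodes", []))


def block_texts(document):
    """Return (index, short class name, text) for each top-level block of the document."""
    return [
        (i, node["$class"].rsplit(".", 1)[-1], node_text(node))
        for i, node in enumerate(document["nodes"])
    ]
```

Keeping the block index alongside the text means a later match can be mapped back to the exact nodes that need to be replaced.
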

The CiceroMark representation (output). Note the org.accordproject.ciceromark.Clause node.

test.md.txt

markus parse --sample ~/Downloads/ciceromark.md --cicero
{
  "$class": "org.accordproject.commonmark.Document",
  "xmlns": "http://commonmark.org/xml/1.0",
  "nodes": [
    {
      "$class": "org.accordproject.commonmark.Paragraph",
      "nodes": [
        {
          "$class": "org.accordproject.commonmark.Text",
          "text": "This is some text."
        }
      ]
    }, 
    {
      "$class": "org.accordproject.commonmark.Paragraph",
      "nodes": [
        {
          "$class": "org.accordproject.commonmark.Text",
          "text": "This is more."
        }
      ]
    }, 
    {
      "$class": "org.accordproject.ciceromark.Clause",
      "clauseid": "6b32cc75-40ed-445f-a3de-1bc37392d232",
      "src": "ap://acceptance-of-delivery@0.13.1#b5505785f4de9000be15687601d869cb1719df2f482cd57f3bf4fbd6774127bc",
      "nodes": [
        {
          "$class": "org.accordproject.commonmark.Heading",
          "level": "2",
          "nodes": [
            {
              "$class": "org.accordproject.commonmark.Text",
              "text": "Acceptance of Delivery."
            }
          ]
        }, 
        {
          "$class": "org.accordproject.commonmark.Paragraph",
          "nodes": [
            {
              "$class": "org.accordproject.commonmark.Text",
              "text": "\"Party A\" will be deemed to have completed its delivery obligations"
            },
            {
              "$class": "org.accordproject.commonmark.Softbreak"
            },
            {
              "$class": "org.accordproject.commonmark.Text",
              "text": "if in \"Party B\"'s opinion, the \"Widgets\" satisfies the"
            },
            {
              "$class": "org.accordproject.commonmark.Softbreak"
            },
            {
              "$class": "org.accordproject.commonmark.Text",
              "text": "Acceptance Criteria, and \"Party B\" notifies \"Party A\" in writing"
            },
            {
              "$class": "org.accordproject.commonmark.Softbreak"
            },
            {
              "$class": "org.accordproject.commonmark.Text",
              "text": "that it is accepting the \"Widgets\"."
            }
          ]
        }, 
        {
          "$class": "org.accordproject.commonmark.Heading",
          "level": "2",
          "nodes": [
            {
              "$class": "org.accordproject.commonmark.Text",
              "text": "Inspection and Notice."
            }
          ]
        }, 
        {
          "$class": "org.accordproject.commonmark.Paragraph",
          "nodes": [
            {
              "$class": "org.accordproject.commonmark.Text",
              "text": "\"Party B\" will have 10 Business Days to inspect and"
            },
            {
              "$class": "org.accordproject.commonmark.Softbreak"
            },
            {
              "$class": "org.accordproject.commonmark.Text",
              "text": "evaluate the \"Widgets\" on the delivery date before notifying"
            },
            {
              "$class": "org.accordproject.commonmark.Softbreak"
            },
            {
              "$class": "org.accordproject.commonmark.Text",
              "text": "\"Party A\" that it is either accepting or rejecting the"
            },
            {
              "$class": "org.accordproject.commonmark.Softbreak"
            },
            {
              "$class": "org.accordproject.commonmark.Text",
              "text": "\"Widgets\"."
            }
          ]
        }, 
        {
          "$class": "org.accordproject.commonmark.Heading",
          "level": "2",
          "nodes": [
            {
              "$class": "org.accordproject.commonmark.Text",
              "text": "Acceptance Criteria."
            }
          ]
        }, 
        {
          "$class": "org.accordproject.commonmark.Paragraph",
          "nodes": [
            {
              "$class": "org.accordproject.commonmark.Text",
              "text": "The \"Acceptance Criteria\" are the specifications the \"Widgets\""
            },
            {
              "$class": "org.accordproject.commonmark.Softbreak"
            },
            {
              "$class": "org.accordproject.commonmark.Text",
              "text": "must meet for the \"Party A\" to comply with its requirements and"
            },
            {
              "$class": "org.accordproject.commonmark.Softbreak"
            },
            {
              "$class": "org.accordproject.commonmark.Text",
              "text": "obligations under this agreement, detailed in \"Attachment X\", attached"
            }, 
            {
              "$class": "org.accordproject.commonmark.Softbreak"
            }, 
            {
              "$class": "org.accordproject.commonmark.Text",
              "text": "to this agreement."
            }
          ]
        }
      ]
    }, 
    {
      "$class": "org.accordproject.commonmark.Paragraph",
      "nodes": [
        {
          "$class": "org.accordproject.commonmark.Text",
          "text": "This is even more."
        }
      ]
    }
  ]
}
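
The difference between the two documents is purely structural: a run of top-level nodes ends up nested inside a Clause node. A minimal sketch of that rewrite on the JSON loaded as a Python dict, assuming the matched range is already known. `wrap_in_clause` is a hypothetical helper; the `clauseid`, `src` and `nodes` fields mirror the output above, and a random UUID stands in for the clause id. In the full flow the nested nodes would come from Cicero draft rather than being copied from the input.

```python
import copy
import uuid


def wrap_in_clause(document, start, end, src):
    """Return a copy of the document with nodes[start:end] nested inside a Clause node."""
    result = copy.deepcopy(document)  # leave the input DOM untouched
    nodes = result["nodes"]
    clause = {
        "$class": "org.accordproject.ciceromark.Clause",
        "clauseid": str(uuid.uuid4()),
        "src": src,
        "nodes": nodes[start:end],
    }
    result["nodes"] = nodes[:start] + [clause] + nodes[end:]
    return result
```

For the sample above, the three headings and their paragraphs are top-level nodes 2 to 7, so `wrap_in_clause(doc, 2, 8, src)` would produce the four-node document shown.
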
dselman commented 4 years ago

> Is fuzzy matching the best technique so far? What are your views? The Manhattan LSTM architecture has also been used for similar tasks. Semantic Textual Similarity could be our priority since it covers the cross-lingual aspect.

One of the goals should be to explore some different options (literature review?) and measure accuracy.

sunn-e commented 4 years ago

Thanks for the comments. I'm looking into different options and benchmarks to see how they perform, i.e. the literature review you mentioned. I will need some time to do that. Meanwhile, feel free to suggest any papers you come across that you think could be of use. This is a very interesting problem. It may help us create a novel algorithm, perhaps even a paper. As I mentioned previously, I'm also working on a paper that is somewhat related to this. This is going to be fun.

sunn-e commented 4 years ago

Question: Is there any restriction on the tech stack I can use?

irmerk commented 4 years ago

@sunn-e I doubt there would be. I'd suggest raising it with the community, maybe on Slack, if it's not something we currently use. Still, I doubt it would be an issue...

jeromesimeon commented 4 years ago

> Question: Is there any restriction on the tech stack I can use?

It has to be open source (I think!).

jeromesimeon commented 4 years ago

> Question: Is there any restriction on the tech stack I can use?
>
> It has to be open source (I think!).

I would add: multi-platform. I don't think we should go for something that is, e.g., Windows-only.

adityak2920 commented 4 years ago

Will the organisation provide computing resources for training models related to document classification and NER?

jeromesimeon commented 4 years ago

> Will the organisation provide computing resources for training models related to document classification and NER?

Good question! @dselman @adriaan-pelzer any thoughts?

dselman commented 4 years ago

Yes, within reason, we can provision VMs on AWS to help with training.