Create JSON schemas that capture the details of the sample court docs
The goal is to consolidate all the important data points from the documents and create an exhaustive list of questions that could be asked of the user to start the drafting process of court orders. The information provided by the user here would serve as the basis for all further document generation steps or training steps. The schema must be separated into common data and unique data so that the questionnaire for the user can be structured correctly. #5
Closeness check
Evaluate if the generated documents are consistent with the original flow and structure of the documents. Implement techniques to quantitatively check the closeness between the original and generated documents.
Benchmarking
Test the system on a variety of user responses, LLMs, and operational techniques; identify the chokepoints and assess the quality of output under a spectrum of conditions.
Goals & Mid-Point Milestone
Goals
Structure and Semantics Store
[ ] Create a schema containing generic information common to all the docs
[ ] Create a schema containing information that appears in only some of the docs.
[ ] Create a schema filled with information from the docs to be used as an example during few-shot prompting.
[ ] Create a schema of details that are frozen across the docs so that the user is not asked to fill those in.
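A minimal sketch of how the common/unique/frozen split could be represented and enforced; all field names and the frozen value here are hypothetical placeholders, since the real keys come from the sample documents:

```python
# Hypothetical field names for illustration; the real keys come from the docs.
COMMON_SCHEMA = {  # asked for every court order
    "required": ["case_number", "court_name", "judge_name", "order_date"],
}
UNIQUE_SCHEMA = {  # asked only when the order type needs them
    "required": ["bail_conditions"],
}
FROZEN = {  # never asked; constant across all docs (placeholder value)
    "jurisdiction": "District Court",
}

def missing_fields(record: dict, schema: dict) -> list:
    """Return the required keys absent from a user-supplied record."""
    return [k for k in schema["required"] if k not in record]

record = {"case_number": "123/2023", "court_name": "X", "judge_name": "Y"}
print(missing_fields(record, COMMON_SCHEMA))  # ['order_date']
```

A questionnaire generator would ask only for the missing common fields first, then the unique ones, and merge `FROZEN` in without asking.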
Closeness Evaluation
[ ] Study a variety of techniques suitable for checking the closeness of documents
[ ] Experiment with different techniques and document the results; select the most appropriate algorithm or combination of algorithms.
[ ] Incorporate the closeness check algorithm in the pipeline
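One candidate metric mentioned later in these notes is ROUGE; a self-contained sketch of ROUGE-1 F1 (unigram overlap), assuming simple whitespace tokenisation:

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a generated document."""
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("the court grants bail", "the court denies bail"))  # 0.75
```

In the pipeline, this score could be computed between each original sample order and its regenerated counterpart, with a threshold flagging drafts that drift too far.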
Benchmarking
[ ] Integrate the system with a variety of LLMs, both open-source and closed-source.
[ ] Extensively evaluate the performance of the system across different models quantitatively.
[ ] Test the system's performance against adversarial examples, noisy data, or other forms of input that might cause errors.
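The benchmarking loop sketched below is a hedged illustration: the model names, the `prompt -> text` callable interface, and the scoring function are all placeholders for whichever LLMs and closeness metric are actually wired in.

```python
def benchmark(models: dict, cases: list, score) -> dict:
    """Average closeness score per model.

    `models` maps a model name to a callable prompt -> generated text;
    `cases` is a list of (prompt, reference) pairs;
    `score` maps (reference, generated) -> float.
    """
    results = {}
    for name, generate in models.items():
        scores = [score(ref, generate(prompt)) for prompt, ref in cases]
        results[name] = sum(scores) / len(scores)
    return results

# Mock models standing in for real open- and closed-source LLMs.
models = {
    "echo": lambda p: p,           # returns the prompt verbatim
    "upper": lambda p: p.upper(),  # distorts the casing
}
cases = [("order granted", "order granted")]
exact = lambda ref, gen: 1.0 if ref == gen else 0.0
print(benchmark(models, cases, exact))  # {'echo': 1.0, 'upper': 0.0}
```

Adversarial and noisy inputs would be additional entries in `cases`, so chokepoints show up as per-model score drops on those subsets.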
[ ] [Goals Achieved By Mid-point Milestone]
Setup/Installation
No response
Expected Outcome
JSON schemas for common data points.
JSON schemas for unique data points.
JSON schemas for frozen data points.
Example schema filled with document information for few-shot prompting.
Research on document-closeness evaluation techniques.
Implementation of quantitative assessment of closeness between original and generated documents.
Integration with various LLMs.
Evaluation of system performance across different models and user responses.
Identification of chokepoints and areas for improvement.
Acceptance Criteria
No response
Implementation Details
JSON Schemas as structure-semantics store
Analyze sample court documents to identify common and unique data points.
Design separate schemas for common, unique, and frozen data points.
Structure schemas to facilitate user questionnaires for drafting court orders.
Closeness check
Research various techniques for evaluating document closeness.
Experiment with different techniques to assess effectiveness.
Select and implement the most suitable algorithm or combination of algorithms for quantitative assessment of closeness.
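Because the check must also cover flow and structure (section ordering), not just vocabulary, one simple option is a sequence-alignment ratio over section labels using the standard-library `difflib`; the section names below are hypothetical:

```python
import difflib

def structure_closeness(original_sections: list, generated_sections: list) -> float:
    """Ratio in [0, 1] of how closely the generated section order
    matches the original, via longest-matching-block alignment."""
    return difflib.SequenceMatcher(
        a=original_sections, b=generated_sections).ratio()

# Hypothetical section labels for a court order.
orig = ["header", "facts", "arguments", "decision", "signature"]
gen = ["header", "facts", "decision", "signature"]
print(structure_closeness(orig, gen))  # 8/9 ≈ 0.889
```

This structural score could be combined with a content metric (e.g. ROUGE) into a single weighted closeness figure, with the weights chosen during experimentation.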
Benchmarking
Integrate the system with a range of LLMs, both open source and proprietary.
Design experiments to evaluate system performance across different models and user responses.
Test the system against adversarial examples, noisy data, and other challenging inputs to identify areas for improvement.
Mockups/Wireframes
No response
Product Name
Court judgement drafting
Organisation Name
SamagraX
Domain
Service Delivery
Tech Skills Needed
Machine Learning, Natural Language Processing, Python
Week 1
Documented the discussion around project implementation strategy link
Took up tickets 1 and 3, i.e. 'Document Analysis and Section Building' and 'Closeness Evaluation'.
Researched optimal methods to extract semantics from the documents, especially for Hindi-language documents.
The initial idea for semantics extraction was converting the orders to English and using Named Entity Recognition, Dependency Parsing and Semantic Role Labelling.
While researching solutions, found that LLMs perform very well on the information-retrieval task, even for Hindi documents.
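The LLM-based retrieval described above can be sketched as a schema-guided extraction prompt. The `llm` callable is a stand-in for whichever model is used, and the field names are illustrative only:

```python
import json

def extract_to_schema(llm, document: str, fields: list) -> dict:
    """Ask an LLM to fill the listed schema fields from a document."""
    prompt = (
        "Extract the following fields from the court document below and "
        f"reply with JSON only. Fields: {', '.join(fields)}\n\n{document}"
    )
    return json.loads(llm(prompt))

# Mock LLM standing in for a real model; returns fixed placeholder JSON.
mock_llm = lambda prompt: '{"case_number": "45/2023", "judge_name": "X"}'
print(extract_to_schema(mock_llm, "sample order text",
                        ["case_number", "judge_name"]))
```

With a real model, the `json.loads` call would also need guarding against malformed output (retries or response validation against the schema).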
Week 2
Read about storing retrieved information in a JSON schema. Experimented with this capability and confirmed that LLMs perform quite well on the information-retrieval task.
Read the provided court documents, identified their common components, and created a JSON schema as a baseline format for all the information required to draft a court order.
Tested the same with an LLM; the JSON schema was filled accurately.
Provided an example court order to the LLM and checked how accurately it could generate a draft order; got good results.
Researched optimal measures for the closeness of generated court orders. Read about text-distance metrics, but they are not the best match for our case.
Instead, based on mentor feedback, looked at the possibility of using ROUGE/BLEU, LLM-based evaluation, semantic similarity matching, etc.
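Of the candidates listed, a term-frequency cosine similarity is the cheapest to prototype; note it is only a crude proxy for true semantic matching, which would need embeddings or an LLM judge:

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between whitespace-token frequency vectors."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

print(cosine_similarity("bail is granted", "bail is denied"))  # 2/3 ≈ 0.667
```

Its main weakness for this use case is visible in the example: "granted" and "denied" reverse the order's meaning yet the score stays high, which is why semantic or LLM-based checks remain on the table.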
The provided sample documents used a non-standard font, so considered using OCR to read the Hindi text. Eventually used a converter to Mangal, a standard font, and pushed the converted docs to the repo. #7
Added the JSON schema for the common elements in the docs #7
Week 3
Added the JSON schema for unique details that appear in fewer of the sample docs. #10
Involved in exams; limited bandwidth.
Week 4
Added the fully filled schema to be used for few-shot prompting during document generation.
Added the schema containing details of keys that are frozen across all the docs. #12
Mentor(s)
@ChakshuGautam @GautamR-Samagra
Category
Machine Learning