achen353 / TransformerSum

BERT-based extractive summarizer for long legal documents using a divide-and-conquer approach
GNU General Public License v3.0

DANCER Integration (PART 1) #22

Closed achen353 closed 2 years ago

achen353 commented 2 years ago

Context

#19

Summary

Relevant Technical Choices

Sample Processed Data

{'src': 
 [
    [
        ['Section', '219', 'of', 'the', 'Water', 'Resources', 'Development', 'Act', 'of', '1992', 'is', 'amended', 'in', 'subsection', '(', 'c', ')', ',', 'by', 'striking', 'paragraph', '(', '5', ')', 'and', 'inserting', 'the', 'following', ':', 'Jackson', 'county', ',', 'mississippi', '.'],
        ['Provision', 'of', 'an', 'alternative', 'water', 'supply', 'and', 'a', 'project', 'for', 'the', 'elimination', 'or', 'control', 'of', 'combined', 'sewer', 'overflows', 'for', 'Jackson', 'County', ',', 'Mississippi', '.', '"', '.'], 
        ['And', 'in', 'subsection', '(', 'e)(1', ')', ',', 'by', 'striking', '"', '$', '10,000,000', '"', 'and', 'inserting', '"', '$', '20,000,000', '"', '.'],
        ['Section', '219(e)(3', ')', 'of', 'the', 'Water', 'Resources', 'Development', 'Act', 'of', '1992', 'is', 'amended', 'by', 'striking', '"', '$', '10,000,000', '"', 'and', 'inserting', '"', '$', '20,000,000', '"', '.'],
        ['Section', '219(f)(1', ')', 'of', 'the', 'Water', 'Resources', 'Development', 'Act', 'of', '1992', 'is', 'amended', 'by', 'striking', '"', '$', '25,000,000', 'for', '"', '.'], 
        ['Paterson', ',', 'Passaic', 'County', ',', 'and', 'Passaic', 'Valley', ',', 'New', 'Jersey', '.'], 
        ['Section', '219(f)(2', ')', 'of', 'the', 'Water', 'Resources', 'Development', 'Act', 'of', '1992', 'is', 'amended', 'by', 'striking', '"', '$', '20,000,000', 'for', '"', '.'], 
        ['Elizabeth', 'and', 'North', 'Hudson', ',', 'New', 'Jersey', '.'], 
        ['Section', '219(f', ')', 'of', 'the', 'Water', 'Resources', 'Development', 'Act', 'of', '1992', 'is', 'amended', 'in', 'paragraph', '(', '33', ')', ',', 'by', 'striking', '"', '$', '20,000,000', '"', 'and', 'inserting', '"', '$', '10,000,000', '"', ',', 'and', 'in', 'paragraph', '(', '34', ')', 'by', 'striking', '"', '$', '10,000,000', '"', 'and', 'inserting', '"', '$', '20,000,000', '"', '.'], 
        ['And', 'by', 'striking', '"', 'in', 'the', 'city', 'of', 'North', 'Hudson', '"', 'and', 'inserting', '"', 'for', 'the', 'North', 'Hudson', 'Sewerage', 'Authority', '"', '.']
    ], 
    [
        ['UPPER', 'MISSISSIPPI', 'RIVER', 'ENVIRONMENTAL', 'MANAGEMENT', 'PROGRAM', '.'], 
        ['Section', '1103(e)(5', ')', 'of', 'the', 'Water', 'Resources', 'Development', 'Act', 'of', '1986', '(', '33', 'USC', '.', '652(e)(5', ')', ')'], 
        ['(', 'as', 'amended', 'by', 'section', '509(c)(3', ')', 'of', 'the', 'Water', 'Resources', 'Development', 'Act', 'of', '1999', 'is', 'amended', 'by', 'striking', '"', 'paragraph', '(', '1)(A)(i', ')', '"', 'and', 'inserting', '"', 'paragraph', '(', '1)(B', ')', '"', '.']
    ], 
    [
        ['Section', '371', 'of', 'the', 'Water', 'Resources', 'Development', 'Act', 'of', '1999', 'is', 'amended', 'by', 'inserting', '"', '(', 'a', ')', 'In', 'General', '.'], 
        ['And', 'by', 'adding', 'at', 'the', 'end', 'the', 'following', ':', 'Crediting', 'of', 'Reduction', 'in', 'Non', '-', 'Federal', 'Share', '.'],
        ['The', 'project', 'cooperation', 'agreement', 'for', 'the', 'Comite', 'River', 'Diversion', 'Project', 'shall', 'include', 'a', 'provision', 'that', 'specifies', 'that', 'any', 'reduction', 'in', 'the', 'non-', 'Federal', 'share', 'that', 'results', 'from', 'the', 'modification', 'under', 'subsection', '(', 'a', ')', 'shall', 'be', 'credited', 'toward', 'the', 'share', 'of', 'project', 'costs', 'to', 'be', 'paid', 'by', 'the', 'Amite', 'River', 'Basin', 'Drainage', 'and', 'Water', 'Conservation', 'District', '.', '"', '.']
    ], 
    [
        ['CONTINUATION', 'OF', 'SUBMISSION', 'OF', 'CERTAIN', 'REPORTS', 'BY', 'THE', 'SECRETARY', 'OF', 'THE', 'ARMY', '.'],        
        ['Recommendations', 'of', 'Inland', 'Waterways', 'Users', 'Board', '.'],
        ['Section', '302(b', ')', 'of', 'the', 'Water', 'Resources', 'Development', 'Act', 'of', '1986', '(', '33', 'USC', '.', '2251(b', ')', ')', 'is', 'amended', 'in', 'the', 'last', 'sentence', 'by', 'striking', '"', 'The', '"', 'and', 'inserting', '"', 'Notwithstanding', 'section', '3003', 'of', 'Public', 'Law', '104', '-', '66', ',', 'the', '"', '.'],
        ['List', 'of', 'Authorized', 'but', 'Unfunded', 'Studies', '.'],
        ['Section', '710(a', ')', 'of', 'the', 'Water', 'Resources', 'Development', 'Act', 'of', '1986', '(', '33', 'USC', '.', '2264(a', ')', ')', 'is', 'amended', 'in', 'the', 'first', 'sentence', 'by', 'striking', '"', 'Not', '"', 'and', 'inserting', '"', 'Notwithstanding', 'section', '3003', 'of', 'Public', 'Law', '104', '-', '66', ',', 'not', '"', '.'],
        ['Reports', 'on', 'Participation', 'of', 'Minority', 'Groups', 'and', 'Minority', '-', 'Owned', 'Firms', 'in', 'Mississippi', 'River', '-', 'Gulf', 'Outlet', 'Feature', '.'], 
        ['Section', '844(b', ')', 'of', 'the', 'Water', 'Resources', 'Development', 'Act', 'of', '1986', 'is', 'amended', 'in', 'the', 'second', 'sentence', 'by', 'striking', '"', 'The', '"', 'and', 'inserting', '"', 'Notwithstanding', 'section', '3003', 'of', 'Public', 'Law', '104', '-', '66', ',', 'the', '"', '.'], 
        ['List', 'of', 'Authorized', 'but', 'Unfunded', 'Projects', '.'],
        ['Section', '1001(b)(2', ')', 'of', 'the', 'Water', 'Resources', 'Development', 'Act', 'of', '1986', '(', '33', 'USC', '.', '579a(b)(2', ')', ')', 'is', 'amended', 'in', 'the', 'first', 'sentence', 'by', 'striking', '"', 'Every', '"', 'and', 'inserting', '"', 'Notwithstanding', 'section', '3003', 'of', 'Public', 'Law', '104', '-', '66', ',', 'every', '"', '.']
    ], 
    [
        ['AUTHORIZATIONS', 'FOR', 'PROGRAM', 'PREVIOUSLY', 'AND', 'CURRENTLY', 'FUNDED', '.'], 
        ['The', 'program', 'described', 'in', 'subsection', '(', 'c', ')', 'is', 'hereby', 'authorized', '.'], 
        ['Funds', 'are', 'hereby', 'authorized', 'to', 'be', 'appropriated', 'for', 'the', 'Department', 'of', 'Transportation', 'for', 'the', 'program', 'authorized', 'in', 'subsection', '(', 'a', ')', 'in', 'amounts', 'as', 'follows', ':', 'Fiscal', 'year', '2000', '.'], 
        ['For', 'fiscal', 'year', '2000', ',', '$', '10,000,000', '.'], 
        ['For', 'fiscal', 'year', '2001', ',', '$', '10,000,000', '.'], 
        ['For', 'fiscal', 'year', '2002', ',', '$', '7,000,000', '.'], 
        ['The', 'program', 'referred', 'to', 'in', 'subsection', '(', 'a', ')', 'is', 'the', 'program', 'for', 'which', 'funds', 'appropriated', 'in', 'title', 'I', 'of', 'Public', 'Law', '106-', '69', 'under', 'the', 'heading', '"', 'FEDERAL', 'RAILROAD', 'ADMINISTRATION', '"', 'are', 'available', 'for', 'obligation', 'upon', 'the', 'enactment', 'of', 'legislation', 'authorizing', 'the', 'program', '.'],
        ['Speaker', 'of', 'the', 'House', 'of', 'Representatives', '.'],
        ['Vice', 'President', 'of', 'the', 'United', 'States', 'and', 'President', 'of', 'the', 'Senate', '.']
    ]
], 
'labels': 
    [
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
        [0, 0, 0], 
        [0, 0, 1], 
        [0, 0, 0, 0, 0, 0, 0, 0, 0], 
        [0, 0, 0, 0, 0, 0, 1, 0, 1]
    ], 
'tgt': 
    "Amends the Water Resources Development Act of 1999 to : ( 1 ) authorize appropriations for FY 1999 through 2009 for implementation of a long - term resource monitoring program with respect to the Upper Mississippi River Environmental Management Program .<q>( 2 ) authorize the Secretary of the Army to carry out modifications to the navigation project for the Delaware River , Pennsylvania and Delaware , if such project as modified is technically sound , environmentally acceptable , and economically justified .<q>( 3 ) subject certain previously deauthorized water resources development projects to the seven - year limitation governing project deauthorizations under the Act , with the exception of such a project for Indian River County , Florida .<q>( 4 ) except from a certain schedule of the non - Federal cost of the periodic nourishment of shore protection projects constructed after December 31 , 1999 , those projects for which a District Engineer 's Report has been completed by such date .<q>( 5 ) require that the project cooperation agreement for the Comite River Diversion Project for flood control include a provision that specifies that any reduction in the non - Federal share that results from certain modifications be credited toward the share of project costs to be paid by the Amite River Basin Drainage and Water Conservation District .<q>( 6 ) allow the Secretary to provide additional compensation to Chesapeake City , Maryland for damage to its water supply resulting from the Chesapeake and Delaware Canal Project .<q>( 7 ) provide for the submission of certain reports on water resources development projects by the Secretary , notwithstanding Federal reporting termination provisions .<q>And ( 8) authorize and provide for an authorization of appropriations for the existing program for the safety and operations expenses of the Federal Railroad Administration , and make available for obligation funds currently appropriated for such program ."
}
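The structure above can be sanity-checked with a short sketch. It assumes, based only on this sample, that `src` is a list of sections, each section is a list of tokenized sentences, `labels` holds one 0/1 flag per sentence in the matching section, and `tgt` joins summary sentences with `<q>`. The helper names `check_example` and `split_target` are hypothetical and not part of the repository.

```python
from typing import Dict, List


def check_example(example: Dict) -> List[List[str]]:
    """Validate one processed example and return, per section,
    the sentences whose label is 1 (joined back into strings)."""
    src, labels = example["src"], example["labels"]
    if len(src) != len(labels):
        raise ValueError("src and labels must contain the same number of sections")

    selected = []
    for i, (sentences, section_labels) in enumerate(zip(src, labels)):
        if len(sentences) != len(section_labels):
            raise ValueError(
                f"section {i}: {len(sentences)} sentences but "
                f"{len(section_labels)} labels"
            )
        selected.append(
            [" ".join(tokens)
             for tokens, label in zip(sentences, section_labels)
             if label == 1]
        )
    return selected


def split_target(tgt: str) -> List[str]:
    """Split the reference summary on the '<q>' sentence separator."""
    return tgt.split("<q>")


# Toy example with the same shape as the sample (not real data).
toy = {
    "src": [[["A", "b", "."], ["C", "."]], [["D", "."]]],
    "labels": [[0, 1], [0]],
    "tgt": "C .<q>E .",
}
print(check_example(toy))
print(split_target(toy["tgt"]))
```

Running the same check over every example is one cheap way to catch a mismatch between sentence counts and label counts before training.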

Note

Some parameters still need tuning after this PR is merged. For example, most of the labels are 0, and this imbalance would affect training. The JSON snippet above is only a reference for what the structure looks like.
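To put a number on the label imbalance, a minimal sketch like the following could be used. The function name `positive_label_ratio` is hypothetical; the `sample_labels` values are copied from the `labels` field of the sample.

```python
from typing import Iterable, List


def positive_label_ratio(labels: Iterable[List[int]]) -> float:
    """Return the fraction of sentences labeled 1 across all sections."""
    flat = [label for section in labels for label in section]
    if not flat:
        return 0.0
    return sum(flat) / len(flat)


# 'labels' from the sample: one list per section, one flag per sentence.
sample_labels = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0],
    [0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 0, 1],
]

ratio = positive_label_ratio(sample_labels)
print(f"{ratio:.3f}")  # 3 positives out of 34 sentences -> 0.088
```

In this one sample only 3 of 34 sentences are selected, which is the kind of skew the note refers to. Whether the full dataset shows the same ratio would need to be measured on the processed files.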

achen353 commented 2 years ago

@stephanieeechang I'm attaching the processed files here because some exceed GitHub's 100 MB file size limit and would otherwise require Git LFS.

https://drive.google.com/file/d/1sp8h1hSZRzJGtYaA0zzxXS6hZbGdfG_5/view?usp=sharing

These files are NOT FINALIZED. Due to time constraints, I'm sharing them now so you can work on your part based on the JSON structure. See the description of this PR for why they are not finalized.

There are four JSON files. Once you extract the zip file, put them in /datasets/billsum_extractive/.
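Once the files are in place, loading them might look like the sketch below. It assumes each file is a single JSON document (for example, a list of examples shaped like the sample in the PR description); the actual file names and on-disk layout are not stated here, so the sketch simply globs for `*.json`. `load_processed_files` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Any, Dict


def load_processed_files(data_dir: str) -> Dict[str, Any]:
    """Load every *.json file in data_dir, keyed by file name without extension.

    Assumption: each file holds one JSON document that json.load can parse.
    """
    datasets = {}
    for path in sorted(Path(data_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            datasets[path.stem] = json.load(f)
    return datasets
```

For example, `load_processed_files("datasets/billsum_extractive")` run from the repository root would be expected to return four entries, one per extracted file.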

@andywang268 You should be aware of this too.