achen353 / TransformerSum

BERT-based extractive summarizer for long legal documents using a divide-and-conquer approach
GNU General Public License v3.0

Fix 14 billsum data cleaning #16

Closed stephanieeechang closed 2 years ago

stephanieeechang commented 2 years ago

Add clean_text() and replace_semicolon() to helpers.py, then call clean_text() in convert_to_extractive.py.

achen353 commented 2 years ago

Can you give an example of the resulting data entry after running convert_to_extractive.py on BillSum (which will activate the cleaning script you added)?

Just a single data entry would work, if you could format it like this (this is from the cleaned data, but the output of convert_to_extractive.py would certainly be structured differently): https://github.com/achen353/TransformerSum/issues/14#issuecomment-969819538

achen353 commented 2 years ago

We don't have unit tests or snapshot tests here, which is why I'm asking for an example of the result.

achen353 commented 2 years ago

@stephanieeechang Sorry, I might not have been clear. https://github.com/achen353/TransformerSum/issues/14#issuecomment-971004070 is good, but I was asking for an entry from the resulting JSON file after running convert_to_extractive.py.

We already saw https://github.com/achen353/TransformerSum/issues/14#issuecomment-971004070 during the meeting.

stephanieeechang commented 2 years ago

@achen353 Sure! The whole entry is shown below. It is an example from ca_test.json, the same one included in my previous comment.

{"src": [["The", "people", "of", "the", "State", "of", "California", "do", "enact", "as", "follows", ":", "<", "SECTION", "-", "HEADER", ">"], ["The", "Legislature", "finds", "and", "declares", "all", "of", "the", "following", ":", "(", "1", ")", "Since", "1899", "congressionally", "chartered", "veterans", "\u2019", "organizations", "have", "provided", "a", "valuable", "service", "to", "our", "nation", "\u2019s", "returning", "service", "members", "."], ["These", "organizations", "help", "preserve", "the", "memories", "and", "incidents", "of", "the", "great", "hostilities", "fought", "by", "our", "nation", ",", "and", "preserve", "and", "strengthen", "comradeship", "among", "members", "."], ["These", "veterans", "\u2019", "organizations", "also", "own", "and", "manage", "various", "properties", "including", "lodges", ",", "posts", ",", "and", "fraternal", "halls", "."], ["These", "properties", "act", "as", "a", "safe", "haven", "where", "veterans", "of", "all", "ages", "and", "their", "families", "can", "gather", "together", "to", "find", "camaraderie", "and", "fellowship", ",", "share", "stories", ",", "and", "seek", "support", "from", "people", "who", "understand", "their", "unique", "experiences", "."], ["This", "aids", "in", "the", "healing", "process", "for", "these", "returning", "veterans", ",", "and", "ensures", "their", "health", "and", "happiness", "."], ["As", "a", "result", "of", "congressional", "chartering", "of", "these", "veterans", "\u2019", "organizations", ",", "the", "United", "States", "Internal", "Revenue", "Service", "created", "a", "special", "tax", "exemption", "for", "these", "organizations", "under", "Section", "501(c)(19", ")", "of", "the", "Internal", "Revenue", "Code", "."], ["Section", "501(c)(19", ")", "of", "the", "Internal", "Revenue", "Code", "and", "related", "federal", "regulations", "provide", "for", "the", "exemption", "for", "posts", "or", "organizations", "of", "war", "veterans", ",", "or", "an", "auxiliary", "unit", "or", 
"society", "of", ",", "or", "a", "trust", "or", "foundation", "for", ",", "any", "such", "post", "or", "organization", "that", ",", "among", "other", "attributes", ",", "carries", "on", "programs", "to", "perpetuate", "the", "memory", "of", "deceased", "veterans", "and", "members", "of", "the", "Armed", "Forces", "and", "to", "comfort", "their", "survivors", ",", "conducts", "programs", "for", "religious", ",", "charitable", ",", "scientific", ",", "literary", ",", "or", "educational", "purposes", ",", "sponsors", "or", "participates", "in", "activities", "of", "a", "patriotic", "nature", ",", "and", "provides", "social", "and", "recreational", "activities", "for", "their", "members", "."], ["Section", "215.1", "of", "the", "Revenue", "and", "Taxation", "Code", "stipulates", "that", "all", "buildings", ",", "support", "and", "so", "much", "of", "the", "real", "property", "on", "which", "the", "buildings", "are", "situated", "as", "may", "be", "required", "for", "the", "convenient", "use", "and", "occupation", "of", "the", "buildings", ",", "used", "exclusively", "for", "charitable", "purposes", ",", "owned", "by", "a", "veterans", "\u2019", "organization", "that", "has", "been", "chartered", "by", "the", "Congress", "of", "the", "United", "States", ",", "organized", "and", "operated", "for", "charitable", "purposes", ",", "when", "the", "same", "are", "used", "solely", "and", "exclusively", "for", "the", "purpose", "of", "the", "organization", ",", "if", "not", "conducted", "for", "profit", "and", "no", "part", "of", "the", "net", "earnings", "of", "which", "ensures", "to", "the", "benefit", "of", "any", "private", "individual", "or", "member", "thereof", ",", "are", "exempt", "from", "taxation", "."], ["The", "Chief", "Counsel", "of", "the", "State", "Board", "of", "Equalization", "concluded", ",", "based", "on", "a", "1979", "appellate", "court", "decision", ",", "that", "only", "parts", "of", "American", "Legion", "halls", "are", "exempt", "from", "property", 
"taxation", "and", "that", "other", "parts", ",", "such", "as", "billiard", "rooms", ",", "card", "rooms", ",", "and", "similar", "areas", ",", "are", "not", "exempt", "."], ["In", "a", "1994", "memorandum", ",", "the", "State", "Board", "of", "Equalization", "\u2019s", "legal", "division", "further", "concluded", "that", "the", "areas", "normally", "considered", "eligible", "for", "exemptions", "are", "the", "office", "areas", "used", "to", "counsel", "veterans", "and", "the", "area", "used", "to", "store", "veterans", "\u2019", "records", ",", "but", "that", "the", "meeting", "hall", "and", "bar", "found", "in", "most", "of", "the", "facilities", "are", "not", "considered", "used", "for", "charitable", "purposes", "."], ["Tax", "-", "exempt", "status", "is", "intended", "to", "provide", "economic", "incentive", "and", "support", "to", "veterans", "\u2019", "organizations", "to", "provide", "for", "the", "social", "welfare", "of", "the", "community", "of", "current", "and", "former", "military", "personnel", "."], ["The", "State", "Board", "of", "Equalization", "\u2019s", "constriction", "of", "the", "tax", "exemption", "has", "resulted", "in", "an", "onerous", "tax", "burden", "on", "California", "veteran", "service", "organizations", "posts", "or", "halls", ",", "hinders", "the", "posts", "\u2019", "ability", "to", "provide", "facilities", "for", "veterans", ",", "and", "threatens", "the", "economic", "viability", "of", "many", "local", "organizations", "."], ["The", "charitable", "activities", "of", "a", "veteran", "service", "organizations", "post", "or", "hall", "are", "much", "more", "than", "the", "counseling", "of", "veterans", "."], ["The", "requirements", "listed", "for", "qualification", "for", "the", "federal", "tax", "exemption", "clearly", "dictate", "a", "need", "for", "more", "than", "just", "an", "office", "."], ["Programs", "to", "perpetuate", "the", "memory", "of", "deceased", "veterans", "and", "members", "of", "the", "Armed", "Forces", "and", 
"to", "comfort", "their", "survivors", "require", "the", "use", "of", "facilities", "for", "funerals", "and", "receptions", "."], ["Programs", "for", "religious", ",", "charitable", ",", "scientific", ",", "literary", ",", "or", "educational", "purposes", "require", "space", "for", "more", "than", "50", "attendees", "."], ["Activities", "of", "a", "patriotic", "nature", "need", "facilities", "to", "accommodate", "hundreds", "of", "people", "."], ["Social", "and", "recreational", "activities", "for", "members", "require", "precisely", "those", "areas", "considered", "\u201c", "not", "used", "for", "charitable", "purposes", "\u201d", "by", "the", "State", "Board", "of", "Equalization", "."], ["The", "State", "Board", "of", "Equalization", "\u2019s", "interpretation", "of", "the", "Revenue", "and", "Taxation", "Code", "reflects", "a", "lack", "of", "understanding", "of", "the", "purpose", "and", "programs", "of", "the", "veterans", "service", "organizations", "posts", "or", "halls", "and", "is", "detrimental", "to", "the", "good", "works", "performed", "in", "support", "of", "our", "veteran", "community", "."], ["<", "SECTION", "-", "HEADER><SECTION", "-", "HEADER", ">", "Section", "215.1", "of", "the", "Revenue", "and", "Taxation", "Code", "is", "amended", "to", "read", ":", "215.1", "."], ["All", "buildings", ",", "and", "so", "much", "of", "the", "real", "property", "on", "which", "the", "buildings", "are", "situated", "as", "may", "be", "required", "for", "the", "convenient", "use", "and", "occupation", "of", "the", "buildings", ",", "used", "exclusively", "for", "charitable", "purposes", ",", "owned", "by", "a", "veterans", "\u2019", "organization", "that", "has", "been", "chartered", "by", "the", "Congress", "of", "the", "United", "States", ",", "organized", "and", "operated", "for", "charitable", "purposes", ",", "and", "exempt", "from", "federal", "income", "tax", "as", "an", "organization", "described", "in", "Section", "501(c)(19", ")", "of", "the", 
"Internal", "Revenue", "Code", "when", "the", "same", "are", "used", "solely", "and", "exclusively", "for", "the", "purpose", "of", "the", "organization", ",", "if", "not", "conducted", "for", "profit", "and", "no", "part", "of", "the", "net", "earnings", "of", "which", "inures", "to", "the", "benefit", "of", "any", "private", "individual", "or", "member", "thereof", ",", "shall", "be", "exempt", "from", "taxation", "."], ["The", "exemption", "provided", "for", "in", "this", "section", "shall", "apply", "to", "the", "property", "of", "all", "organizations", "meeting", "the", "requirements", "of", "this", "section", ",", "subdivision", "(", "b", ")", "of", "Section", "4", "of", "Article", "XIII", "of", "the", "California", "Constitution", ",", "and", "paragraphs", "(", "1", ")", "to", "(", "4", ")", ",", "inclusive", ",", "(", "6", ")", ",", "and", "(", "7", ")", "of", "subdivision", "(", "a", ")", "of", "Section", "214", "."], ["The", "exemption", "specified", "by", "subdivision", "(", "a", ")", "shall", "not", "be", "denied", "to", "a", "property", "on", "the", "basis", "that", "the", "property", "is", "used", "for", "fraternal", ",", "lodge", ",", "or", "social", "club", "purposes", "."], ["With", "regard", "to", "this", "subdivision", ",", "the", "Legislature", "finds", "and", "declares", "all", "of", "the", "following", ":", "The", "exempt", "activities", "of", "a", "veterans", "\u2019", "organization", "as", "described", "in", "subdivision", "(", "a", ")", "qualitatively", "differ", "from", "the", "exempt", "activities", "of", "other", "nonprofit", "entities", "that", "use", "property", "for", "fraternal", ",", "lodge", ",", "or", "social", "club", "purposes", "in", "that", "the", "exempt", "purpose", "of", "the", "veterans", "\u2019", "organization", "is", "to", "conduct", "programs", "to", "perpetuate", "the", "memory", "of", "deceased", "veterans", "and", "members", "of", "the", "Armed", "Forces", "and", "to", "comfort", "their", "survivors", ",", "to", 
"conduct", "programs", "for", "religious", ",", "charitable", ",", "scientific", ",", "literary", ",", "or", "educational", "purposes", ",", "to", "sponsor", "or", "participate", "in", "activities", "of", "a", "patriotic", "nature", ",", "and", "to", "provide", "social", "and", "recreational", "activities", "for", "their", "members", "."], ["In", "light", "of", "this", "distinction", ",", "the", "use", "of", "real", "property", "by", "a", "veterans", "\u2019", "organization", "as", "described", "in", "subdivision", "(", "a", ")", ",", "for", "fraternal", ",", "lodge", ",", "or", "social", "club", "purposes", "is", "central", "to", "that", "organization", "\u2019s", "exempt", "purposes", "and", "activities", "."], ["In", "light", "of", "the", "factors", "set", "forth", "in", "subparagraphs", "(", "A", ")", "and", "(", "B", ")", ",", "the", "use", "of", "real", "property", "by", "a", "veterans", "\u2019", "organization", "as", "described", "in", "subdivision", "(", "a", ")", "for", "fraternal", ",", "lodge", ",", "or", "social", "club", "purposes", ",", "constitutes", "the", "exclusive", "use", "of", "that", "property", "for", "a", "charitable", "purpose", "within", "the", "meaning", "of", "subdivision", "(", "b", ")", "of", "Section", "4", "of", "Article", "XIII", "of", "the", "California", "Constitution", "."], ["The", "exemption", "provided", "for", "in", "this", "section", "shall", "not", "apply", "to", "any", "portion", "of", "a", "property", "that", "consists", "of", "a", "bar", "where", "alcoholic", "beverages", "are", "served", "."], ["The", "portion", "of", "the", "property", "ineligible", "for", "the", "veterans", "\u2019", "organization", "exemption", "shall", "be", "that", "area", "used", "primarily", "to", "prepare", "and", "serve", "alcoholic", "beverages", "."], ["An", "organization", "that", "files", "a", "claim", "for", "the", "exemption", "provided", "for", "in", "this", "section", "shall", "file", "with", "the", "assessor", "a", "valid", 
"organizational", "clearance", "certificate", "issued", "pursuant", "to", "Section", "254.6", "."], ["This", "exemption", "shall", "be", "known", "as", "the", "\u201c", "veterans", "\u2019", "organization", "exemption", "."], ["\u201d<SECTION", "-", "HEADER><SECTION", "-", "HEADER", ">"], ["Notwithstanding", "Section", "2229", "of", "the", "Revenue", "and", "Taxation", "Code", ",", "no", "appropriation", "is", "made", "by", "this", "act", "and", "the", "state", "shall", "not", "reimburse", "any", "local", "agency", "for", "any", "property", "tax", "revenues", "lost", "by", "it", "pursuant", "to", "this", "act", "."], ["<", "SECTION", "-", "HEADER><SECTION", "-", "HEADER", ">"], ["This", "act", "provides", "for", "a", "tax", "levy", "within", "the", "meaning", "of", "Article", "IV", "of", "the", "Constitution", "and", "shall", "go", "into", "immediate", "effect", "."]], "labels": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0], "tgt": "Existing property tax law establishes a veterans \u2019 organization exemption under which property is exempt from taxation if , among other things , that property is used exclusively for charitable purposes and is owned by a veterans \u2019 organization .<q>This bill would provide that the veterans \u2019 organization exemption shall not be denied to a property on the basis that the property is used for fraternal , lodge , or social club purposes , and would make specific findings and declarations in that regard .<q>The bill would also provide that the exemption shall not apply to any portion of a property that consists of a bar where alcoholic beverages are served .<q>Section 2229 of the Revenue and Taxation Code requires the Legislature to reimburse local agencies annually for certain property tax revenues lost as a result of any exemption or classification of property for purposes of ad valorem property taxation .<q>This bill would provide that , notwithstanding Section 
2229 of the Revenue and Taxation Code , no appropriation is made and the state shall not reimburse local agencies for property tax revenues lost by them pursuant to the bill .<q>This bill would take effect immediately as a tax levy ."}
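
Since there are no unit or snapshot tests yet, a quick check of an entry like the one above could look roughly like this. This is only a sketch, not code from the PR: the function name summarize_entry and the toy entry are hypothetical, and it assumes that "labels" holds one 0/1 flag per sentence in "src" and that "<q>" separates sentences in "tgt", as the entry above suggests.

```python
import json


def summarize_entry(entry):
    """Return basic sanity facts about one extractive-format entry.

    Assumption (not confirmed by the PR): "labels" has one 0/1 flag per
    tokenized sentence in "src", and "tgt" joins sentences with "<q>".
    """
    src, labels, tgt = entry["src"], entry["labels"], entry["tgt"]
    if len(src) != len(labels):
        raise ValueError(
            f"{len(src)} source sentences but {len(labels)} labels"
        )
    # Sentences marked 1 are the ones chosen for the extractive summary.
    selected = [
        " ".join(tokens) for tokens, label in zip(src, labels) if label == 1
    ]
    # Count tokens that still contain a section-header marker after cleaning.
    leftover_markers = sum(
        1 for sentence in src for token in sentence if "HEADER" in token
    )
    return {
        "num_sentences": len(src),
        "num_selected": len(selected),
        "selected": selected,
        "num_target_sentences": len(tgt.split("<q>")),
        "leftover_markers": leftover_markers,
    }


# Hypothetical toy entry in the same shape as the real one above.
toy_entry = json.loads(
    '{"src": [["The", "people", "."], '
    '["<", "SECTION", "-", "HEADER", ">"], '
    '["This", "act", "."]], '
    '"labels": [0, 0, 1], '
    '"tgt": "A summary .<q>Another sentence ."}'
)
report = summarize_entry(toy_entry)
print(report)
```

Running the same function over every entry of the generated JSON file would flag mismatched label counts and show how many section-header tokens survive cleaning.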