LAION-AI / Open-Instruction-Generalist

Open Instruction Generalist is an assistant trained on massive synthetic instructions to perform many millions of tasks
Apache License 2.0
206 stars, 19 forks

[unified_multi_sum.jsonl] Excessive "\n" in this file. #12

Open Mythos-Rudy opened 1 year ago

Mythos-Rudy commented 1 year ago

I noticed that there are excessive line break characters ('\n') and blank spaces in a single sentence within the unified_multi_sum.jsonl file. I suspect that this data was collected from books or PDFs, and as a result, the line breaks in every line from the original sources were included in the dataset. This issue can be seen in Line 1, for instance:

{"text": "\n: Summarize the following proposed legislation (bill): SECTION 1. SHORT TITLE.\n\n This Act may be cited as the Patients and Public Health \nPartnership Act of 2008''.\n\nSEC. 2. DEMONSTRATION PROJECT FOR INTEGRATED HEALTH SYSTEMS TO EXPAND \n ACCESS TO PRIMARY AND PREVENTIVE SERVICES FOR THE \n MEDICALLY UNDERSERVED.\n\n Part D of title III of the Public Health Service Act (42 U.S.C. \n259b et seq.) is amended by adding at the end the following new \nsubpart:\n\nSubpart XI--Demonstration Project for Integrated Health Systems to \n Expand Access to Primary and Preventive Services for the Medically \n Underserved\n\nSEC. 340H. DEMONSTRATION PROJECT FOR INTEGRATED HEALTH SYSTEMS TO \n EXPAND ACCESS TO PRIMARY AND PREVENTIVE CARE FOR THE \n MEDICALLY UNDERSERVED.\n\n(a) Establishment of Demonstration.--\n (1) In general.--Not later than January 1, 2009, the \n Secretary shall establish a demonstration project (hereafter in \n this section referred to as the `demonstration') under which up \n to 30 qualifying integrated health systems receive grants for \n the costs of their operations to expand access to primary and \n preventive services for the medically underserved.\n(2) Rule of construction.--Nothing in this section shall \n be construed as authorizing grants to be made or used for the \n costs of specialty care or hospital care furnished by an \n integrated health system.\n (b) Application.--Any integrated health system desiring to \nparticipate in the demonstration shall submit an application in such \nmanner, at such time, and containing such information as the Secretary \nmay require.\n(c) Criteria for Selection.--In selecting integrated health \nsystems to participate in the demonstration (hereafter referred to as \n`participating integrated health systems'), the Secretary shall ensure \nrepresentation of integrated health systems that are located in a \nvariety of States (including the District of Columbia and the \nterritories and possessions of 
the United States) and locations within \nStates, including rural areas, inner-city areas, and frontier areas.\n (d) Duration.--Subject to the availability of appropriations, the \ndemonstration shall be conducted (and operating grants be made to each \nparticipating integrated health system) for a period of 3 years.\n(e) Reports.--\n >

huu4ontocord commented 1 year ago

Yes, I see. These are pieces of legislation, and the line breaks come from the original dataset. Let's see if we can remove some of the excessive \n. But note that the spaces are needed because they are meant to format the legislation. If you don't want to wait, you can filter it yourself with something like replace("\n\n", "\n").
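For anyone who wants to apply that filter locally, here is a minimal sketch. It assumes each line of unified_multi_sum.jsonl is a JSON object with a "text" field, as in the sample above. The function names and file paths are hypothetical and not part of the repository. Note that a single replace("\n\n", "\n") pass only halves a run of newlines, so a run of three or more still leaves a blank line behind. The first helper uses a regex to collapse each run fully. The second helper is an optional, more aggressive variant that joins hard-wrapped lines into paragraphs; it discards the indentation that formats the legislation, so use it only if that layout is not needed.

```python
import json
import re


def collapse_newlines(text: str) -> str:
    """Collapse every run of two or more newlines into a single newline.

    Leading spaces on each line are left untouched, so the indentation
    that formats the legislation is preserved.
    """
    return re.sub(r"\n{2,}", "\n", text)


def unwrap_lines(text: str) -> str:
    """Optional, lossy variant: join hard-wrapped lines into paragraphs.

    Blank lines are treated as paragraph breaks; every other newline,
    together with the surrounding spaces, becomes a single space.
    """
    paragraphs = re.split(r"\n\s*\n", text)
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


def clean_jsonl(src_path: str, dst_path: str, cleaner=collapse_newlines) -> None:
    """Apply `cleaner` to the "text" field of every record in a JSONL file."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip empty lines between records
            record = json.loads(line)
            record["text"] = cleaner(record["text"])
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Calling clean_jsonl("unified_multi_sum.jsonl", "unified_multi_sum.clean.jsonl") would write a cleaned copy and leave the original file unchanged; pass cleaner=unwrap_lines to use the lossy variant instead.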