Open Mythos-Rudy opened 1 year ago
Yes, I see. These are pieces of legislation, and the formatting comes from the original dataset. Let's see if we can remove some of the excessive \n. Note, though, that the spaces are needed because they are meant to format the legislation. If you don't want to wait, you can filter the extra line breaks out yourself by doing a replace("\n\n", "\n").
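For anyone who wants to apply that workaround now, here is a rough sketch of how it could look in Python. It makes a few assumptions that are not confirmed in this thread: the field name `"text"` is hypothetical (check the actual keys in unified_multi_sum.jsonl), and the helper names are made up for illustration. It also swaps the single `replace("\n\n", "\n")` for a regex, because one `replace` pass only halves a long run of line breaks. Spaces are left alone, since they carry the legislation's formatting.

```python
import json
import re


def collapse_newlines(text: str) -> str:
    """Collapse every run of two or more newlines into a single newline.

    Spaces are deliberately untouched, because they format the legislation.
    """
    return re.sub(r"\n{2,}", "\n", text)


def clean_record(record: dict, field: str = "text") -> dict:
    """Return a copy of the record with the given string field cleaned.

    NOTE: "text" is a hypothetical field name; adjust it to the real schema.
    """
    cleaned = dict(record)
    value = cleaned.get(field)
    if isinstance(value, str):
        cleaned[field] = collapse_newlines(value)
    return cleaned


def clean_jsonl(src_path: str, dst_path: str, field: str = "text") -> None:
    """Stream a JSONL file line by line and write a cleaned copy."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines between records
            record = clean_record(json.loads(line), field)
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Usage would be something like `clean_jsonl("unified_multi_sum.jsonl", "unified_multi_sum.cleaned.jsonl")`. If other fields (for example a summary) have the same problem, call `clean_record` once per field.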
I noticed that there are excessive line break characters ('\n') and blank spaces in a single sentence within the unified_multi_sum.jsonl file. I suspect that this data was collected from books or PDFs, and as a result, the line breaks in every line from the original sources were included in the dataset. This issue can be seen in Line 1, for instance: