Preprocess script preprocess/ext_label_and_tokenize.py returns different results from processed CNN/DailyMail

Thanks for your great work!

After running preprocess/ext_label_and_tokenize.py on the raw CNN/DailyMail dataset with the following command:

python ext_lable_and_tokenize.py --raw_path [SOME PATH]/CoLo/extractive/datasets/raw_CNNDM --save_path [SOME PATH]/CoLo/extractive/datasets/preprocesssed_CNNDM --max_src_ntokens 512

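the results differ from the provided processed CNN/DailyMail data. For example, for article_id 34, my output has a text_id of only [0, 2], empty cls_ids, text, and labels, and a summary split into single characters: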

Processed CNN/DailyMail: {"article_id": "34", "text_id": [0, 50266, 1640, 16256, 43, 243, 21, 10, 6097, 1524, 183, 13, 5, 663, 589, 9, 4533, 5577, 165, 6, 53, 172, 4854, 376, 31, 11352, 4, 50265, 50266, 133, 1310, 21, 12022, 548, 944, 16235, 1777, 751, 9, 312, 4, 3217, 419, 273, 662, 4, 50265, 50266, 133, 165, 18, 4293, 300, 583, 5, 15261, 6, 77, 6017, 10, 37526, 9, 3102, 25520, 4373, 31, 5, 514, 8, 439, 15, 5, 908, 6, 103, 190, 164, 88, 5, 4293, 4, 50265, 50266, 20158, 919, 15692, 12717, 1602, 5, 1151, 9, 5231, 35, 22, 133, 3539, 21, 27325, 5435, 15, 127, 5856, 4, 50265, 50266, 243, 21, 98, 25645, 14, 38, 1705, 75, 120, 10, 11155, 15, 24, 72, 50265, 50266, 510, 26591, 24509, 23, 8351, 854, 5270, 871, 6, 22, 975, 5270, 6, 120, 24, 160, 162, 2901, 50265, 50266, 42948, 6, 117, 3236, 268, 58, 1710, 148, 5, 17702, 6, 53, 5, 670, 11362, 9, 3539, 24433, 3215, 11, 5, 3423, 14723, 4, 50265, 50266, 14287, 939, 22026, 10679, 7523, 11998, 19898, 18, 569, 1065, 4, 50265, 2], "cls_ids": [1, 28, 47, 82, 104, 120, 140, 166], "summary": ["Rowing team at Washington University attacked by flying carp .", "Member of the team caught the attack on video ."], "text": ["(CNN)It was a typical practice day for the Washington University of rowing team, but then danger came from beneath.", "The scene was Creve Coeur Lake outside of St. Louis early Friday morning.", "The team's boat got near the dock, when suddenly a swarm of Asian carp emerged from the water and went on the attack, some even going into the boat.", "Team member Devin Patel described the moment of terror: \"The fish was flopping on my legs.", "It was so slippery that I couldn't get a grip on it.\"", "Patel screamed at teammate Yoni David, \"Yoni, get it off me!\"", "Thankfully, no rowers were injured during the ordeal, but the strong smell of fish lingered in the moments afterward.", "Watch iReporter Benjamin Rosenbaum's video above."], "labels": [1, 0, 1, 0, 0, 0, 0, 0]}
My results: {"article_id": "34", "text_id": [0, 2], "cls_ids": [], "summary": ["R", "o", "w", "i", "n", "g", "", "t", "e", "a", "m", "", "a", "t", "", "W", "a", "s", "h", "i", "n", "g", "t", "o", "n", "", "U", "n", "i", "v", "e", "r", "s", "i", "t", "y", "", "a", "t", "t", "a", "c", "k", "e", "d", "", "b", "y", "", "f", "l", "y", "i", "n", "g", "", "c", "a", "r", "p", "", ".", "", "M", "e", "m", "b", "e", "r", "", "o", "f", "", "t", "h", "e", "", "t", "e", "a", "m", "", "c", "a", "u", "g", "h", "t", "", "t", "h", "e", "", "a", "t", "t", "a", "c", "k", "", "o", "n", "", "v", "i", "d", "e", "o", "", "."], "text": [], "labels": []}
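
In case it helps with reproducing this, here is a minimal sketch of how I compare the two records field by field. The helper names `diff_fields` and `looks_char_split` are my own and are not part of the CoLo repo, and the two records are shortened stand-ins for the article_id 34 entries above, not the full data.

```python
import json


def diff_fields(expected: dict, actual: dict) -> dict:
    """Return {field: (expected_value, actual_value)} for every field that differs."""
    keys = sorted(set(expected) | set(actual))
    return {
        k: (expected.get(k), actual.get(k))
        for k in keys
        if expected.get(k) != actual.get(k)
    }


def looks_char_split(items) -> bool:
    """Heuristic: True if a non-empty list holds only strings of length 0 or 1."""
    return bool(items) and all(isinstance(s, str) and len(s) <= 1 for s in items)


# Shortened stand-ins for the two records (assumption: one JSON object per record).
expected = json.loads(
    '{"article_id": "34", "text_id": [0, 50266, 2], '
    '"summary": ["Rowing team ."], "labels": [1, 0]}'
)
actual = json.loads(
    '{"article_id": "34", "text_id": [0, 2], '
    '"summary": ["R", "o", "w", "", "."], "labels": []}'
)

changed = diff_fields(expected, actual)
print("fields that differ:", sorted(changed))
print("summary split into characters:", looks_char_split(actual["summary"]))
```

On the full records, every field except article_id differs, and the summary check flags my output but not the provided one.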

Both the raw and the processed CNN/DailyMail datasets were downloaded from the links provided in this repo.

Can you suggest what I've done wrong?

Thanks in advance!
