Open UntotaufUrlaub opened 1 year ago
Thanks, it seems to be a bug. I will try to fix it soon.
Hi, I tried removing the linebreaks as a temporary fix, but I still hit an example that fails with the same error. So maybe the linebreaks are not the core of the issue, or maybe this example has a different problem.
from blanc import Shannon
import traceback
import re
docs = [
"""The competition finishes on 29 May in the same stadium. There are eight teams taking part, facing each other twice, with the top four sides qualifying for the play-offs.
You can keep up to date with all the scores, fixtures and results with BBC Sport.
Leading run-scorers: Warner (294) Kohli (267), De Villiers (249),
Most sixes: Warner (12) De Villiers (12), Kohli (8)
Highest score in an innings: De Kock (108), Warner (90*), Gambhir (90*)
Most wickets: McClenaghan (9), Kumar (8), Rahman (7)
* Four current or former England players will take part in the tournament:
Eoin Morgan will play for Sunrisers Hyderabad
Jos Buttler is with Mumbai Indians
Sam Billings plays for Delhi Daredevils
Kevin Pietersen will play for Rising Pune Supergiants.
Fixtures & results
(all times 15:30 BST unless stated)
Monday, 25 April
Kings XI Punjab v Mumbai Indians
Tuesday, 26 April
Sunrisers Hyderabad v Rising Pune Supergiants
Wednesday, 27 April
Delhi Daredevils v Gujurat Lions
Thursday, 28 April
Mumbai Indians v Kolkata Knight Riders
Friday, 29 April
Rising Pune Supergiants v Gujurat Lions
Saturday, 30 April
Delhi Daredevils v Kolkata Knight Riders (10:30)
Sunrisers Hyderabad v Royal Challengers Bangalore
Sunday, 1 May
Gujurat Lions v Kings XI Punjab (10:30 BST)
Rising Pune Supergiants v Mumbai Indians
Monday, 2 May
Royal Challengers Bangalore v Kolkata Knight Riders
Tuesday, 3 May
Gujurat Lions v Delhi Daredevils
Wednesday, 4 May
Kolkata Knight Riders v Kings XI Punjab
Thursday, 5 May
Delhi Daredevils v Rising Pune Supergiants
Friday, 6 May
Sunrisers Hyderabad v Gujurat Lions
Saturday, 7 May
Royal Challengers Bangalore v Rising Pune Supergiants (10:30)
Kings XI Punjab v Delhi Daredevils
Sunday, 8 May
Mumbai Indians v Sunrisers Hyderabad (10:30)
Kolkata Knight Riders v Gujurat Lions
Monday, 9 May
Kings XI Punjab v Royal Challengers Bangalore
Tuesday, 10 May
Rising Pune Supergiants v Sunrisers Hyderabad
Wednesday, 11 May
Royal Challengers Bangalore v Mumbai Indians
Thursday, 12 May
Sunrisers Hyderabad v Delhi Daredevils
Friday, 13 May
Mumbai Indians v Kings XI Punjab
Saturday, 14 May
Royal Challengers Bangalore v Gujurat Lions (10:30)
Kolkata Knight Riders v Rising Pune Supergiants
Sunday, 15 May
Mumbai Indians v Delhi Daredevils (10:30)
Kings XI Punjab v Sunrisers Hyderabad
Monday, 16 May
Kolkata Knight Riders v Royal Challengers Bangalore
Tuesday, 17 May
Rising Pune Supergiants v Delhi Daredevils
Wednesday, 18 May
Royal Challengers Bangalore v Kings XI Punjab
Thursday, 19 May
Gujurat Lions v Kolkata Knight Riders
Friday, 20 May
Delhi Daredevils v Sunrisers Hyderabad
Saturday, 21 May
Rising Pune Supergiants v Kings XI Punjab (10:30)
Gujurat Lions v Mumbai Indians
Sunday, 22 May
Kolkata Knight Riders v Sunrisers Hyderabad (10:30)
Delhi Daredevils v Royal Challengers Bangalore
Tuesday, 24 May
Qualifier 1
Wednesday, 25 May
Eliminator
Friday, 27 May
Qualifier 2
Sunday, 29 May
Final""",
]
summs = [
"""this year\'s indian premier league begins on wednesday with india\'s top 10 teams, including india, india and india.""",
]
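for doc, summ in zip(docs, summs):
    try:
        # Replace every newline with a space before scoring.
        Shannon().go(re.sub(r'\n', ' ', doc), summ)
    except Exception:
        # Print the traceback and continue with the next pair.
        traceback.print_exc()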
(The experiment is still running; if there are other failing examples I can add them here. Let me know if you prefer the code to be attached as a file.)
Token indices sequence length is longer than the specified maximum sequence length for this model (1674 > 1024). Running this sequence through the model will result in indexing errors
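The warning above reports 1674 tokens against a limit of 1024, so one thing worth trying while the bug is open is to shorten the document before scoring. The following is only a rough sketch under stated assumptions: `clip_to_word_budget` is a hypothetical helper (not part of blanc), the default of 700 words is an arbitrary guess, and whitespace-separated words are only a loose proxy for the model's subword tokens, so the clipped text may still exceed the limit. It is not confirmed that this avoids the error, and clipping discards the tail of the document, which can change the score.

```python
def clip_to_word_budget(text: str, max_words: int = 700) -> str:
    """Collapse whitespace runs (including newlines) and keep at most max_words words.

    Hypothetical helper, not part of blanc. max_words is an assumed budget;
    a subword tokenizer usually produces more tokens than there are words.
    """
    # str.split() with no arguments splits on any run of whitespace,
    # so newlines, tabs and repeated spaces are all normalized here.
    words = text.split()
    return " ".join(words[:max_words])


if __name__ == "__main__":
    sample = "Monday, 25 April\nKings XI Punjab v Mumbai Indians\nTuesday, 26 April"
    print(clip_to_word_budget(sample, max_words=5))
```

The clipped string would then be passed in place of `doc`, for example `Shannon().go(clip_to_word_budget(doc), summ)`.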
Hi,
I encountered an error using the Shannon metric. The error seems to occur if there are too many line breaks in the doc.
error message:
The error happens on the line Shannon().go(doc, summ), which indicates that the doc without linebreaks and the shorter doc work fine. (doc and summ are taken from the AggreFact benchmark. The string in "doc" has a weird start and end because I tried to reduce the example in a loop. The whole doc had the same error.)
Is this a bug, or am I supposed to pre-process the text and remove the newlines?
kind regards