PrimerAI / blanc

Human-free quality estimation of document summaries
MIT License

Shannon error when doc has too many linebreaks #51

Open UntotaufUrlaub opened 1 year ago

UntotaufUrlaub commented 1 year ago

Hi,

I encountered an error using the Shannon metric. The error seems to occur if there are too many line breaks in the doc.

from blanc import Shannon
import re

doc = """nder
Election 2015: Week Ahead - The unveiling of manifestos
Election 2015: Expenditure on the NHS will be a priority, says David Gauke
Election 2015:  Voting issues for Bristol prop makers
Election 2015 smaller parties: National Health Action (NHA)
Election 2015: Northampton voters offered political mug
Election 2015: Andrew Neil's Friday campaign report
Rail fares debate: Eric Pickles v Jack Dromey
Election 2015: Andrew Neil's Thursday campaign report
Thatcher's armoured bus from Northern Ireland for sale
Election 2015 smaller parties: Liberty GB
Election 2015: Voting issues for Somerset stonemasons
Election 2015: Games and online sites about voting
Election 2015: Ed Balls talking about non-dom status
Election 2015: Andrew Neil's Wednesday campaign report
Election 2015: Voting issues for Cornwall cheese-makers
Election: Christian People's Alliance and Christian Party
Mahmood on Labour bid to abolish non-dom rules
Election 2015: Tory and Lib Dems on coalition taxes
Election 2015 smaller parties: Peace Party policies
Election 2015: Voters at National Aquarium in Plymouth
Election 2015: Opinion polls and role of focus groups
Election 2015: Priti Patel and Chris Leslie on Europe
Election 2015: Priti Patel and Chris Leslie on health
Election 2012: Market affected by hung parliament results
Election 2015 smaller parties: Community Party of Britain
Election 2015: TV viewers asked about leaders' debate
Election 2015: Andrew Neil's Thursday campaign report
Election 2015: How union members could affect vote
Election 2015: Labour or Conservative choice on economy
Election 2015: Andrew Neil's Wednesday campaign report
Election 2015: Trader on Labour's zero hours contract policy
Election 2015: Independence from Europe Party
Election 2015: Rat, hedgehog, James Bond and Joey Essex
Zero hours contract debate: Javid, Cable and Leslie
Election 2015: Andrew Neil's campaign Morning Report
Election 2015: Tax levels in UK and other countries
Election 20105: Voters views on political campaigns
Election 2015: Plaid leader Leanne Wood at party launch
Election 2015: Cannabis is Safer than Alcohol Party
Election 2015: Andrew Neil's campaign Morning Report
Election 2015: What the UK and Scottish polls predict
Election 2015 smaller parties: Mebyon Kernow
Election 2015: Issuing 650 writs to get voting started
Lucy Powell: Labour government would ban exploitative zero hour contracts
Peter Kellner: There is a "real Labour bounce" in latest poll
Alan Duncan: Cameron's third term decision 'not unwise'
How will the general election campaigns pan out?
Was it wrong for Tories to try and oust Commons speaker?
Labour's Lucy Powell clashes with presenter Andrew Neil
Famous faces: MPs retiring and leaving political stage
BBC News Timeliner hosts election archives
How many archive election broadcasts can you remember?
How does Big Ben cope with the change to summer time?
Burnham: NHS is going backwards on this government's watch
What happened to coalition predictions?
La Reine le veult: What is prorogation in Parliament?
MacKenzie: "White poor thickos" claiming the benefits
Would you want to do these jobs?
When should Prince Charles’ letters be published?
Secret ballots for future Speaker elections?
Hancock and Mahmood: Tax and national insurance pledges
Election 2015: Artist Adam Dant drawing the campaign
PMQs highlights 2010-2015: Cameron, Miliband and MPs
London Marathon bid in election run-up by Dan Jarvis MP
PMQs: Cameron on British deaths in A320 Alps air crash
PMQs: Cameron and Miliband on post-election VAT rises
PMQs: Cameron and Miliband on national insurance and taxes
PMQs: Cameron on Connarty 'standing down' at election
PMQS review: Patel and Umunna join Landale and Neil
Election: Speechwriters Collins and FinkelsteinDaily Politics highlights of 2015
Election 2015: DUP's Donaldson on hung parliament talks
What do UKIP and Green councillors think?
Brian May on Common Decency campaign
Chris Leslie on Labour election VAT pledge
Why did Cameron announce f
"""
doc_withoutLinebreaks = re.sub(r'\n', ' ', doc)
doc_firstHalf = doc[:len(doc)//2]
summ = """the daily and sunday politics are on-air six days a week for much of the year reporting the political news from westminster and beyond."""

Shannon().go(doc_withoutLinebreaks, summ)
Shannon().go(doc_firstHalf, summ)
Shannon().go(doc, summ)

error message:

Traceback (most recent call last):
  File ".../shannon_error_minimal.py", line 87, in <module>
    Shannon().go(doc, summ)
  File ".../.local/lib/python3.11/site-packages/blanc/shannon.py", line 206, in go
    full_sent_lls, full_sent_success = self.measure(sent_tokens, full_prompt)
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../.local/lib/python3.11/site-packages/blanc/shannon.py", line 124, in measure
    past = [t[:, :, :, 1:, :] for t in past]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../.local/lib/python3.11/site-packages/blanc/shannon.py", line 124, in <listcomp>
    past = [t[:, :, :, 1:, :] for t in past]
            ~^^^^^^^^^^^^^^^^
TypeError: tuple indices must be integers or slices, not tuple
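
The final TypeError means that each `t` in `past` was a plain Python tuple when line 124 ran, because tuples cannot be indexed with several comma-separated slices the way tensors can. A minimal sketch that reproduces the same message without blanc or a model is given here. The `layer_cache` value is a made-up stand-in, and the idea that the real elements are per-layer (key, value) pairs is an assumption, not something the traceback confirms.

```python
# Hypothetical stand-in for one element of `past`: a plain tuple.
# Assumption: in the failing run each element is a tuple (for example a
# per-layer (key, value) pair) rather than a single tensor.
layer_cache = ("key_placeholder", "value_placeholder")


def slice_like_line_124(t):
    # Same indexing expression as line 124 of shannon.py in the traceback.
    # On a tensor this drops the first position along the fourth axis;
    # on a tuple the index is itself a tuple of slices, which is rejected.
    return t[:, :, :, 1:, :]


try:
    slice_like_line_124(layer_cache)
    message = None
except TypeError as exc:
    message = str(exc)

print(message)
```

Since this slicing only appears on one code path in `measure`, it would be consistent with the report that shorter inputs never reach it, though that is a guess based on the traceback alone.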

The error is raised only by the Shannon().go(doc, summ) call; the doc without linebreaks and the shorter doc both run fine. (doc and summ are taken from the AggreFact benchmark. The string in "doc" starts and ends oddly because I reduced the example in a loop. The full doc produced the same error.)

Is this a bug, or am I supposed to pre-process the text and remove the newlines?

kind regards

OlegVasilyev4096 commented 1 year ago

Thanks, this seems to be a bug. I will try to fix it soon.

UntotaufUrlaub commented 1 year ago

Hi, I tried removing the linebreaks as a temporary fix, but I still hit an example that fails with the same error. So maybe the linebreaks are not the core of the issue, or maybe this example has a different problem.

from blanc import Shannon
import traceback
import re

docs = [
    """The competition finishes on 29 May in the same stadium. There are eight teams taking part, facing each other twice, with the top four sides qualifying for the play-offs.
You can keep up to date with all the scores, fixtures and results with BBC Sport.
Leading run-scorers:  Warner (294) Kohli (267), De Villiers (249),
Most sixes:  Warner (12) De Villiers (12), Kohli (8)
Highest score in an innings: De Kock (108), Warner (90*), Gambhir (90*)
Most wickets: McClenaghan (9), Kumar (8), Rahman (7)
* Four current or former England players will take part in the tournament:
Eoin Morgan will play for Sunrisers Hyderabad
Jos Buttler is with Mumbai Indians
Sam Billings plays for Delhi Daredevils
Kevin Pietersen will play for Rising Pune Supergiants.
Fixtures & results
(all times 15:30 BST unless stated)
Monday, 25 April
Kings XI Punjab v Mumbai Indians
Tuesday, 26 April
Sunrisers Hyderabad v Rising Pune Supergiants
Wednesday, 27 April
Delhi Daredevils v Gujurat Lions
Thursday, 28 April
Mumbai Indians v Kolkata Knight Riders
Friday, 29 April
Rising Pune Supergiants v Gujurat Lions
Saturday, 30 April
Delhi Daredevils v Kolkata Knight Riders (10:30)
Sunrisers Hyderabad v Royal Challengers Bangalore
Sunday, 1 May
Gujurat Lions v Kings XI Punjab (10:30 BST)
Rising Pune Supergiants v Mumbai Indians
Monday, 2 May
Royal Challengers Bangalore v Kolkata Knight Riders
Tuesday, 3 May
Gujurat Lions v Delhi Daredevils
Wednesday, 4 May
Kolkata Knight Riders v Kings XI Punjab
Thursday, 5 May
Delhi Daredevils v Rising Pune Supergiants
Friday, 6 May
Sunrisers Hyderabad v Gujurat Lions
Saturday, 7 May
Royal Challengers Bangalore v Rising Pune Supergiants (10:30)
Kings XI Punjab v Delhi Daredevils
Sunday, 8 May
Mumbai Indians v Sunrisers Hyderabad (10:30)
Kolkata Knight Riders v Gujurat Lions
Monday, 9 May
Kings XI Punjab v Royal Challengers Bangalore
Tuesday, 10 May
Rising Pune Supergiants v Sunrisers Hyderabad
Wednesday, 11 May
Royal Challengers Bangalore v Mumbai Indians
Thursday, 12 May
Sunrisers Hyderabad v Delhi Daredevils
Friday, 13 May
Mumbai Indians v Kings XI Punjab
Saturday, 14 May
Royal Challengers Bangalore v Gujurat Lions (10:30)
Kolkata Knight Riders v Rising Pune Supergiants
Sunday, 15 May
Mumbai Indians v Delhi Daredevils (10:30)
Kings XI Punjab v Sunrisers Hyderabad
Monday, 16 May
Kolkata Knight Riders v Royal Challengers Bangalore
Tuesday, 17 May
Rising Pune Supergiants v Delhi Daredevils
Wednesday, 18 May
Royal Challengers Bangalore v Kings XI Punjab
Thursday, 19 May
Gujurat Lions v Kolkata Knight Riders
Friday, 20 May
Delhi Daredevils v Sunrisers Hyderabad
Saturday, 21 May
Rising Pune Supergiants v Kings XI Punjab (10:30)
Gujurat Lions v Mumbai Indians
Sunday, 22 May
Kolkata Knight Riders v Sunrisers Hyderabad (10:30)
Delhi Daredevils v Royal Challengers Bangalore
Tuesday, 24 May
Qualifier 1
Wednesday, 25 May
Eliminator
Friday, 27 May
Qualifier 2
Sunday, 29 May
Final""",
]
summs = [
    """this year\'s indian premier league begins on wednesday with india\'s top 10 teams, including india, india and india.""",
]

for doc, summ in zip(docs, summs):
    try:
        Shannon().go(re.sub(r'\n', ' ', doc), summ)
    except Exception:
        traceback.print_exc()

(The experiment is still running; if other examples fail I can add them here. Let me know if you would prefer the code attached as a file.)
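
To gather every failing example from a longer run in one place, the loop above could be turned into a small helper. This is only a sketch: `collect_failures` and `fake_score` are hypothetical names, and `score` stands in for whatever callable is being tested (for instance a wrapper around `Shannon().go`).

```python
import traceback
from typing import Callable, Iterable, List, Tuple


def collect_failures(
    score: Callable[[str, str], object],
    pairs: Iterable[Tuple[str, str]],
) -> List[Tuple[int, str]]:
    """Run score(doc, summ) on each pair; return (index, traceback text) for failures."""
    failures = []
    for index, (doc, summ) in enumerate(pairs):
        try:
            score(doc, summ)
        except Exception:
            # Keep the formatted traceback so it can be pasted into a report.
            failures.append((index, traceback.format_exc()))
    return failures


def fake_score(doc: str, summ: str) -> float:
    # Hypothetical scorer used only for this demonstration:
    # it fails on "long" documents, loosely mimicking the reported behaviour.
    if len(doc.split()) > 5:
        raise TypeError("tuple indices must be integers or slices, not tuple")
    return 0.0


pairs = [("short doc", "summary"), ("one two three four five six seven", "summary")]
failures = collect_failures(fake_score, pairs)
failing_indices = [index for index, _ in failures]
print(failing_indices)
```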

joeliuz6 commented 9 months ago

Token indices sequence length is longer than the specified maximum sequence length for this model (1674 > 1024). Running this sequence through the model will result in indexing errors
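
The warning above reports an input of 1674 tokens against a model maximum of 1024, which suggests the failures may be tied to input length rather than to linebreaks as such. A minimal sketch of a pre-check that separates documents by token count before scoring follows. `partition_by_length` is a hypothetical helper, and `tokenize` is a placeholder for the model's real tokenizer. Whitespace splitting is used below only to keep the example self-contained, so its counts will not match the model's subword counts.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

MAX_TOKENS = 1024  # limit quoted in the warning; assumed to apply to the model in use


def partition_by_length(
    docs: Iterable[str],
    tokenize: Callable[[str], Sequence],
    max_tokens: int = MAX_TOKENS,
) -> Tuple[List[str], List[str]]:
    """Split docs into (within_limit, over_limit) by token count."""
    within_limit, over_limit = [], []
    for doc in docs:
        if len(tokenize(doc)) <= max_tokens:
            within_limit.append(doc)
        else:
            over_limit.append(doc)
    return within_limit, over_limit


# Demonstration with whitespace tokens standing in for model tokens.
short_doc = "a short document"
long_doc = " ".join(["word"] * 1674)
within_limit, over_limit = partition_by_length([short_doc, long_doc], str.split)
print(len(within_limit), len(over_limit))
```

Documents that land in `over_limit` are the ones worth checking first when trying to reproduce the error.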