acl-org / acl-anthology

Data and software for building the ACL Anthology.
https://aclanthology.org
Apache License 2.0

Ingestion of new proceedings causes inconsistency in the database #42

Closed villalbamartin closed 6 years ago

villalbamartin commented 6 years ago

When adding data for a new conference, and under some not-yet-determined circumstances, the database can end up in an inconsistent state. This is evidenced by error pages in the search functionality, as seen in issues #32, #39, and others.

knmnyn commented 6 years ago

I've just tried to correct errors with W17-35 (INLG 2017), which requires re-ingesting all of the works from W17. This brought up problems with certain consistency checks that seem to have been recently added.

aclanthology@aclanthology:~/acl-anthology/public/pdf/W$ rake import:xml[true,"W17"]
(in /home/aclanthology/acl-anthology)
Seeding individual volume: W17.
PG::ForeignKeyViolation: ERROR: update or delete on table "papers" violates foreign key constraint "papers_people_paper_id_fkey" on table "papers_people"
DETAIL: Key (id)=(52374) is still referenced from table "papers_people".
: DELETE FROM papers WHERE volume_id IN (SELECT id FROM volumes WHERE anthology_id LIKE 'W17%');
rake aborted!
ActiveRecord::InvalidForeignKey: PG::ForeignKeyViolation: ERROR: update or delete on table "papers" violates foreign key constraint "papers_people_paper_id_fkey" on table "papers_people"
DETAIL: Key (id)=(52374) is still referenced from table "papers_people".
: DELETE FROM papers WHERE volume_id IN (SELECT id FROM volumes WHERE anthology_id LIKE 'W17%');
/home/aclanthology/.rvm/gems/ruby-2.0.0-p353@acl/gems/activerecord-4.0.1/lib/active_record/connection_adapters/postgresql/database_statements.rb:128:in `exec'
/home/aclanthology/.rvm/gems/ruby-2.0.0-p353@acl/gems/activerecord-4.0.1/lib/active_record/connection_adapters/postgresql/database_statements.rb:128:in `block in execute'
/home/aclanthology/.rvm/gems/ruby-2.0.0-p353@acl/gems/activerecord-4.0.1/lib/active_record/connection_adapters/abstract_adapter.rb:435:in `block in log'
/home/aclanthology/.rvm/gems/ruby-2.0.0-p353@acl/gems/activesupport-4.0.1/lib/active_support/notifications/instrumenter.rb:20:in `instrument'
/home/aclanthology/.rvm/gems/ruby-2.0.0-p353@acl/gems/activerecord-4.0.1/lib/active_record/connection_adapters/abstract_adapter.rb:430:in `log'
/home/aclanthology/.rvm/gems/ruby-2.0.0-p353@acl/gems/activerecord-4.0.1/lib/active_record/connection_adapters/postgresql/database_statements.rb:127:in `execute'
/home/aclanthology/acl-anthology/lib/tasks/xml_import.rake:330:in `block (2 levels) in <top (required)>'
/home/aclanthology/.rvm/gems/ruby-2.0.0-p353@acl/bin/ruby_executable_hooks:15:in `eval'
/home/aclanthology/.rvm/gems/ruby-2.0.0-p353@acl/bin/ruby_executable_hooks:15:in `<main>'
Tasks: TOP => import:xml
(See full trace by running task with --trace)

knmnyn commented 6 years ago

We need to fix this problem soon. Re-ingesting a volume (say, the workshops from W17) worked as recently as one month ago, and without it we can't ingest new proceedings or update existing ones.

@villalbamartin can you try ingesting the proceedings in the Saarlands VM at import/W17.xml?

villalbamartin commented 6 years ago

I'm on it. While it's not nice that something broke, we were expecting something like this to happen with the new checks.

knmnyn commented 6 years ago

Great, thanks. Exactly as you said, we expected this to happen, and now we can go about figuring out how to fix it. :)

On Thu, 9 Nov 2017 at 17:38, villalbamartin notifications@github.com wrote:

I'm on it. While it's not nice that something broke, we were expecting something like this to happen with the new checks.

villalbamartin commented 6 years ago

Documenting this bug before I attempt to squash it.

The error is triggered by this line (xml_import.rake:330 in the trace above, the DELETE FROM papers statement), which deletes all papers (if any) that belong to the volume about to be imported. Those papers either do not exist, in which case nothing is deleted, or they will be re-inserted shortly afterwards.

The problem is the table papers_people, which relates papers and authors. This table remains untouched, leading to the following sequence of events:

These are the two steps I'll take to try to correct this:

villalbamartin commented 6 years ago

I've now added a line to delete the proper records, and ran rake import:xml[true,"W17"] without error. @knmnyn, could you confirm that things are working as expected? Note that I didn't run the entire ingestion pipeline, only the line that you mentioned in your report.
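For readers following along, here is a minimal sketch of the failure and of a delete ordering that avoids it. It is not the actual change made to xml_import.rake. It uses Python's built-in sqlite3 instead of PostgreSQL, reuses the table and column names that appear in the error message, and assumes a hypothetical `person_id` column and helper function names.

```python
import sqlite3

# Subquery taken from the failing statement in the trace above.
VOLUME_FILTER = "SELECT id FROM volumes WHERE anthology_id LIKE ?"


def make_db():
    """Build a tiny in-memory stand-in for the Anthology schema (assumed layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when enabled
    conn.executescript("""
        CREATE TABLE volumes (id INTEGER PRIMARY KEY, anthology_id TEXT);
        CREATE TABLE papers (
            id INTEGER PRIMARY KEY,
            volume_id INTEGER REFERENCES volumes(id)
        );
        CREATE TABLE papers_people (
            paper_id INTEGER REFERENCES papers(id),
            person_id INTEGER  -- hypothetical column name
        );
        INSERT INTO volumes VALUES (1, 'W17-35');
        INSERT INTO papers VALUES (52374, 1);
        INSERT INTO papers_people VALUES (52374, 7);
    """)
    return conn


def delete_papers_only(conn, prefix):
    """Mirror the failing statement: papers go, papers_people is left untouched."""
    conn.execute(
        f"DELETE FROM papers WHERE volume_id IN ({VOLUME_FILTER})",
        (prefix + "%",),
    )


def delete_volume_papers(conn, prefix):
    """Delete the referencing papers_people rows first, then the papers."""
    pattern = prefix + "%"
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "DELETE FROM papers_people WHERE paper_id IN "
            f"(SELECT id FROM papers WHERE volume_id IN ({VOLUME_FILTER}))",
            (pattern,),
        )
        conn.execute(
            f"DELETE FROM papers WHERE volume_id IN ({VOLUME_FILTER})",
            (pattern,),
        )
```

With the foreign key enforced, `delete_papers_only(make_db(), "W17")` raises `sqlite3.IntegrityError`, the SQLite counterpart of the `PG::ForeignKeyViolation` above, while `delete_volume_papers(make_db(), "W17")` empties both tables so the volume can be re-imported.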

knmnyn commented 6 years ago

@villalbamartin Thanks, that looks like it worked!

villalbamartin commented 5 years ago

Update: the consistency check disappeared when we recreated the database at some point, yet the modification we made to xml_import.rake seems to have solved the problem. I don't see the need to add an extra constraint to the database that we apparently don't need. So the fix is real and still in place, but the database consistency check is no longer there.
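Without the constraint in place, one way to spot-check that an import leaves no dangling author links is an anti-join over the two tables. The sketch below is illustrative only: it uses Python's built-in sqlite3, a cut-down schema, and a hypothetical `person_id` column and function name. The query itself should carry over to PostgreSQL largely unchanged.

```python
import sqlite3


def orphaned_author_links(conn):
    """Return papers_people rows whose paper_id matches no row in papers."""
    return conn.execute("""
        SELECT pp.paper_id, pp.person_id
        FROM papers_people AS pp
        LEFT JOIN papers AS p ON p.id = pp.paper_id
        WHERE p.id IS NULL
        ORDER BY pp.paper_id
    """).fetchall()


# Toy data: paper 1 exists, paper 2 was deleted without cleaning up its link.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE papers (id INTEGER PRIMARY KEY);
    CREATE TABLE papers_people (paper_id INTEGER, person_id INTEGER);
    INSERT INTO papers VALUES (1);
    INSERT INTO papers_people VALUES (1, 10);
    INSERT INTO papers_people VALUES (2, 11);
""")
print(orphaned_author_links(conn))  # prints [(2, 11)]
```

An empty result after a re-import would suggest the delete step cleaned up both tables.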

knmnyn commented 5 years ago

Ok. Thanks, @villalbamartin !