c2corg / v6_api

REST API for https://www.camptocamp.org
GNU Affero General Public License v3.0

IntegrityError when migrating the images data from v5 to v6 #363

Closed asaunier closed 7 years ago

asaunier commented 8 years ago

When running the v5>v6 migration script, it crashes with:

sqlalchemy.exc.IntegrityError: (psycopg2.IntegrityError) duplicate key value violates unique constraint "images_filename_key" DETAIL: Key (filename)=(1279011081_1740341383.jpg) already exists. [SQL: 'INSERT INTO guidebook.images (activities, image_type, document_id, filename) VALUES (CAST(%(activities)s AS guidebook.activity_type[]), %(image_type)s, %(document_id)s, %(filename)s)'] [parameters: {'image_type': 'personal', 'filename': '1279011081_1740341383.jpg', 'document_id': 227184, 'activities': ['snow_ice_mixed']}]

Seems related to that PR: https://github.com/c2corg/v6_api/pull/331

asaunier commented 8 years ago

Temporary workaround applied in the demo API code so that the migration passes:

diff --git a/c2corg_api/models/image.py b/c2corg_api/models/image.py
index 647d88c..03b9bef 100644
--- a/c2corg_api/models/image.py
+++ b/c2corg_api/models/image.py
@@ -49,8 +49,7 @@ class _ImageMixin(object):
     @declared_attr
     def filename(self):
         return Column(String(30),
-                      nullable=False,
-                      unique=(self.__name__ == 'Image'))
+                      nullable=False)
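
For context, the constraint removed above was introduced in PR #331: the column is declared unique only for the concrete `Image` class, while archive classes share the same mixin without the constraint. A minimal stdlib-only sketch of that per-subclass evaluation (not the actual c2corg_api code; the `ArchiveImage` name here is illustrative):

```python
# Sketch (not the actual c2corg_api code): a mixin attribute evaluated
# once per subclass can enable a flag for one concrete class only,
# mirroring `unique=(self.__name__ == 'Image')` in image.py.
class _ImageMixin:
    @classmethod
    def filename_unique(cls):
        # True only for the table backing the concrete `Image` class
        return cls.__name__ == 'Image'

class Image(_ImageMixin):
    pass

class ArchiveImage(_ImageMixin):  # hypothetical archive subclass name
    pass

print(Image.filename_unique())         # True
print(ArchiveImage.filename_unique())  # False
```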
asaunier commented 8 years ago

Is there a chance that this issue will be fixed for alpha3? I wanted to reimport the demo DB but I noticed we still have this problem. If it is about to be fixed, I will wait before running the migration script; otherwise I could do it now.

arnaud-morvan commented 8 years ago

I have not worked on this yet. I'm pretty sure you can reimport it now.

asaunier commented 8 years ago

Should we do something about this issue?

tsauerwein commented 8 years ago

We should check which images have this problem, and why. Maybe it could be a simple manual fix in the old database.

asaunier commented 8 years ago

DETAIL: Key (filename)=(1279011081_1740341383.jpg) already exists.

c2corg=# select id, filename from images where filename = '1279011081_1740341383.jpg';
   id   |         filename          
--------+---------------------------
 227183 | 1279011081_1740341383.jpg
 227184 | 1279011081_1740341383.jpg
(2 rows)

It seems that 2 image documents share the same filename:

http://www.camptocamp.org/images/227183/fr/0007-
http://www.camptocamp.org/images/227184/fr/0007-

These are 2 distinct documents (created a few seconds apart), but both point to http://s.camptocamp.org/uploads/images/1279011081_1740341383.jpg

This is not an isolated case. In our copy of the v5 DB

select * from (
  SELECT id,
         filename,
         ROW_NUMBER() OVER(PARTITION BY filename ORDER BY id ASC) AS row
  FROM images
) dups
where dups.row > 1;

returns many duplicates (sometimes there are even up to 7 documents sharing the same filename!).
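
The same detection logic can be sketched in Python against a list of (id, filename) rows (a standalone illustration of the query above, not part of the migration script):

```python
from collections import defaultdict

def find_duplicate_images(rows):
    """Given (id, filename) pairs, return the rows the ROW_NUMBER()
    query flags: for each filename appearing more than once, every
    row except the one with the smallest id."""
    by_filename = defaultdict(list)
    for image_id, filename in rows:
        by_filename[filename].append(image_id)
    dups = []
    for filename, ids in by_filename.items():
        for image_id in sorted(ids)[1:]:  # keep the lowest id
            dups.append((image_id, filename))
    return sorted(dups)

rows = [
    (227183, '1279011081_1740341383.jpg'),
    (227184, '1279011081_1740341383.jpg'),
    (300000, 'unique.jpg'),
]
print(find_duplicate_images(rows))
# [(227184, '1279011081_1740341383.jpg')]
```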

brunobesson commented 8 years ago

Since the image documents share the same name, I assume this is due to a problem with the upload mechanism. Deleting the duplicates looks pretty safe to me (some associations could perhaps be lost, but this seems rather insignificant and will most likely not happen).

asaunier commented 8 years ago

We still have this problem (the migration script has just crashed on this error because I had forgotten the workaround mentioned above).

@stef74 @gottferdom do you think we could avoid those errors in the v5 DB?

@arnaud-morvan Do we really need this unique constraint in the v6 model? Perhaps it will cause the file (including for other image documents that share it) to be deleted when the image removal service will be implemented?

brunobesson commented 8 years ago

Like I said, at first glance it seems quite safe to remove these duplicates in V5. The problem is that we don't know how they occur, so doing a cleanup pass now doesn't guarantee the DB will be in a correct state when the data is migrated later. Stupid question: could this V5 DB modification (assuming we can do it safely) be included at the beginning of the migration script?
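
For illustration, such a pass could turn the ROW_NUMBER() detection query from this thread into a DELETE that keeps, for each filename, only the row with the lowest id. A hedged sketch against an in-memory SQLite database (the real v5 DB is PostgreSQL; table and column names are taken from the queries shown above):

```python
import sqlite3

# Same window-function idea as the detection query, turned into a
# DELETE: for each filename, keep the row with the lowest id and
# drop the rest. Sketch only; the real v5 database is PostgreSQL.
DEDUP_SQL = """
DELETE FROM images WHERE id IN (
  SELECT id FROM (
    SELECT id,
           ROW_NUMBER() OVER (PARTITION BY filename ORDER BY id ASC) AS rn
    FROM images
  ) dups
  WHERE dups.rn > 1
)
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (id INTEGER PRIMARY KEY, filename TEXT)")
conn.executemany(
    "INSERT INTO images VALUES (?, ?)",
    [(227183, '1279011081_1740341383.jpg'),
     (227184, '1279011081_1740341383.jpg'),
     (300000, 'other.jpg')])
conn.execute(DEDUP_SQL)
print(conn.execute("SELECT id FROM images ORDER BY id").fetchall())
# [(227183,), (300000,)]
```

Note that a raw DELETE like this says nothing about the underlying image files or associations, which is exactly the concern discussed later in this thread.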

asaunier commented 8 years ago

I would prefer that the changes be done in v5, including the addition of a "unique" constraint on the filename attribute.

arnaud-morvan commented 7 years ago

I've removed the unique constraint on the filename to automate the migration process. This could possibly be reverted later.

fjacon commented 7 years ago

@arnaud-morvan can we close this now?

arnaud-morvan commented 7 years ago

We might want to restore the constraint once the duplicated V5 images have been dealt with.

asaunier commented 7 years ago

I think it's better to keep the issue open until the go-live, because it depends on actions of the associations on the v5 DB.

lbesson commented 7 years ago

What am I supposed to do? 😀

Running the detection script just before the read-only switch and removing the duplicates via v5 should be sufficient IMHO.

asaunier commented 7 years ago

@lbesson exactly! :)

Could you confirm that existing duplicates have already been handled?

tsauerwein commented 7 years ago

It would probably be good if we already could test that the script works as expected with our migration script.

lbesson commented 7 years ago

@asaunier I can't confirm that all existing duplicates have been handled on the production db, since new ones could have been created since I last checked (we haven't fixed the bug that creates duplicates). But I can (almost, never say never) confirm that the procedure (run the script to get the duplicates list, then use the v5 interface to delete them without deleting the underlying image files) is OK.

@tsauerwein It's not a script to run against a database or db dump, just some extra code in v5 to make sure that when deleting an image document, we don't delete the attached image file if the image is in fact a duplicate. So I'm unsure how this could be tested against your migration code (apart from removing the remaining dups from the db, then making a dump...)

asaunier commented 7 years ago

I just wanted to make sure there wasn't a bunch of duplicates still to handle :) I know that some new duplicates may be created in the meantime and that handling them when v5 is made read-only might be needed.

stef74 commented 7 years ago

@lbesson we plan to switch v5 to read-only on Monday, Dec 5th and go live on the 8th. We have to fix the duplicates before the 6th.

lbesson commented 7 years ago

handling them just before v5 is made read-only might be needed

remember that handling the duplicates is done via v5 web interface :)

asaunier commented 7 years ago

You mean that manual actions must be done in the web interface?

arnaud-morvan commented 7 years ago

Note that we need the V5 dump as soon as possible to run the migrations. We have 3 migrations to do (topoguide, forum and images), then need to push the result to the prod environment, plus some checks and possibly some fixes after that, before the go-live. Can we plan to use the Sunday-to-Monday nightly V5 dump to start the migration on Monday morning? I would like V5 to be set read-only on Sunday evening.

lbesson commented 7 years ago

yes, you first list the duplicates via a db script, and then use the web interface to delete them as a moderator. (Note that it is preferable to delete the duplicate with the highest id, since it is the one without image meta information. That is, if you have duplicates 12345 / 12346 (ids are not always consecutive, and sometimes more than two documents reference the same image file), you should go to image document 12346 in the v5 web interface and delete it.)


asaunier commented 7 years ago

There were no duplicates in the v5 prod DB last night before v5 was made read-only, I think. None this morning either. Should be OK, so I'm closing the issue. Thank you very much @lbesson!