getodk / central-backend

Node.js based backend for ODK Central
https://docs.getodk.org/central-intro/
Apache License 2.0
50 stars 75 forks source link

409 if single submission has two attachments with identical content #299

Closed matthew-white closed 3 years ago

matthew-white commented 4 years ago

To reproduce:

  1. Upload a form that contains two image fields.
  2. Fill the form. Select the same file for the two image fields.
  3. When I do this on Enketo, I receive a 409:
<OpenRosaResponse xmlns="http://openrosa.org/http/response" items="0">
  <message nature="error">A resource already exists with sha value(s) of 24f3b9b5a653c052f7d366fe90cdace088d6e3c4.</message>
</OpenRosaResponse>

I am able to modify an existing test to generate a failing case:

 it('should save given attachments', testService((service) =>
   service.login('alice', (asAlice) =>
     asAlice.post('/v1/projects/1/forms?publish=true')
       .set('Content-Type', 'application/xml')
       .send(testData.forms.binaryType)
       .expect(200)
       .then(() => asAlice.post('/v1/projects/1/submission')
         .set('X-OpenRosa-Version', '1.0')
         .attach('my_file1.mp4', Buffer.from('this is test file one'), { filename: 'my_file1.mp4' })
         .attach('xml_submission_file', Buffer.from(testData.instances.binaryType.both), { filename: 'data.xml' })
-        .attach('here_is_file2.jpg', Buffer.from('this is test file two'), { filename: 'here_is_file2.jpg' })
-        .expect(201)
+        .attach('here_is_file2.jpg', Buffer.from('this is test file one'), { filename: 'here_is_file2.jpg' })
+        .expect(201)))));

Note that the issue seems to stem from files with identical content, not files with identical filenames (for which I think there is already a test).

I have a theory that the issue is related to blobs.ensure():

https://github.com/getodk/central-backend/blob/e9ffd2c0c3aa1a9475852e1397b8259e2b03165a/lib/model/query/blobs.js#L13-L18

blobs.ensure() checks whether a blob exists, then inserts it if it does not (SELECT followed by INSERT). However, when a submission is created, its attachments are inserted into the database in parallel. I think that results in a race condition whereby the checks may happen for two attachments with identical content before either blob is inserted (instead of SELECT, INSERT, SELECT, INSERT, the order is SELECT, SELECT, INSERT, INSERT). When I add logging to blobs.ensure(), I think I see that behavior when I run the test above.

Just an idea, maybe it'd be possible to solve this using an INSERT ON CONFLICT clause?

I think this issue is part of what was happening in this forum topic:

https://forum.getodk.org/t/in-odk-central-0-8-some-submissions-the-attachmentspresent-is-different-to-attachmentsexpected-how-avoid-this/26359

It also came up in this forum topic:

https://forum.getodk.org/t/testing-form-on-central-upload-error-a-resource-already-exists/30947

issa-tseng commented 3 years ago

the issue w insert on conflict is that you have to ship all the bits whether it ends up creating the data or no. and the way openrosa works i feel like duplicate data is maybe likely. adding for update didn't do anything though :/

lognaturel commented 3 years ago

We’re looking at a likely fix for the next Collect release. Since Collect always renames media files we’ll check file hashes and use the actual same file with the same name in multiple places if the user selects the same contents.

However, this limitation is definitely not part of the spec and Enketo or other clients are likely to send the same file with different file names. Maybe the file name can be considered in the constraint?

duplicate data is maybe likely.

I don’t really understand why the constraint is there in the first place, really. Would be good to talk through what this means.

issa-tseng commented 3 years ago

no it’s just a bug

On Nov 16, 2020, at 20:39, Hélène Martin notifications@github.com wrote:

 We’re looking at a likely fix for the next Collect release. Since Collect always renames media files we’ll check file hashes and use the actual same file in multiple places for that case.

However, this limitation is definitely not part of the spec and Enketo or other clients are likely to send the same file with different file names. Maybe the file name can be considered in the constraint?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

lognaturel commented 3 years ago

This does not appear to be fixed in v1.1.1. I have reproduced with https://test.central.getodk.org/#/projects/269/forms/multiple_background/draft/testing and form instance Multiple background recordings_2021-02-08_12-57-27.zip

lognaturel commented 3 years ago

I think what I'm getting is a slightly different issue which @matthew-white mentioned in his original description. I'll open a separate issue instead.