lognaturel commented 3 years ago

Closes #886

Prior to this change, submissions whose instanceIDs were in the instance database were filtered out when doing a pull from Central. The problem is that presence in the db does not mean any file was successfully saved for that instance. One change I made to address this is to not save an instanceID to the database if the pull task has been cancelled. I made this change to pull from Aggregate as well. What this should mean is that if a pull is cancelled, the submission(s) that are being downloaded at the time should not be skipped on a subsequent pull.

For Central, I also changed when the database is checked for an instance. Prior to this change, if an instanceID were in the database, the submission would be skipped entirely. Now, if the submission is in the database, we:

check for the presence of a submission file on disk and download it if it doesn't exist
download the attachment list
for each attachment, check if it exists on disk and download it if not

This means that there's going to be an extra disk check and an extra request for every single submission to get the attachment list. As a result, subsequent pulls will be slower, even for forms with no attachments (we still need to make the request to know there are no expected attachments). I think the tradeoff is worth it to ensure getting all media files in case of "success with errors".

I did not make the corresponding change for Aggregate. This is because there is no API endpoint to get a submission attachment listing. Getting it requires pulling the submission contents as well. We know we have Aggregate users for whom performance is quite important so the tradeoff did not seem worth it in this case. This means that when there's a "success with errors" case in Aggregate, some media could be missing even after a subsequent pull.

What has been done to verify that this works as intended?

Pulled from https://test.central.getodk.org/#/projects/149/forms/build_media_1556624635/submissions, cancelled, verified some media was missing, pulled again, verified the media was present. I also manually deleted all contents of an instance's directory, just the submission file, just an attachment and confirmed that the missing files were downloaded on a subsequent pull.

Why is this the best possible solution? Were any other approaches considered?

See discussion of tradeoffs above. I also changed the logging messages for submission attachments a little bit to visually link them to their submission and to provide a submission filename and instanceID in case of failure so those are easier to trace.

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

The biggest risk is that subsequent pulls from Central could be too slow to be practical in certain contexts. I don't think that's the case but it would depend on connection latency to a specific server and disk speed.

I did have to make a number of changes to the pull from Central flow so there's regression risk there. There should be no risk to pull from Aggregate because the only change I made was an added check on whether the pull was cancelled.

Does this change require updates to documentation? If so, please file an issue at https://github.com/getodk/docs/issues/new and include the link below.

No.

mmarciniak90 commented 3 years ago

@lognaturel I noticed that the second pull overwrites previously downloaded media. I observe that the modified date is changed. The funniest thing is that only I see this behavior on macOS and Ubuntu. @kkrawczyk123 didn't observe it.

lognaturel commented 3 years ago

Thanks, @mmarciniak90. That's definitely not supposed to happen.

mmarciniak90 commented 3 years ago

I hope you can reproduce it :crossed_fingers:

codecov-io commented 3 years ago

Codecov Report

Merging #887 into master will increase coverage by 0.39%. The diff coverage is 84.37%.

@@             Coverage Diff              @@
##             master     #887      +/-   ##
============================================
+ Coverage     47.14%   47.53%   +0.39%     
- Complexity     1626     1646      +20     
============================================
  Files           196      196              
  Lines         10547    10550       +3     
  Branches        763      767       +4     
============================================
+ Hits           4972     5015      +43     
+ Misses         5222     5184      -38     
+ Partials        353      351       -2

Impacted Files	Coverage Δ	Complexity Δ
...it/briefcase/pull/aggregate/PullFromAggregate.java	`69.14% <50.00%> (-0.40%)`	`28.00 <0.00> (ø)`
...briefcase/pull/central/PullFromCentralTracker.java	`70.19% <83.33%> (+11.92%)`	`24.00 <4.00> (+4.00)`
...atakit/briefcase/pull/central/PullFromCentral.java	`73.33% <87.50%> (+6.41%)`	`27.00 <10.00> (+8.00)`
...rg/opendatakit/briefcase/util/FileSystemUtils.java	`44.76% <0.00%> (+1.90%)`	`21.00% <0.00%> (+2.00%)`
...it/briefcase/reused/http/response/ClientError.java	`77.27% <0.00%> (+9.09%)`	`10.00% <0.00%> (+2.00%)`
.../org/opendatakit/briefcase/util/DatabaseUtils.java	`54.16% <0.00%> (+11.45%)`	`16.00% <0.00%> (+4.00%)`

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 98fb98e...8a010b1. Read the comment docs.

lognaturel commented 3 years ago

@mmarciniak90 I added some automated tests to verify that if a submission.xml file or submission attachment file exists after a pull attempt, it does not get re-downloaded. I don't see timestamps changing in macOS. I also ran it in debug mode and verified there's only a call to download an attachment if either the submission isn't in the database or the attachment file doesn't exist on disk.

Is it possible that you cleared your instances database between pulls somehow? I can reproduce the behavior you describe by deleting the info.hsqldb folder for my form, restarting Briefcase, and then pulling again. That behavior should be the same as in v1.17.4 and prior. That seemed like the safest approach to me -- if the db doesn't know about a specific instance, something weird happened so it makes sense to me to download all of it.

There's one more edge case I can think of. If you download a submission with many attachments and cancel part way, that submission won't be written to the db. So next time you do a pull, all of its attachments will be downloaded again. Again, the thinking there is that if there was a cancel part way through, we can't be confident of the integrity of the submission on disk so let's redownload all of it.

mmarciniak90 commented 3 years ago

Is it possible that you cleared your instances database between pulls somehow?

I recorded my steps. Take a look.

kkrawczyk123 commented 3 years ago

And it's the second thing that we have conflict about is I can't see that replacing that @mmarciniak90 is encountering on my device. The only difference that we are aware of is that I am on Ubuntu 16 and Marzena uses 18. Peek 2020-10-29 12-21

lognaturel commented 3 years ago

@mmarciniak90 this looks like the last edge case I mentioned. I see that you cancel after two media files are downloaded even though there are more. That means the submission is not recorded in the database as being downloaded. So the next time you pull, this whole submission is pulled again regardless of what is on disk. Does that make sense?

I’m going to make another quick change so “Success” doesn’t show up after a cancel, that’s annoying.

mmarciniak90 commented 3 years ago

Does that make sense?

Yes, so, media are not redownloaded only if the whole submission has been downloaded before cancelation.

Peek 2020-10-29 16-23

lognaturel commented 3 years ago

There's some amount of parallelism so I think what's going on is that your cancel stopped things before a few of the submissions were marked as downloaded. If you let the download go all the way the first time, are there any media files redownloaded the second time?

The other thing to try is that even if you let the download go all the way, if you then delete one media file in any instance, it should be redownloaded. The other media files in that instance should not be redownloaded. If those two things work, I'm pretty confident that what you're experiencing is just several submissions that were cancelled before they were marked as completely pulled.

lognaturel commented 3 years ago

media are not redownloaded only if the whole submission has been downloaded before cancelation.

Reading this over now, I think you're saying the behavior is as-expected? That is, media are redownloaded for submissions that were not fully downloaded. For submissions that were fully downloaded, media is not redownloaded. If that's the case, all is well. And my comment above explains why there might be more than one incomplete submission after a cancel.

kkrawczyk123 commented 3 years ago

I was also able to reproduce the edge case. We have verfied on Ubuntu and MacOS different pull and cancel scenarios form Central and Aggregate. As the situation is clear now we should merge it and start regression testing as fast as possible.

@getodk-bot unlabel "needs testing" @getodk-bot label "behavior verified"

getodk / briefcase

When pulling from Central, download submission or attachment files that don't exist #887

What has been done to verify that this works as intended?

Why is this the best possible solution? Were any other approaches considered?

How does this change affect users? Describe intentional changes to behavior and behavior that could have accidentally been affected by code changes. In other words, what are the regression risks?

Does this change require updates to documentation? If so, please file an issue at https://github.com/getodk/docs/issues/new and include the link below.

Codecov Report