dandi / dandiarchive-legacy

Code for the DANDI Web app
https://dandiarchive.org
Apache License 2.0

Q: best strategy to avoid empty items upon upload #358

Closed by yarikoptic 3 years ago

yarikoptic commented 4 years ago

ATM dandi-cli first creates an item and then uploads a file into it using uploadFileToItem. Those items are immediately visible in the web UI with 0 files/bytes in them. An upload might get interrupted (e.g. by a severed connection), leaving an empty item behind. The web files view also shows empty items for the duration of an upload, without any indication of whether they are still being uploaded or actually lack content. This is confusing. Ideally, items that are being uploaded would either not be listed in the web view until the upload completes, or would at least display some status message (e.g. "uploading", "no data", or alike).

What would you recommend, @mgrauer? E.g. is there some typical girder-based solution, such as uploading an item elsewhere with some trigger to "clean up" if the upload is interrupted, or moving it into the target location upon completion? Alternatively, dandi-cli could just populate the new item's metadata with some field (e.g. upload-status) which, if present, would instruct the web UI to either hide the item or show it with some indicator in the listing.

yarikoptic commented 4 years ago

I could possibly even keep updating that metadata record with the upload progress once in a while, so web ui could nicely show the progress of the uploading process.
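
A minimal sketch of the metadata idea above, not an existing dandi-cli feature. It assumes `client` behaves like `girder_client.GirderClient` (its `createItem`, `addMetadataToItem` and `uploadFileToItem` methods) and that item metadata is exposed under the item's `meta` key. The `upload-status` field name and the `is_visible` helper are hypothetical.

```python
import os

UPLOAD_STATUS_KEY = "upload-status"  # hypothetical metadata field name


def upload_with_status(client, folder_id, filepath):
    """Upload `filepath` into a new item, flagging the item while it is incomplete.

    `client` is assumed to behave like girder_client.GirderClient.
    """
    name = os.path.basename(filepath)
    item = client.createItem(folder_id, name)
    # Mark the item before any bytes are sent, so an interrupted upload
    # leaves behind a record that says so.
    client.addMetadataToItem(item["_id"], {UPLOAD_STATUS_KEY: "uploading"})
    client.uploadFileToItem(item["_id"], filepath)
    # Only reached if the upload did not raise.
    client.addMetadataToItem(item["_id"], {UPLOAD_STATUS_KEY: "complete"})
    return item


def is_visible(item):
    """One way a UI could filter a listing: hide items still marked as uploading."""
    status = item.get("meta", {}).get(UPLOAD_STATUS_KEY)
    # Items without the field (e.g. uploaded by older clients) stay visible.
    return status in (None, "complete")
```

If the connection is severed inside `uploadFileToItem`, the item keeps `upload-status: uploading`, so a listing or a later cleanup job could tell it apart from a finished upload. Periodic progress updates could reuse the same `addMetadataToItem` call with an extra field.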

mgrauer commented 4 years ago

I don't think it has been an issue for any use cases so far, you are the first one I've heard about this from. I wonder if this is because you are constantly re-uploading many large files or perhaps some other reason.

Where does this cause a problem for you specifically?

You can currently see when you have checked the box next to an item, in the detailed menu that displays to the right of the checked item, that the item has 0 or 1 or however many files, and you can see the number of bytes in that item. Is that not sufficient for you to see what's going on?

yarikoptic commented 4 years ago

I don't think it has been an issue for any use cases so far

I am indeed the first one (whoohoo!) to report it. We actually simply don't know if that is an issue. We might have incomplete uploads in drafts and nobody would discover them until they try to download everything. I have never done that.

We would also discover that to be an issue e.g. when we "publish". Empty items must not be a part of the release!

I wonder if this is because you are constantly re-uploading many large files or perhaps some other reason.

I have not uploaded anything for quite a while. Details of my specific use case: I asked a user to upload sample files to troubleshoot organize. After a while I went to the dandiset page and saw files (well -- items) listed there, and since I needed only a few, I asked the user to interrupt the upload. Then I tried to download that dandiset using dandi-cli, and it just reported a bunch of "detected empty item" messages without a single file being there. So I had to ask for a re-upload.

Where does this cause a problem for you specifically?

As described above -- a visitor (or me or us for that matter) of a draft dataset has no clue whether the actual data has been successfully uploaded or not. I do not think it is a desired property of the archive/platform, thus I was seeking your ideas on how we could overcome it.

You can currently see when you have checked the box next to an item

I know that. But are you suggesting that I (or any user) need to go through each file and perform this dance to find out whether the data is actually there?

mgrauer commented 4 years ago

I know that. But are you suggesting that I (or any user) need to go through each file and perform this dance to find out whether the data is actually there?

I was just trying to understand where the particular pain comes in. This next comment helped me follow where the real pain was and why looking at the individual item would be a very difficult dance indeed, as someone would have to look through all items (I thought you were concerned with the level of an individual item rather than at the dandiset level) to find what they were looking for.

As described above -- a visitor (or me or us for that matter) of a draft dataset has no clue whether the actual data has been successfully uploaded or not. I do not think it is a desired property of the archive/platform, thus I was seeking your ideas on how we could overcome it.

How about this as a solution? When you are looking at a Dandiset Landing Page (such as this example), it shows you the number of direct folder and item children of the top level Girder Folder of that Dandiset, but doesn't include any of the recursive child Folders, so we always see something like Files: 0, Folders: 15. If we were to populate the Dandiset Landing Page with the full recursive count of folders/items/bytes, would that help?

yarikoptic commented 4 years ago

If we were to populate the Dandiset Landing Page with the full recursive count of folders/items/bytes, would that help?

I think populating the landing page with the stats (folders/items/bytes) is a good idea in general and provides at least a partial resolution for #38. But I do not see how it would tell me if there is any item which is missing a file. If while collecting stats you also collect some additional stats (e.g. # of items without files, let's call them "incomplete files") -- that would be the indicator I need. While preparing such stats collection please keep it open to future extensions, so that solely based on the metadata records of the items we could also pick out "# of files without validation" etc. But again -- this is a solution on the server side. I still wonder if there is some girder magic/workflow which would simply allow me to avoid such cases altogether on the client side?
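
To make the "open to future extensions" request concrete, here is a rough sketch of such a stats collector, not the server's actual implementation. It assumes each item is a dict with a `size` field and optional `meta` metadata, as Girder items have, and treats `size == 0` as a proxy for "no file" (an item holding a genuinely empty file would also match). The `validated` metadata field and the check names are hypothetical.

```python
from collections import Counter

# Each named check maps an item record to True when the item should be counted.
# New indicators are added by registering another entry here.
CHECKS = {
    # Proxy for "item without files": nothing has been stored for it.
    "incomplete": lambda item: item.get("size", 0) == 0,
    # Hypothetical future check driven purely by item metadata.
    "unvalidated": lambda item: not item.get("meta", {}).get("validated", False),
}


def collect_stats(items, checks=CHECKS):
    """Aggregate item/byte totals plus one counter per registered check."""
    stats = Counter(items=0, bytes=0)
    for item in items:
        stats["items"] += 1
        stats["bytes"] += item.get("size", 0)
        for name, check in checks.items():
            if check(item):
                stats[name] += 1
    return stats
```

Feeding this every item found while walking a dandiset's folders recursively would give the landing page both the totals and an "incomplete" count in a single pass.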

mgrauer commented 4 years ago

At least we've made some progress on what server side improvements could help you :-)

I still wonder if there is some girder magic/workflow which would simply allow me to avoid such cases altogether on the client side?

I'm not following what you are looking for here. Do you want the client to call an endpoint and report something, or do you want the client to compare what it has locally with what has been uploaded, or something else?

yarikoptic commented 4 years ago

At least we've made some progress on what server side improvements could help you :-)

Any specific pointer/change?

Re client: let's just chat some time

mgrauer commented 4 years ago

Any specific pointer/change?

#38, #363

yarikoptic commented 4 years ago

Ah, I thought I had missed some PR ;-)

waxlamp commented 3 years ago

I think this is mooted by the migration. Reopen if not.