Open s-paquette opened 7 months ago
There is now a message box that displays
Related: #1372
Currently manifests for download are either chunked (up to 650k records) or capped at 65k (s5cmd).
If they can be chunked up to 650k records, why are they capped at 65k?
The preferred behavior would be to use BigQuery to generate the manifest
Any option that requires login would be highly undesirable.
Note that TCIA portal is capable of generating manifests that exceed what we can offer, and does not require a login. As the absolute minimum we should be able to do what TCIA does.
For the reference, NLST has a separate portal in TCIA that is available here: https://nlst.cancerimagingarchive.net/nbia-search/.
@fedorov
If they can be chunked up to 650k records, why are they capped at 65k?
Not clear why we implemented it without the chunking option available in the other downloads, but we certainly can do so. The cap would still be around 650k (10 chunks total), as this is when we begin running into paging issues in the index. This in turn impacts index performance for all users.
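For illustration, the chunking arithmetic described above could look roughly like the following. This is a minimal sketch, not the portal's actual code: the function name `plan_chunks` and the constants are hypothetical, and the 65k chunk size and 10-chunk limit are simply the figures quoted in this thread.

```python
CHUNK_SIZE = 65_000   # per-file record cap quoted in this thread
MAX_CHUNKS = 10       # 10 chunks, i.e. roughly 650k records overall


def plan_chunks(total_records, chunk_size=CHUNK_SIZE, max_chunks=MAX_CHUNKS):
    """Return (offset, count) pairs covering the records, up to the overall cap.

    Records beyond chunk_size * max_chunks are not covered; the caller
    would need to tell the user that the manifest was truncated.
    """
    capped = min(total_records, chunk_size * max_chunks)
    return [
        (offset, min(chunk_size, capped - offset))
        for offset in range(0, capped, chunk_size)
    ]
```

For example, `plan_chunks(100_000)` yields two chunks, one of 65,000 records and one of 35,000.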
Any option that requires login would be highly undesirable.
This would not necessarily require a login. We would use BQ to make the manifest (which offloads the work from the Solr VM and should generally take about the same time regardless of manifest size), place it into a bucket for the WebApp to pick up, then return it to the user. The particulars will vary depending on whether the user is logged in via Google: if they are, we can give them a link to a file with fine-grained permissions that they can download; if they are not logged in, or are logged in without Google, it would need to be a long-poll operation, but that is still workable.
Note that TCIA portal is capable of generating manifests that exceed what we can offer, and does not require a login. As the absolute minimum we should be able to do what TCIA does.
This is a difficult comparison because with TCIA, near as I can tell you can only generate a manifest by adding things to the cart, and you can only add to the cart in blocks of 500 cases. Hitting 65k series takes a significant amount of time in this method, especially since a given addition of 500 cases to the cart can sometimes take upwards of 20 seconds (or did for me), longer as you move deeper into the record set. It took me several minutes to get to a 10k series manifest.
This is the TCIA site running into the same problem we do--'deep paging', where you're asking the index to produce a set of several thousand records from a relatively arbitrary spot in the index.
In our portal the 500-case block restriction doesn't exist; you can download all series matching a filter set in one go. This makes it very easy to grab tens of thousands of series very quickly, but the downside is that we run into deep paging at manifest generation time. TCIA is basically front-loading the delay into the main UI by requiring a user to go through the slow process of queueing up all of those thousands of series. We CAN switch to this methodology, though I don't know that we should or need to; other options exist, including pre-generated manifests for collections and our planned cart selection system.
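To make the deep-paging cost concrete, here is a toy model rather than a measurement of either portal. It assumes that serving a page at offset `start` forces the index to collect and skip `start + page_size` documents, which is the usual explanation for why offset-based paging slows down deeper into a result set. The function name is hypothetical.

```python
def docs_touched_offset_paging(total, page_size):
    """Toy cost model: documents the index handles to serve every page.

    Assumes each page at offset `start` costs `start + page_size`
    documents (capped at `total`), so the work grows roughly
    quadratically with the number of pages.
    """
    return sum(
        min(start + page_size, total)
        for start in range(0, total, page_size)
    )
```

Under this model, paging through 65,000 records in blocks of 500 touches about 4.26 million documents in total, versus 65,000 for a single sequential pass. That is consistent with each block taking longer as you move deeper into the record set.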
This is a difficult comparison because with TCIA, near as I can tell you can only generate a manifest by adding things to the cart
You can also do "Download > Download query", which I believe will save a series-level manifest. I know this worked for me when I tried last week, but now that I try it, nothing is happening.
@fedorov For Download Query on either portal I'm told I must install specific software to use that, so it's using a different system entirely to download the manifest.
I believe that software referenced is needed in order to download the files referenced in the manifest, and not to download the manifest.
@fedorov Ah that makes more sense. Then we're back to the same issue--in TCIA making a manifest that actually hits our cap is done in a way which would take several minutes to perform, because they're forcing a user to add the series in manageable blocks. We're grabbing the entire block in one move, which is why we have to limit how large it is. We can change to their method, and effectively will when we move to our own cart system.
@s-paquette did you try what I suggested?
They are not requiring users to add items to the cart in chunks, and I am not suggesting downloading the cart. They allow the user to set a filter and then use the "Download query" feature to download a manifest of the content that corresponds to the current selection.
You can try this yourself using the steps I recorded in the screenshot posted earlier.
@fedorov When I click Download Query with a large number of series it takes about 30 seconds for me to get a response from their portal initially (if there are restrictions on use) and then another 30 or so for the actual file:
This is for a manifest only containing 54k entries--so it also doesn't reach our 65k cap. (There's also no feedback for me that anything is going on, but that might be due to my adblocker killing a popup they might have, or possibly it's just not working in Chrome.)
Because we have a 30s maximum timeout on the load balancer, we need to implement a solution in which we can poll and respond back to the user if a response takes more than 30s. 65k is the number we came up with to safely stay under this in the meantime. As mentioned above, a polling system is entirely doable, but the response will be slower the more entries are in the file, and it will impact Solr performance if the query is run there rather than fired off in BQ, with the result pulled from BQ and returned to the user in the expected format.
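A polling setup along these lines could be sketched as follows. This is only an in-process illustration under stated assumptions: `ManifestJobs`, `wait_for`, and the `build_manifest` callable are hypothetical names, a real deployment would keep job state somewhere shared rather than in memory, and the slow work would be the BQ job rather than a local thread. The point is that each request (submit, poll) returns quickly, so no single response has to outlast the 30s load balancer timeout.

```python
import threading
import time
import uuid


class ManifestJobs:
    """Tracks background manifest builds so clients can poll for results."""

    def __init__(self, build_manifest):
        self._build = build_manifest   # hypothetical callable: filters -> manifest
        self._jobs = {}
        self._lock = threading.Lock()

    def submit(self, filters):
        """Start a build in the background and return a job id immediately."""
        job_id = uuid.uuid4().hex
        with self._lock:
            self._jobs[job_id] = {"status": "pending", "result": None}
        worker = threading.Thread(
            target=self._run, args=(job_id, filters), daemon=True
        )
        worker.start()
        return job_id

    def _run(self, job_id, filters):
        try:
            update = {"status": "done", "result": self._build(filters)}
        except Exception as exc:  # report the failure instead of losing it
            update = {"status": "failed", "result": str(exc)}
        with self._lock:
            self._jobs[job_id] = update

    def poll(self, job_id):
        """Return a copy of the job state; unknown ids get status 'unknown'."""
        with self._lock:
            state = self._jobs.get(job_id, {"status": "unknown", "result": None})
            return dict(state)


def wait_for(jobs, job_id, timeout=5.0, interval=0.05):
    """Client-side loop: poll until the job leaves 'pending' or time runs out."""
    deadline = time.monotonic() + timeout
    state = jobs.poll(job_id)
    while state["status"] == "pending" and time.monotonic() < deadline:
        time.sleep(interval)
        state = jobs.poll(job_id)
    return state
```

In the WebApp, `submit` and `poll` would correspond to two short HTTP requests, with the browser repeating the poll until the manifest is ready.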
Ok, I just wanted to make sure you exercised that feature of the TCIA portal.
I personally am not a fan of chunked manifests, at all. I would prefer asynchronous manifest creation using BQ (or anything else, really), giving the user the link whenever it is ready. I think it is important that we provide this feature without requiring the user to log in.
If there is a pending manifest request, and the user attempts to navigate away from the page, warn the user and ask to confirm they are ok to lose that request.
Based on the discussion today, agreed:
Going forward, we discussed it should not be too much work to add an option to switch between study- vs series-level selection, which is tracked in a separate issue (can't find it).
TODO:
Currently manifests for download are either chunked (up to 650k records) or capped at 65k (s5cmd). The preferred behavior would be to use BigQuery to generate the manifest and then provide a link for downloading.
This manifest could be optionally GZipped (part of the BQ export-to-file option), allowing it to remain small. The file would be located in a bucket with list permissions disabled, allowing for a single access to the specific request.
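As a rough sketch of the export step, BigQuery's `EXPORT DATA` statement can write query results straight to a bucket as GZip-compressed CSV. The helper below only builds the statement text; the function name, the `manifests/` path layout, and the use of a random request id to make the object path hard to guess are assumptions for illustration, not the agreed design.

```python
import uuid


def build_export_statement(select_sql, bucket, request_id=None, gzip=True):
    """Build a BigQuery EXPORT DATA statement for one manifest request.

    Returns (uri, statement). The URI contains a single '*' wildcard,
    which EXPORT DATA requires because it may write several files.
    """
    # A random id per request keeps each export under its own prefix.
    request_id = request_id or uuid.uuid4().hex
    suffix = "csv.gz" if gzip else "csv"
    uri = f"gs://{bucket}/manifests/{request_id}/manifest-*.{suffix}"
    options = [
        f"uri='{uri}'",
        "format='CSV'",
        "overwrite=true",
        "header=false",
    ]
    if gzip:
        options.append("compression='GZIP'")
    statement = (
        "EXPORT DATA OPTIONS(\n  "
        + ",\n  ".join(options)
        + "\n) AS\n"
        + select_sql
    )
    return uri, statement
```

The resulting statement could then be submitted as an ordinary query job, for example through the `google-cloud-bigquery` client, with the `SELECT` supplied by whatever translates the user's filter set into SQL.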