Dandiset info endpoint large response sizes

mvandenburgh commented 2 years ago

The dandiset /info endpoint currently returns every metadata validation error in a list. For dandisets with large amounts of assets, this causes abnormally large response payload sizes (example which will probably freeze your browser: https://api-staging.dandiarchive.org/api/dandisets/200026/versions/draft/info/)

The validation errors need to be paginated, either within the /info endpoint or as their own endpoint.

jjnesbitt commented 2 years ago

We could paginate them, but I don't see how that would realistically be used in practice. Like, we'd need to deal with that pagination on the client somehow. It might be enough to just truncate the errors after like 10 or something, and pass a boolean flag alongside that, indicating that there are more errors after these ones are dealt with. Sort of saying "fix these errors and then you'll see what you have to fix next".

We also could keep this endpoint as a fallback for utility purposes under /info/errors. Then pagination could be implemented there (if we deem it necessary).

What do you think?

mvandenburgh commented 2 years ago

I think truncation would be fine for the client, but yeah we'd likely still want some sort of /info/errors endpoint at least so that the full list of errors could be inspected via the API if needed.

jjnesbitt commented 2 years ago

I think truncation would be fine for the client, but yeah we'd likely still want some sort of /info/errors endpoint at least so that the full list of errors could be inspected via the API if needed.

I think this would work nicely actually, with the list being initially truncated, and then we could provide a "view all errors" button that would show the rest in a separate page/view/etc.

waxlamp commented 2 years ago

I think truncation would be fine for the client, but yeah we'd likely still want some sort of /info/errors endpoint at least so that the full list of errors could be inspected via the API if needed.

I think this would work nicely actually, with the list being initially truncated, and then we could provide a "view all errors" button that would show the rest in a separate page/view/etc.

Agreed, with the addition that "show the rest" would likely be in a paginated list (again, so as not to overwhelm either the server or the client).

I think we should also consider codifying errors in a more structured way than just slamming an error message into the DB. @AlmightyYakob, we did some work along these lines in the Multinet project, and I like how we did it there. Perhaps a similar design (heavily adapted to DANDI's needs) could work here.

mvandenburgh commented 2 years ago

Coming back to this, I've noticed another issue with the way we do things that (I think) makes pagination with our current data model impossible. The serializer that returns these validation errors calls this property method to retrieve the asset validation errors for each Asset in that Version. The two major issues here are 1) we're iterating through every Asset in a Version within an endpoint, and 2) we're not only iterating through every validation error for each of those assets, we're building up a list of all of them, which will blow up memory usage if there's a lot. Because we're using JSONField here, I'm not sure if there's a way to write a query to make this more efficient (@danlamanna @AlmightyYakob any idea if that is true?). I'm more and more beginning to think that using JSONField here is incorrect, and that we need to normalize things.

satra commented 2 years ago

@mvandenburgh - this might be a place where there is a separate view/page on the app that one can link to which will do paginations and other things, instead of trying to list them on the DLP. the DLP could then simply report a summary of the state of validation. we have always wanted details on the validation errors, which currently we don't have anyway (at least not in a very usable form). and no asset error can be corrected via the DLP, so separating out that view may be more practical as well the DLP could still list dandiset metadata errors, since those are fewer. but the asset errors could be done on a separate page with pagination.

jjnesbitt commented 2 years ago

The two major issues here are 1) we're iterating through every Asset in a Version within an endpoint, and 2) we're not only iterating through every validation error for each of those assets, we're building up a list of all of them, which will blow up memory usage if there's a lot.

While we're iterating over every invalid asset in python, it's only one query, so we could easily limit the number of assets returned (say 100), and paginate from there. As satra pointed out, we could (and I'd say should) move the asset validation errors into their own endpoint, and simply call that separately from the DLP. I'd think we would order by asset path, although we could also support ordering by error count, since for each asset, you can simply annotate the count of the validation_error json field (just confirmed that locally).

Because we're using JSONField here, I'm not sure if there's a way to write a query to make this more efficient (@danlamanna @AlmightyYakob any idea if that is true?). I'm more and more beginning to think that using JSONField here is incorrect, and that we need to normalize things.

I understand the desire to normalize things, but I'm not sure what that would get us in this case, nor do I agree that JSONField is inherently an incorrect choice for this, rather I think our application of it is. One method of normalization would be to replace the asset_validation_errors property with a field, and populate that field any time an underlying asset changes. That involves a decent amount of data duplication, but leaves us with essentially the same problem regarding large amounts of data anyway.

So to sum up, I think the best approach here would be to break the asset validation errors out into its own endpoint (probably .../versions/{version}/assets/errors), and do the pagination there. You could then link to that from the DLP based on the dandiset status (valid/invalid). What do you think about that @mvandenburgh? Also @satra does that seem like a usable workflow?

satra commented 2 years ago

does that seem like a usable workflow?

i think so.

dandi / dandi-archive

Dandiset info endpoint large response sizes #1214