ProjectSidewalk / SidewalkWebpage

Add standard deviation of image/label dates to the API #3031

misaugstad opened this issue 2 years ago (status: Open)

misaugstad commented 2 years ago
Brief description of problem/feature

Add standard deviation of image/label dates to the API. In #2928 we added the average label/image dates to the API, but we forgot to add the standard deviation!
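For concreteness, here is a minimal sketch (in Scala, not existing code from the repo) of the kind of calculation this issue asks for: the standard deviation of a set of label/image dates, expressed in days. The helper name and the choice of units are illustrative.

```scala
import java.sql.Timestamp

// Illustrative helper: population standard deviation of a set of dates, in days.
def dateStdevDays(dates: Seq[Timestamp]): Option[Double] = {
  if (dates.isEmpty) None
  else {
    val days: Seq[Double] = dates.map(_.getTime / 86400000.0) // ms since epoch -> days
    val mean: Double     = days.sum / days.size
    val variance: Double = days.map(d => math.pow(d - mean, 2)).sum / days.size
    Some(math.sqrt(variance))
  }
}
```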

davphan commented 1 year ago

I'm looking at modifying the computeAccessScoresForStreets method in ProjectSidewalkAPIController.scala similarly to #2928. Since the "AttributeForAccessScore" case class has its own average dates and image/label counts, does each instance represent a collection of labels? And if that's the case, do I need to find each individual image/label date from another source, or can I just take the standard deviation of the list of AttributeForAccessScore dates?

misaugstad commented 1 year ago

You know, now that you mention this, I think that this issue is why we didn't include standard deviation the first time around! You're right that the attributes represent a cluster of labels, so we can't just take a standard deviation over those. It looks like the GlobalAttributeForAPI class would probably need to have the avgImageCaptureDate: Timestamp, avgLabelDate: Timestamp, imageCount: Int, and labelCount: Int variables replaced with imageCaptureDates: List[(String, Timestamp)] and labelDates: List[Timestamp] variables, where the (String, Timestamp) for imageCaptureDates is for (gsvPanoramaId, panoCaptureDate). This way, the standard deviation could be calculated across labels and images.

Of course, replacing those columns would mean that calculating them for the /attributes API would take longer. So maybe you just include both the old variables and the new ones, even if they are redundant, for the sake of performance?
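A rough sketch of what that shape change might look like (the field names come from the comment above; everything else is simplified, and the real GlobalAttributeForAPI class has other fields that are omitted here):

```scala
import java.sql.Timestamp

// Simplified, hypothetical sketch of the proposed change; not the actual class.
case class GlobalAttributeForAPISketch(
  // Existing aggregate fields, possibly kept alongside the new ones for performance:
  avgImageCaptureDate: Timestamp,
  avgLabelDate: Timestamp,
  imageCount: Int,
  labelCount: Int,
  // New raw lists that make standard deviation (and other stats) computable:
  imageCaptureDates: List[(String, Timestamp)], // (gsvPanoramaId, panoCaptureDate)
  labelDates: List[Timestamp]
)
```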

davphan commented 1 year ago

After implementing the standard deviation call for just the /streets API endpoint, the time to complete the API call for the whole Pittsburgh test server went from around 3-4 seconds to upwards of 15-20 seconds.

Looking at implementing this for the /neighborhoods API endpoint, the time could be drastically worse. Currently, /neighborhoods calls the computeAccessScoresForStreets function used for /streets and uses the means from there to compute a new mean for the neighborhood (Line 502). For stdev, we'd need to pull out every label, which means the streets code can't be reused, and it would be significantly slower since we'd have to iterate over every label in each street on top of averaging the mean dates of each street.

How important is calculating stdev for the API? Is it worth multiplying the run time for all API calls by a factor of four or more? If it's still important, are there other solutions we could implement, like making stdev a completely separate API call from the existing ones, or making stdev calculation an optional parameter so users can keep the current fast runtime if they don't need it?
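To make the optional-parameter idea concrete, here is a minimal sketch of how the expensive per-label work could be gated behind a flag. All names here are hypothetical, and the Play routes/controller wiring that would expose the flag as a query parameter is omitted:

```scala
import java.sql.Timestamp

// Hypothetical per-street record: cheap aggregates are always present; raw label
// dates are fetched only when the caller asks for stdev.
case class StreetDateStats(avgLabelDate: Timestamp, labelCount: Int,
                           labelDates: Option[List[Timestamp]])

// Population standard deviation of dates, in days (same calculation sketched
// earlier in the thread).
def stdevDays(dates: List[Timestamp]): Double = {
  val days = dates.map(_.getTime / 86400000.0)
  val mean = days.sum / days.size
  math.sqrt(days.map(d => math.pow(d - mean, 2)).sum / days.size)
}

// When includeStdev is false we never touch per-label data, keeping the current
// fast runtime; when true we pay the cost of iterating over every label.
def neighborhoodLabelDateStdev(streets: List[StreetDateStats],
                               includeStdev: Boolean): Option[Double] = {
  if (!includeStdev) None
  else {
    val allDates = streets.flatMap(_.labelDates.getOrElse(Nil))
    if (allDates.isEmpty) None else Some(stdevDays(allDates))
  }
}
```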

misaugstad commented 1 year ago

the time to complete the API call for the whole Pittsburgh test server went from around 3-4 seconds to upwards of 15-20 seconds.

I think that the only way to implement these in a way that is performant would be to rewrite the whole thing as plain SQL queries, though they would each be well over 50 lines long!

How important is calculating stdev for the API? Is it worth multiplying the run time for all API calls by a factor of four or more? If it's still important, are there other solutions we could implement, like making stdev a completely separate API call from the existing ones, or making stdev calculation an optional parameter so users can keep the current fast runtime if they don't need it?

I believe that this is a question for @jonfroehlich, though I think that he's with family this week, so we'll probably have to wait until next week for a response.

IMO it isn't worth more of a time investment or making our APIs slower or more complicated. Though I believe that we should be reporting SD on principle, I don't think anyone is really using these numbers right now anyway.

misaugstad commented 12 months ago

@jonfroehlich now that you're not quite as bogged down in other work, thoughts on the above?

jonfroehlich commented 12 months ago

Thanks for this great discussion. Couldn't we calculate this once/day (at night) similar to how we do nightly clustering?

But I agree with this overall:

IMO it isn't worth more of a time investment or making our APIs slower or more complicated. Though I believe that we should be reporting SD on principle, I don't think anyone is really using these numbers right now anyway.

misaugstad commented 11 months ago

I suppose that we could, yes! If we decide in the future that it's worth the time/effort to add, that's definitely how we should do it!
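For future reference, a minimal sketch of what a nightly precomputation could look like, using only the JDK scheduler. The job body, the start-up wiring, and the idea of caching results for the API to read are all hypothetical; a real implementation would presumably hook into the same mechanism as the existing nightly clustering:

```scala
import java.util.concurrent.{Executors, TimeUnit}

// Hypothetical nightly job: recompute date stats once per day and cache them
// (e.g., in a table), so API calls read precomputed values instead of scanning
// every label at request time.
object NightlyDateStatsJob {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  def start(recomputeDateStats: () => Unit): Unit = {
    scheduler.scheduleAtFixedRate(
      new Runnable { def run(): Unit = recomputeDateStats() },
      1,       // initial delay: run shortly after startup
      24 * 60, // period: every 24 hours
      TimeUnit.MINUTES
    )
  }
}
```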