Applied-GeoSolutions / gips

Geospatial Image Processing System
GNU General Public License v3.0

CDL latency #504

Open bhbraswell opened 5 years ago

bhbraswell commented 5 years ago

I think that 426 days might be too long. Unless I misunderstand how this variable is used, I assume we are comparing the current date with the requested date (always Jan 1 of the requested year). For example, I know that CDL for 2018 was just released, which was a little later than normal but still only about 410 days in.
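To make the comparison concrete, here is a minimal sketch of how a latency check of this kind might work. The function name `is_fetchable` and its signature are hypothetical and not taken from the GIPS code base; the sketch only assumes, as described above, that the requested date is Jan 1 of the requested year.

```python
from datetime import date, timedelta


def is_fetchable(year, latency_days, today):
    """Hypothetical check: True once `latency_days` have elapsed since Jan 1 of `year`."""
    requested_date = date(year, 1, 1)
    return today >= requested_date + timedelta(days=latency_days)


# An illustrative "today" that falls 410 days after 2018-01-01.
today = date(2018, 1, 1) + timedelta(days=410)

print(is_fetchable(2018, 426, today))  # False: still inside the 426-day window
print(is_fetchable(2018, 395, today))  # True: a 395-day window has already passed
```

Under these assumptions, a release about 410 days in is blocked by a latency of 426 but allowed by 395.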

ircwaves commented 5 years ago

I am happy to defer to your recollection of what the normal CDL release date is -- I don't have any recollection of their release dates. Based on @ags-tolson's comments next to the 426, he may have just been off by 30 days. Moving to 395 is probably fine.

ra-tolson commented 5 years ago

If the data's released "in February" then you can't assume Feb 1, you have to assume Feb 28. That's 1 year + jan + feb = 365 + 31 + 29 = 426.

Who knows if that's the right way to do it, but that's what I was thinking.
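As a quick cross-check of the day count above, the sketch below computes the number of days from Jan 1 of a requested year to the last day of February of the following year. The helper name `days_to_end_of_feb` is made up for illustration. Note that the terms as written (365 + 31 + 29) sum to 425, and the exact calendar difference depends on leap years, so 426 reads as a slightly conservative upper bound rather than an exact figure.

```python
import calendar
from datetime import date


def days_to_end_of_feb(year):
    """Days from Jan 1 of `year` to the last day of February of `year + 1`."""
    last_feb_day = calendar.monthrange(year + 1, 2)[1]  # 28, or 29 in a leap year
    return (date(year + 1, 2, last_feb_day) - date(year, 1, 1)).days


for year in (2018, 2019, 2020):
    print(year, days_to_end_of_feb(year))
# 2018 423
# 2019 424
# 2020 424
```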

ircwaves commented 5 years ago

Right. That's a solid "I'm not going to ask for things that might not be there" latency number. I think Rob has turned more toward using an "only stop me from asking for things that have zero possibility of existing" number. Right, @bhbraswell?

bhbraswell commented 5 years ago

I'm just noting that with the current value of the latency parameter you are almost guaranteed to have some days or weeks where CDL data are available but GIPS will not retrieve them, for example right now. And for some people, the time right after the data are made available might be the most important time to get them.

I think I am partly responsible for the latency parameter but am not sure it is useful. Having a fetch fail because the data aren't ready yet, to me, is basically the same as having a fetch fail because the data were never collected.

In any case this isn't a blocker for me because I changed the value to zero in my copy. Thanks for the feedback.

ircwaves commented 5 years ago

Given your suggestion that it tends to be available in early February, and general agreement that you have a valid use case, I say it should be made a parameter configurable through the environment or settings.

Your argument for 396 or less is solid. The only risk in going lower is annoying the data provider, and possibly getting banned. (Think Google 503s, or remember when PRISM blacklisted us because someone's cron job mirrored the same 2 years of data every weekend?)
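One way the environment-configurable parameter suggested above could look is sketched here. The environment variable name `GIPS_CDL_LATENCY` and the function `cdl_latency` are assumptions made for illustration only; nothing in this thread says GIPS reads such a variable.

```python
import os

DEFAULT_CDL_LATENCY = 426  # days; the default value under discussion in this issue


def cdl_latency(environ=os.environ):
    """Return the CDL latency in days, preferring a hypothetical env override."""
    raw = environ.get("GIPS_CDL_LATENCY")  # hypothetical variable name
    if raw is None:
        return DEFAULT_CDL_LATENCY
    value = int(raw)  # raises ValueError on non-numeric input
    if value < 0:
        raise ValueError("latency must be a non-negative number of days")
    return value


print(cdl_latency({}))                          # 426
print(cdl_latency({"GIPS_CDL_LATENCY": "0"}))   # 0
```

Keeping the safe default and making the override explicit means a user who lowers the value has opted in to the risk of hitting the provider before the data exist.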

bhbraswell commented 5 years ago

Thanks Ian. This is obviously not the biggest deal in the world, sorry for taking so much of your time on it. I think eventually some sort of override switch is probably the way to go.

I know CDL is sort of a weird case, but in general I think a lot of users will be interested in absolutely the lowest latency possible, so either erring on the side of low latency parameter values or some sort of periodic review of the parameters might be useful.

ircwaves commented 5 years ago

I agree. I think that the right position is to have default latency settings that are safeguards against likely problematic queries (CDL2018 before 2019-01-01), and environment/gips.settings configurations that will allow people to run (at their own risk) in a manner that might result in aggravating data providers.

The only reason for defaulting to the safeguarded mode is for naive users and for automated settings -- i.e. a pipeline that retries a job until it succeeds, but it isn't going to succeed for 396 days.

ra-tolson commented 5 years ago

How about:

REPOS = {
    'cdl': {
        'feb_fetch': False, # <-- default
    }
}
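How such a flag would be consumed is not spelled out in the thread. The sketch below is one possible reading, stated as an assumption: with `feb_fetch` enabled, fetching is allowed from roughly the start of February (365 + 31 = 396 days after Jan 1 of the requested year), and otherwise the conservative 426-day default applies. The function name `cdl_latency_days` is hypothetical.

```python
# Settings fragment as proposed above.
REPOS = {
    'cdl': {
        'feb_fetch': False,  # default: wait out the full conservative window
    }
}


def cdl_latency_days(repos):
    """Hypothetical mapping from the proposed `feb_fetch` flag to a latency in days."""
    feb_fetch = repos.get('cdl', {}).get('feb_fetch', False)
    # Assumed values: 396 = 365 + 31 (about Feb 1 of the next year), 426 = current default.
    return 396 if feb_fetch else 426


print(cdl_latency_days(REPOS))                          # 426
print(cdl_latency_days({'cdl': {'feb_fetch': True}}))   # 396
```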