FETCH all data? - Githubissues

ozym commented 1 year ago

Would it be useful to have a mechanism to say FETCH all the data from a server in one hit?

(e.g. without having to issue an INFO STREAMS request, parse the response, and then make a request).

crotwell commented 1 year ago

There is a FETCH with no arguments allowed. I am guessing that is what this does, but should be clarified in the docs. You would presumably still have to do a STATION and a SELECT, but those could be wildcarded I think.

ozym commented 1 year ago

If seq is -1 or omitted, then transfer starts from the next available packet.

This is for "DATA [seq]" but "FETCH [seq]" has the same meaning for seq. The "-1" is a magic word (number) to mean the next packet, but there isn't a magic number/word to indicate the first available packet.

crotwell commented 1 year ago

This seems like a problem, since FETCH is supposed to not wait for any additional new data. So a FETCH -1 would finish immediately with no data returned by that definition I think?

Perhaps for FETCH it should be:

If seq is -1 or omitted, then transfer starts from the earliest queued data.

crotwell commented 1 year ago

@andres-h can you comment on the expected behavior for FETCH -1 or just FETCH and if it should be different from DATA -1 and DATA?

andres-h commented 1 year ago

Looks like this was lost in the description, but FETCH -1 or just FETCH was supposed to return at least one packet as it does in legacy SeedLink. So it would wait for the next packet coming in.

The purpose of FETCH is cyclic transmission (originally for dial-up links). If FETCH -1 were to return immediately, it would be impossible to start cyclic transmission without having a sequence number beforehand (e.g., from INFO).

I'm not sure if fetching all data blindly would be useful. The legacy SeedLink does have (probably undocumented) DATA ALL and FETCH ALL commands, but I guess it would be better to set some limits with a time window.

chad-earthscope commented 1 year ago

Even if FETCH is cyclic it must start somewhere, which I believe is the intention of FETCH -1. So that could be repeated: a client submits FETCH -1, gets an END (~~BYE~~) response, and then submits FETCH -1 again in the future. With the whole concept of FETCH, waiting, potentially indefinitely, for a "next" packet does not make sense.

A bigger question: what is the use case for FETCH? It seems most (all?) features of a "dial-up" mode can be implemented by the client using DATA.

The legacy SeedLink does have (probably undocumented) DATA ALL and FETCH ALL commands

Slightly off topic. I either never knew this or forgot, so no support in libslink. In my own implementations I've used "uni-station" mode as an all-stations mode (i.e. DATA without a STATION) because it matches most use of uni-station mode, and is such a useful mode. I'm considering implementing this again for v4 and would prefer it to be part of the standard if others agreed.

In the current draft, this all-stations mode would be a shortcut for selecting all stations with wildcards. Currently, submitting DATA without a STATION is an error, and logically means "stream no selected data", which is useless. The shortcut is a logical change from "selecting none" to "selecting all".

crotwell commented 1 year ago

I was assuming the use of FETCH was quick connect, get all data ready and available, then disconnect, like for use over a high cost transmission line where connection time should be minimized. So how would a client that wants to recover the currently queued data, but not wait for additional packets use DATA? How would it know when the stream was finished and so know it was time to disconnect?

Is this the main (only?) use case for FETCH? And, just a question, but is this a use case seedlink4 must support?

While "get all" might be reasonable if the remote system has limited storage, it could be a bit dangerous when connecting to a data center with a large amount of data. Could a datacenter choose not to support no arg fetch?

chad-earthscope commented 1 year ago

My premise: the point and criteria at which a server decides the client has it "all" is somewhat arbitrary and often is no longer true moments later. In general the server does not have special knowledge of the data flow, only what is in its buffers, which is not all that special and may change a moment later.

A client can detect when the data streams being received are a) within some tolerance of "now" and b) when data has stopped flowing for X seconds, and send BYE to close the stream. A downside is needing to wait X seconds on each (re)connect for data streams that are not actively streaming current data. But otherwise, this would likely end up a very close equivalent of the data flow using FETCH. Close enough that I question why we need it. Consider also that the client end probably knows a lot more about the network connection limitations to the server than the server itself, and may have other criteria by which it wishes to limit the uptime. I do not feel very strongly that it should be removed, but it's protocol and implementation complexity that we might be able to do without.
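The client-side heuristic described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `FetchEmulator` and the `tolerance` and `idle_timeout` parameters are hypothetical and not part of SeedLink or any client library, and treating conditions (a) and (b) as alternative triggers for sending BYE is one possible reading of the description.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FetchEmulator:
    """Decides when a DATA client may send BYE (hypothetical helper)."""
    tolerance: float      # seconds: how close to "now" counts as caught up
    idle_timeout: float   # seconds: the "X seconds" of silence to wait for
    latest_data_time: Optional[float] = None   # end time of newest packet seen
    last_arrival: Optional[float] = None       # wall-clock time of last packet

    def on_packet(self, packet_end_time: float, now: float) -> None:
        """Record a received packet (all times are epoch seconds)."""
        if self.latest_data_time is None or packet_end_time > self.latest_data_time:
            self.latest_data_time = packet_end_time
        self.last_arrival = now

    def should_disconnect(self, now: float, connected_at: float) -> bool:
        """True when (a) data is near "now" or (b) the stream has gone quiet."""
        caught_up = (
            self.latest_data_time is not None
            and now - self.latest_data_time <= self.tolerance
        )
        # Measure silence from the last packet, or from connect if none arrived.
        reference = self.last_arrival if self.last_arrival is not None else connected_at
        idle = now - reference >= self.idle_timeout
        return caught_up or idle
```

A client loop would call `on_packet` for every packet received and poll `should_disconnect` between reads. The cost mentioned above shows up here as the `idle_timeout` wait whenever the streams are not delivering current data.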

Could a datacenter choose not to support no arg fetch?

A data center can choose to limit client access based on all sorts of criteria including "too much" data; that must be an option for a server to protect itself.

Note that the entire feed from the EarthScope export server is currently ~1.8 MB/second of 28,000+ streams. Of course it will grow as data rates increase, but right now that's less than a 4K video stream.

andres-h commented 1 year ago

In my own implementations I've used "uni-station" mode as an all-stations mode (i.e. DATA without a STATION) because it matches most use of uni-station mode, and is such a useful mode. I'm considering implementing this again for v4 and would prefer it to be part of the standard if others agreed.

Not sure if saving a few bytes is worth making the logic more complex. You aren't typing those commands by hand.

what is the use case for FETCH?

Grabbing what is available rather than waiting for complete data. Note that FETCH can also be used with time windows.

It could be used by an early warning application or a cron job that updates plots every 30 minutes. You don't want to get stuck waiting for data if a station is not sending.

A client can detect when the data streams being received are a) within some tolerance of "now" and b) when data has stopped flowing for X seconds, and send BYE to close the stream.

Yes, that is kind of "emulating" FETCH using DATA. I don't think it is a very clean solution.

andres-h commented 1 year ago

I can see that FETCH -1 waiting for data has some disadvantages as it conflicts with the idea of immediately getting what is available. Maybe a compromise would be returning the last packet of the queue (e.g., similar to -1 as a Python array index)?

crotwell commented 1 year ago

I am not sure I see the use case for FETCH -1. Is there a reason a client would only want one packet? I also like the idea that the fundamental difference between FETCH and DATA is FETCH never waits for more packets, while DATA does. FETCH -1 seems to break that.

Perhaps making the commands more explicit would avoid confusion. I don't think we really need to save a couple of bytes when we can design the protocol to avoid magic numbers. It seems like the -1 in DATA with time is just a placeholder. Perhaps all cases can be covered by:

DATA - get new packets as they arrive in the queue, no existing packets in queue
DATA ALL - start with first acceptable packet in queue, continue as new packets arrive
DATA SEQ seqnum - start with first packet whose sequence number is > seqnum, continue as new packets arrive
DATA TIME starttime [endtime] - start with first packet whose start > starttime and optionally < endtime, continue as new packets arrive
FETCH SEQ seqnum - get all packets currently in the queue, starting with the first packet whose sequence number is > seqnum, END after sending last acceptable packet from queue
FETCH TIME starttime [endtime] - start with first packet whose start > starttime and optionally < endtime, END after sending last acceptable packet from queue
FETCH ALL - start with first acceptable packet in queue, END after sending last acceptable packet from queue
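To make the proposed selection rules concrete, here is a minimal sketch of how the FETCH variants listed above might pick packets from an in-memory queue. It assumes a simplified `Packet` with only a sequence number and a start time; the names `Packet` and `select_queued` are hypothetical, and the comparisons simply mirror the ">" and "<" wording of the list rather than any finalized specification.

```python
from typing import List, NamedTuple, Optional


class Packet(NamedTuple):
    seq: int        # sequence number assigned by the server
    start: float    # packet start time (epoch seconds)


def select_queued(queue: List[Packet], mode: str,
                  seqnum: Optional[int] = None,
                  starttime: Optional[float] = None,
                  endtime: Optional[float] = None) -> List[Packet]:
    """Return the queued packets a FETCH-style request would send before END."""
    if mode == "ALL":
        return list(queue)
    if mode == "SEQ":
        if seqnum is None:
            raise ValueError("SEQ requires seqnum")
        return [p for p in queue if p.seq > seqnum]
    if mode == "TIME":
        if starttime is None:
            raise ValueError("TIME requires starttime")
        return [p for p in queue
                if p.start > starttime and (endtime is None or p.start < endtime)]
    raise ValueError(f"unknown mode: {mode}")
```

Under this reading, the matching DATA variant would send the same backlog and then keep the connection open for new packets, while FETCH would send END once the returned list is exhausted, even if that list is empty.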

Maybe servers that do not want to allow for the ALL variants can respond with ERROR?

Does it make sense to allow DATA TIME starttime endtime? The DATA command doesn't send END in any other cases, so this is a little weird. Does it send END, or is the connection left open forever? Maybe only the FETCH can have starttime and endtime?

crotwell commented 1 year ago

In my own implementations I've used "uni-station" mode as an all-stations mode (i.e. DATA without a STATION) because it matches most use of uni-station mode, and is such a useful mode. I'm considering implementing this again for v4 and would prefer it to be part of the standard if others agreed.

Not sure if saving a few bytes is worth making the logic more complex. You aren't typing those commands by hand.

I have a vague memory of a decision that uni-station mode was not going to be part of seedlink4. I agree that saving bytes isn't worth it here as STATION * * handles this exact use case.

andres-h commented 1 year ago

I am not sure I see the use case for FETCH -1. Is there a reason a client would only want one packet?

To start cyclic transmission from current data, but not wait for next packet.

I also like the idea that the fundamental difference between FETCH and DATA is FETCH never waits for more packets, while DATA does. FETCH -1 seems to break that.

No. In order to not break that, I suggested a change that FETCH -1 would return one packet and not wait for more packets. Of course, DATA -1 would then also start from the youngest packet in the queue and not (as currently proposed) from the next packet (very subtle difference).

Does it make sense to allow DATA TIME starttime endtime? The DATA command doesn't send END in any other cases, so this is a little weird. Does it send END, or is the connection left open forever?

It is left open forever, because it is difficult to detect when endtime is reached. E.g., should it wait for very low sample rate streams?

FETCH, on the other hand, just returns when there is currently no more data in the buffer. It does not guarantee that the user gets the full time window (e.g., if the data is delayed).

crotwell commented 1 year ago

To start cyclic transmission from current data, but not wait for next packet.

I don't understand this use case, can you elaborate?

No. In order to not break that, I suggested a change that FETCH -1 would return one packet and not wait for more packets. Of course, DATA -1 would then also start from the youngest packet in the queue and not (as currently proposed) from the next packet (very subtle difference).

...and if there are no packets in the queue and the station has been destroyed and will never send more data, but neither the server nor the client know that? Does it wait forever, or send END without sending a packet? Seems like if a client is starting from scratch, sending an INFO to find out the state of the queue makes more sense than trying to FETCH a single packet?

andres-h commented 1 year ago

...and if there are no packets in the queue and the station has been destroyed and will never send more data, but neither the server nor the client know that? Does it wait forever, or send END without sending a packet?

If the station was working before and it is configured in the server, there may be some data in the buffer. If the buffer is totally empty, it would send END without sending a packet.

I don't know how far the proposal is, but if it was already reviewed/accepted or something like that, we should probably not make fundamental changes.

chad-earthscope commented 1 year ago

I don't know how far the proposal is, but if it was already reviewed/accepted or something like that, we should probably not make fundamental changes.

The technical review stage has just begun, this is the time to discuss and consider final changes before a recommendation to approve or not is generated.

chad-earthscope commented 1 year ago

In my own implementations I've used "uni-station" mode as an all-stations mode (i.e. DATA without a STATION) because it matches most use of uni-station mode, and is such a useful mode. I'm considering implementing this again for v4 and would prefer it to be part of the standard if others agreed.

Not sure if saving a few bytes is worth making the logic more complex. You aren't typing those commands by hand.

I have a vague memory of a decision that uni-station mode was not going to be part of seedlink4. I agree that saving bytes isn't worth it here as STATION * * handles this exact use case.

No problem and understandable. The logic is not more complex in my own implementations so it will remain as an extension due to its high convenience and consistent behavior.

chad-earthscope commented 1 year ago

...and if there are no packets in the queue and the station has been destroyed and will never send more data, but neither the server nor the client know that? Does it wait forever, or send END without sending a packet?

If the station was working before and it is configured in the server, there may be some data in the buffer. If the buffer is totally empty, it would send END without sending a packet.

This description of behavior for an empty buffer sounds like what I suggested in https://github.com/FDSN/SeedLink/issues/9#issuecomment-1430225972

That seems the most sensible to me.

FETCH -1 and DATA -1 returning the youngest packet could transmit that same packet multiple times, forcing the client or other downstream process to deal with the generated duplicates. Not good.

crotwell commented 1 year ago

My vote is remove FETCH -1 and DATA -1 and require a clean-start client to either use INFO or use a starttime if that is available.

If the vote is to retain FETCH -1, can we at least rename it to not use -1 as it is really not acting like a seq number at that point. Perhaps FETCH LATEST or something meaningful?

I guess I am OK with FETCH ALL to get all queued data if there is a strong use case for it, but am unsure if that exists. @ozym can you elaborate on your idea? Do you have a specific use case that should be supported?

andres-h commented 1 year ago

FETCH -1 and DATA -1 returning the youngest packet could transmit that same packet multiple times, forcing the client or other downstream process to deal with the generated duplicates. Not good.

I don't see why a client should use -1 repeatedly. If it uses seq 55 repeatedly it also gets the same packet multiple times.

I'll try to stay out of this discussion as it leads to nowhere like many times before. If the proposal gets approved, I can implement everything that I need as a workaround or extension anyway.

ozym commented 1 year ago

This discussion has fleshed out a lot of what the two commands are expected to do. However, I'm still not sure what the difference is between a DATA request with a given end-time and the same request using FETCH. Maybe it's just a matter of expanding on this in the documentation.

I'm pretty sure the use case I was thinking of could be handled with a FETCH with a very wide time window, or doing an INFO request to extract the sequence numbers prior to a request.

ozym commented 1 year ago

Thinking on this, I now read it as future transmission of data (DATA), rather than existing data (FETCH).

chad-earthscope commented 1 year ago

A client can detect when the data streams being received are a) within some tolerance of "now" and b) when data has stopped flowing for X seconds, and send BYE to close the stream.

Yes, that is kind of "emulating" FETCH using DATA. I don't think it is a very clean solution.

I agree, it's not clean. But perhaps it's sufficient for the use cases that FETCH supports.

Do we know how often FETCH is used? My impression is that its use is very rare, if at all, but that may be wrong.

chad-earthscope commented 1 year ago

However, I'm still not sure what the difference is between a DATA request with a given end-time and the same request using FETCH.

Thinking on this, I now read it as future transmission of data (DATA), rather than existing data (FETCH).

I believe with FETCH the server will close the connection when it has sent all the data it can to fulfill the request, whereas with DATA it will wait for the time window to be completed. How a given server determines that the time window is completed is a gray area.

crotwell commented 1 year ago

Do we know how often FETCH is used? My impression is that its use is very rare, if at all, but that may be wrong.

Important question. If FETCH is not commonly used, or if there is not a compelling use case for it, then only having DATA makes sense to me.

The only use case I can come up with where FETCH is needed is pulling packets from a station over a satellite link, but I have no idea if anyone actually uses seedlink for this. Guess the answer is no.

andres-h commented 1 year ago

Feedback from proposal team

Define special sequence number -2 (start of buffer) in addition to previously defined -1 (end of buffer).

“FETCH -1” MAY return data if next packets arrive within a certain small time period.

“FETCH -2” returns all data that is available in the server for requested station(s)¹

Discussion

¹ What should be the role of -2 when time window is used?

crotwell commented 1 year ago

I still feel that using negative numbers as special cases is confusing, a source of bugs and has no advantages I can see. If that all/latest functionality is needed, just make a separate command for each, like FETCHALL.

Consider a client that has packet number 8 and decides to fetch the 10 previous packets:

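currSeqNum = 8
# ....
startSeqNum = currSeqNum - 10      # evaluates to -2, the proposed "start of buffer" value
cmd = "FETCH " + str(startSeqNum)  # builds the command string "FETCH -2"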

and ends up getting all the data that the server has.
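One way a client could guard against this kind of accident is sketched below, under assumptions. The helper name `build_fetch_command` is hypothetical, the special values -1 and -2 are taken from the proposal-team feedback quoted in this thread rather than from a finalized specification, and clamping to 0 assumes that 0 is an acceptable lowest sequence number.

```python
SPECIAL_SEQS = {-1, -2}  # assumed meanings: -1 end of buffer, -2 start of buffer


def build_fetch_command(seq: int, allow_special: bool = False) -> str:
    """Build a FETCH command, rejecting accidental negative sequence numbers."""
    if seq < 0 and not (allow_special and seq in SPECIAL_SEQS):
        raise ValueError(f"refusing to send computed sequence number {seq}")
    return f"FETCH {seq}"


curr_seq_num = 8
start_seq_num = max(curr_seq_num - 10, 0)  # clamp instead of going negative
cmd = build_fetch_command(start_seq_num)
```

The point of the sketch is that with magic numbers every client has to carry a guard like this, whereas a separately named command would make the "all data" request impossible to produce by arithmetic.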

djeastonca commented 11 months ago

From the specification and discussion in this issue thread, I've summarized the use cases I see and approaches that can be taken to satisfy them:

1) Real-time streaming: both scenarios below are satisfied by the DATA command in the specification proposal

Initiated without packet backfill (client wants to start with live data streaming): DATA request with no parameters supplied
Initiated with packet backfill (client wants to resume receiving streamed data with historical/backfill data which automatically blends into live data streaming): DATA request with a sequence number provided

2) Historical data retrieval (server only returns data available to it when processing the request): 3 alternative approaches can be taken:

DATA request (time-based), with the onus on the client to determine when enough data has been received, after which the client initiates connection termination
FETCH request (time-based), whereby the server provides all of the eligible data it has at the time the request is processed
Satisfy the data retrieval request outside of SeedLink (see further below)

3) Cyclic data transmission (a form of the historical retrieval use case - periodically update with any new data over a temporary network connection) follows suit from item 2 above:

DATA request, where a sequence # specifies the start of scope of data returned and then the onus is on the client to terminate the connection when enough data has been received, or the client has waited what it deems as long enough
FETCH request, where a sequence # specifies the start of scope of data returned, the server returns whatever data is available to it at the time of processing, then initiates termination of the connection with END. At the very beginning of cyclic data transmission from a station the sequence number is not known, so an INFO STREAMS request to determine the latest sequence number could be used.
Satisfy the data retrieval request outside of SeedLink (see further below)
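The FETCH-based cyclic approach in item 3 can be illustrated with a small in-memory simulation. Everything here is a stand-in: `FakeServer`, `latest_seq`, `fetch_after`, and `run_cycle` are hypothetical names, no network traffic is involved, and "after `seq`" is assumed to be exclusive so that a repeated cycle does not return the same packet twice.

```python
from typing import List, Tuple


class FakeServer:
    """In-memory stand-in for a server buffer (illustration only)."""

    def __init__(self) -> None:
        self.buffer: List[Tuple[int, bytes]] = []  # (sequence number, payload)

    def append(self, payload: bytes) -> None:
        next_seq = self.buffer[-1][0] + 1 if self.buffer else 1
        self.buffer.append((next_seq, payload))

    def latest_seq(self) -> int:
        """Stands in for reading the latest sequence number via INFO STREAMS."""
        return self.buffer[-1][0] if self.buffer else 0

    def fetch_after(self, seq: int) -> List[Tuple[int, bytes]]:
        """Return what is buffered after `seq`, then stop (END), never waiting."""
        return [item for item in self.buffer if item[0] > seq]


def run_cycle(server: FakeServer, last_seq: int) -> Tuple[List[bytes], int]:
    """One connect/FETCH/disconnect cycle; returns payloads and the new last_seq."""
    received = server.fetch_after(last_seq)
    if received:
        last_seq = received[-1][0]
    return [payload for _, payload in received], last_seq
```

A client would seed `last_seq` once from `latest_seq()` and then call `run_cycle` on each scheduled connection, persisting `last_seq` between runs. A cycle that finds nothing new returns an empty list and leaves `last_seq` unchanged.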

The possible uses of FETCH to satisfy the use case scenarios above are all done without the use of magic numbers, since it is ideal to avoid these in protocol specifications. Further to that end, if there are additional use cases involving the DATA command that actually require start/end time parameters then it would be ideal to consider making a small protocol syntax adjustment to the specification to eliminate the potential ambiguity between sequence number and start time parameters of the DATA command.

@chad-earthscope, from his experience working in a data management center environment providing a SeedLink interface, has indicated earlier in this issue thread his sense that the use of FETCH is rare. From my experience working for a company whose products include dataloggers, when SeedLink is used for data acquisition over a network, real-time mode is used exclusively on those products; on-demand data retrieval requests are instead satisfied without SeedLink, e.g. using the FDSN dataselect web service. The cyclic data transmission use case above is not as cleanly satisfied with FETCH, but it may be a moot point. Overall, it would be good to know if others are aware of cases to the contrary, where SeedLink FETCH is actively used in practice in favour of other historical data retrieval methods, to determine whether it's worth keeping FETCH or instead dropping it to simplify server implementations.

crotwell commented 9 months ago

Issue #17, if resolved as proposed, may affect this, possibly rendering it moot.

FDSN / SeedLink

FETCH all data? #9
