bgpkit / bgpkit-broker-backend

BGPKIT Broker backend services and API

Timestamps range boundaries are incorrectly checked #31

Open mikolaj-kow opened 1 year ago

mikolaj-kow commented 1 year ago

Greetings,

In the broker backend documentation, the timestamp is defined as the time of the beginning of the data dump, which usually aligns with the full five-minute marks.

For example, when I request 15 minutes of data, I expect to receive three BrokerItems. However, the broker produces four BrokerItems.

broker = bgpkit.Broker()
items = broker.query(ts_start="2014-04-01 00:00:00", ts_end="2014-04-01 00:14:59", project="riperis", data_type="update", print_url=True, collector_id="rrc00")
[BrokerItem(ts_start='2014-03-31T23:55:00', ts_end='2014-04-01T00:00:00', collector_id='rrc00', data_type='update', url='https://data.ris.ripe.net/rrc00/2014.03/updates.20140331.2355.gz', rough_size=155648, exact_size=0),
 BrokerItem(ts_start='2014-04-01T00:00:00', ts_end='2014-04-01T00:05:00', collector_id='rrc00', data_type='update', url='https://data.ris.ripe.net/rrc00/2014.04/updates.20140401.0000.gz', rough_size=129024, exact_size=0),
 BrokerItem(ts_start='2014-04-01T00:05:00', ts_end='2014-04-01T00:10:00', collector_id='rrc00', data_type='update', url='https://data.ris.ripe.net/rrc00/2014.04/updates.20140401.0005.gz', rough_size=145408, exact_size=0),
 BrokerItem(ts_start='2014-04-01T00:10:00', ts_end='2014-04-01T00:15:00', collector_id='rrc00', data_type='update', url='https://data.ris.ripe.net/rrc00/2014.04/updates.20140401.0010.gz', rough_size=174080, exact_size=0)]

This results in:

from datetime import datetime

elems = bgpkit.Parser(url=items[0].url).parse_all()
print(datetime.utcfromtimestamp(elems[0]["timestamp"]))
>>>2014-03-31 23:55:00

If I try this:

broker = bgpkit.Broker()
items = broker.query(ts_start="2014-04-01 00:00:01", ts_end="2014-04-01 00:15:00", project="riperis", data_type="update", print_url=True, collector_id="rrc00")
[BrokerItem(ts_start='2014-04-01T00:00:00', ts_end='2014-04-01T00:05:00', collector_id='rrc00', data_type='update', url='https://data.ris.ripe.net/rrc00/2014.04/updates.20140401.0000.gz', rough_size=129024, exact_size=0),
 BrokerItem(ts_start='2014-04-01T00:05:00', ts_end='2014-04-01T00:10:00', collector_id='rrc00', data_type='update', url='https://data.ris.ripe.net/rrc00/2014.04/updates.20140401.0005.gz', rough_size=145408, exact_size=0),
 BrokerItem(ts_start='2014-04-01T00:10:00', ts_end='2014-04-01T00:15:00', collector_id='rrc00', data_type='update', url='https://data.ris.ripe.net/rrc00/2014.04/updates.20140401.0010.gz', rough_size=174080, exact_size=0),
 BrokerItem(ts_start='2014-04-01T00:15:00', ts_end='2014-04-01T00:20:00', collector_id='rrc00', data_type='update', url='https://data.ris.ripe.net/rrc00/2014.04/updates.20140401.0015.gz', rough_size=121856, exact_size=0)]

I get this:

elems = bgpkit.Parser(url=items[-1].url).parse_all()
print(datetime.utcfromtimestamp(elems[-1]["timestamp"]))
>>>2014-04-01 00:20:00

So it turns out the timespans I defined were already shorter than 15 minutes:

ts_start='2014-04-01T00:00:01'
ts_end  ='2014-04-01T00:15:00'
datetime.fromisoformat(ts_end) - datetime.fromisoformat(ts_start)
>>> datetime.timedelta(seconds=899)

An easy workaround is using ts_start="2014-04-01 00:00:01", ts_end="2014-04-01 00:14:59".
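
For completeness, a sketch of that workaround (the expected item list is inferred from the outputs above):

broker = bgpkit.Broker()
items = broker.query(
    ts_start="2014-04-01 00:00:01",
    ts_end="2014-04-01 00:14:59",
    project="riperis",
    data_type="update",
    collector_id="rrc00",
)
# Only the three five-minute files inside the window remain:
assert [i.ts_start for i in items] == [
    "2014-04-01T00:00:00",
    "2014-04-01T00:05:00",
    "2014-04-01T00:10:00",
]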

Best regards

digizeph commented 1 year ago

Hi @mikolaj-kow, thanks for bringing this up. I am aware of this issue but ended up keeping the current behavior. The main reason is that MRT files sometimes do include messages at the exact end timestamp. In your example, the previous file was included because it might contain messages announced exactly at the hour mark.

I am open to changing the default behavior to exclude files whose end time only touches the start time of the filter, and to adding an additional flag for the current behavior.

mikolaj-kow commented 1 year ago

Hi, it's a tough one :) I get the idea behind returning more results rather than omitting some. However, typically one side of a range should be inclusive and the other exclusive. That way, one can define contiguous ranges without overlaps or gaps (see the sketch below).
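
To illustrate the convention (a standalone sketch, nothing bgpkit-specific): with half-open intervals [start, end), every timestamp falls into exactly one window:

from datetime import datetime, timedelta

start = datetime(2014, 4, 1, 0, 0, 0)
step = timedelta(minutes=5)
windows = [(start + i * step, start + (i + 1) * step) for i in range(3)]

def in_window(ts, window):
    lo, hi = window
    return lo <= ts < hi  # inclusive start, exclusive end

# A boundary timestamp belongs to exactly one window: no overlap, no gap.
boundary = datetime(2014, 4, 1, 0, 5, 0)
assert sum(in_window(boundary, w) for w in windows) == 1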

I tried to apply a Parser timestamp filter to overcome this Broker behaviour and get a precise time range. This should be possible, but in Python it fails with ValueError: Error: unknown filter type: ts_start and ValueError: Error: unknown filter type: ts_end. Can you have a look?
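
What I tried looks roughly like this (a sketch; 1396310400 and 1396311300 are the Unix timestamps for the 15-minute window above):

import bgpkit

# Both keys are rejected: "Error: unknown filter type: ts_start" / "ts_end".
elems = bgpkit.Parser(
    url="https://data.ris.ripe.net/rrc00/2014.04/updates.20140401.0000.gz",
    filters={"ts_start": "1396310400", "ts_end": "1396311300"},
).parse_all()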

digizeph commented 1 year ago

@mikolaj-kow it seems there are some inconsistencies in the naming of the filters. Could you try start_ts and end_ts instead? I will make sure both versions are accepted in the next release.

mikolaj-kow commented 1 year ago

I was able to successfully use the following dictionary as a Parser filter: {'start_ts': '1396364401.0', 'end_ts': '1396450801.0'}. Notice that the f64 values are cast to str. When I try to use values from datetime.timestamp() directly, I get TypeError: argument 'filters': 'float' object cannot be converted to 'PyString'.
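
In other words, something like this works (a sketch; the URL just follows the file-naming pattern shown earlier, and the timestamps match the filter values above):

import bgpkit
from datetime import datetime, timezone

start = datetime(2014, 4, 1, 15, 0, 1, tzinfo=timezone.utc)
end = datetime(2014, 4, 2, 15, 0, 1, tzinfo=timezone.utc)

# datetime.timestamp() returns a float; Parser filter values must be strings.
filters = {
    "start_ts": str(start.timestamp()),  # '1396364401.0'
    "end_ts": str(end.timestamp()),      # '1396450801.0'
}
elems = bgpkit.Parser(
    url="https://data.ris.ripe.net/rrc00/2014.04/updates.20140401.1505.gz",
    filters=filters,
).parse_all()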

digizeph commented 1 year ago

Good to hear! Yeah, the string conversion is expected. The timestamp parameters accept both Unix timestamps and RFC 3339 time strings (e.g. 2020-01-02T00:00:00Z), and thus need to be strings.

mikolaj-kow commented 1 year ago

Getting kind of offtopic ;) but wanted to let you know, I had no luck with any of these:

ValueError: Error: cannot parse f64 value from 2020-01-02T00:00:00
ValueError: Error: cannot parse f64 value from 2020-01-02T00:00:00Z
ValueError: Error: cannot parse f64 value from 2020-01-02T00:00:00+00:00

The last one is the ISO format from dt.isoformat(), which AFAIK should be compliant with RFC 3339.

digizeph commented 1 year ago

This is a great catch! Will attempt to patch these later today.

mikolaj-kow commented 1 year ago

I've tested the Broker.query() as well, for a more complete picture.

For timestamps, int and str(int) work OK, but float and str(float) do not; the latter did work with the Parser, though. So it depends on whether you wish to support microsecond resolution.

RFC/ISO-formatted datetimes without a timezone, or with a Zulu suffix, work without issues, even with microseconds.

RFC/ISO-formatted datetimes with a negative offset like 2020-01-02T00:00:00-00:00 work OK, but positive offsets like 2020-01-02T00:00:00+00:00 or 2020-01-02T00:00:00.123000+00:00 do not. The failing cases surface like this:

File ~/.pyenv/versions/bgpcompute-3.11.2/lib/python3.11/site-packages/bgpkit/bgpkit_broker.py:63, in Broker.query(self, ts_start, ts_end, collector_id, project, data_type, print_url)
     61 res = requests.get(api_url).json()
     62 while res:
---> 63     if res["count"] > 0:
     64         data_items.extend([BrokerItem(**i) for i in res["data"]])
     66         if res["count"] < res["page_size"]:

TypeError: '>' not supported between instances of 'NoneType' and 'int'
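
Given the above, a defensive pattern on my side is to pass whole-second Unix timestamps as ints (a sketch; the values are just the example range from earlier):

import bgpkit
from datetime import datetime, timezone

# int (or str(int)) Unix timestamps are accepted by Broker.query();
# float and str(float) currently are not.
ts_start = int(datetime(2014, 4, 1, 0, 0, 0, tzinfo=timezone.utc).timestamp())
ts_end = int(datetime(2014, 4, 1, 0, 15, 0, tzinfo=timezone.utc).timestamp())

broker = bgpkit.Broker()
items = broker.query(
    ts_start=ts_start,
    ts_end=ts_end,
    project="riperis",
    data_type="update",
    collector_id="rrc00",
)
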
digizeph commented 1 year ago

Hi @mikolaj-kow, thank you for your efforts! Are you using the default broker instance or a self-hosted one? I've updated the default instance to allow float-number search. However, I cannot reproduce the timezone-related issues you mentioned (see the updated test here: https://github.com/bgpkit/pybgpkit/blob/7f4e6cd2d654c505821554daf720d10a9415fce8/bgpkit/test_integration.py#L28).

mikolaj-kow commented 1 year ago

You're welcome. I'm happy I can help a bit.

I'm using a self-hosted Docker instance due to a considerable number of queries. It appears that the default broker works just fine for the same query.

≻ curl "0.0.0.0:18888/search?ts_start=2022-02-02T00:00:00-00:00&ts_end=2022-02-02T00:20:00.123000+00:00&collector_id=rrc00"
{"count":null,"page":null,"page_size":null,"error":"failed to parse ts_end time string: Expected an ISO 8601-like string, but was given '2022-02-02T00:20:00.123000 00:00'. Try passing in a format string to resolve this.","data":null}
≻ curl "0.0.0.0:18888/search?ts_start=2022-02-02T00:00:00-00:00&ts_end=2022-02-02T00:20:00+00:00&collector_id=rrc00"
{"count":null,"page":null,"page_size":null,"error":"failed to parse ts_end time string: Expected an ISO 8601-like string, but was given '2022-02-02T00:20:00 00:00'. Try passing in a format string to resolve this.","data":null}
≻ curl "https://api.bgpkit.com/broker/search?ts_start=2022-02-02T00:00:00-00:00&ts_end=2022-02-02T00:20:00+00:00&collector_id=rrc00"
{"count":7, ...

However, if I escape the + sign as %2B, I get the expected output:

≻ curl "0.0.0.0:18888/search?ts_start=2022-02-02T00:00:00-00:00&ts_end=2022-02-02T00:20:00.123000%2B00:00&collector_id=rrc00"
{"count":7 ...

The broker-api container logs:

INFO:     172.18.0.1:44844 - "GET /search?ts_start=2022-02-02T00:00:00-00:00&ts_end=2022-02-02T00:20:00+00:00&collector_id=rrc00 HTTP/1.1" 200 OK
INFO:     172.18.0.1:47670 - "GET /search?ts_start=2022-02-02T00:00:00-00:00&ts_end=2022-02-02T00:20:00.123000+00:00&collector_id=rrc00 HTTP/1.1" 200 OK
INFO:     172.18.0.1:42460 - "GET /search?ts_start=2022-02-02T00:00:00-00:00&ts_end=2022-02-02T00:20:00.123000%2B00:00&collector_id=rrc00 HTTP/1.1" 200 OK
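
For the record, this is standard URL decoding: an unescaped + in a query string is decoded as a space, which is exactly why the error shows '2022-02-02T00:20:00 00:00'. A client-side sketch with requests, which percent-encodes params passed as a dict:

import requests

# Passing params as a dict lets requests percent-encode them, so the '+'
# in the offset is sent as '%2B' instead of decoding to a space.
resp = requests.get(
    "http://0.0.0.0:18888/search",
    params={
        "ts_start": "2022-02-02T00:00:00-00:00",
        "ts_end": "2022-02-02T00:20:00+00:00",
        "collector_id": "rrc00",
    },
)
print(resp.json()["count"])  # 7, matching the escaped curl query above
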
digizeph commented 1 year ago

@mikolaj-kow Thank you for providing us with the detailed logs!

Currently, there is a tech-stack discrepancy between our hosted version and the Docker version. Our hosted version has been updated to a serverless JavaScript-based API, which handles concurrent requests more efficiently and integrates seamlessly with our backend Postgres instance. The Docker version, on the other hand, remains a Python-based API, and there are some differences in parameter checking.

We are actively working on closing this gap by incorporating a PostgREST instance into the Docker setup, which will enable querying via HTTP and allow us to reuse the same JavaScript API for the self-hosted version. We anticipate that this upgrade will be completed in the coming weeks.

In addition, we have made significant improvements to our hosting infrastructure, and the default broker is now fully equipped to handle a large number of requests. Please feel free to use it. If you have any further concerns or require customization, please do not hesitate to reach out to us at contact@bgpkit.com.