binance / binance-public-data

Details on how to get Binance public data

Bad AggTrade data for month of November 2023 #293

Closed. CaymanTurtleBeach closed this issue 9 months ago

CaymanTurtleBeach commented 10 months ago

It appears that every other record in the November BTCUSDT AggTrade data has a transact_time of 1701109037427, which is out of sync with the prior and following timestamps. Something is really screwy. Here are the first few records:

| agg_trade_id | price | quantity | first_trade_id | last_trade_id | transact_time | is_buyer_maker |
|---|---|---|---|---|---|---|
| 1900987669 | 34651.4 | 0.258 | 4244484021 | 4244484028 | 1698796805027 | FALSE |
| 1929435985 | 37007.6 | 0.005 | 4327092295 | 4327092295 | 1701109037427 | FALSE |
| 1900987670 | 34651.3 | 0.001 | 4244484029 | 4244484029 | 1698796805055 | TRUE |
| 1929435986 | 37007.8 | 0.008 | 4327092296 | 4327092296 | 1701109037427 | FALSE |
| 1900987671 | 34651.4 | 0.007 | 4244484030 | 4244484031 | 1698796805056 | FALSE |
| 1929435987 | 37008 | 0.01 | 4327092297 | 4327092297 | 1701109037427 | FALSE |
| 1900987672 | 34651.3 | 0.361 | 4244484032 | 4244484034 | 1698796805061 | TRUE |
| 1929435988 | 37008.5 | 0.01 | 4327092298 | 4327092298 | 1701109037427 | FALSE |
| 1900987673 | 34651.4 | 0.007 | 4244484035 | 4244484037 | 1698796805076 | FALSE |
| 1929435989 | 37009 | 0.01 | 4327092299 | 4327092299 | 1701109037427 | FALSE |
| 1900987674 | 34651.3 | 0.033 | 4244484038 | 4244484042 | 1698796805095 | TRUE |
| 1929435990 | 37009.4 | 0.005 | 4327092300 | 4327092300 | 1701109037427 | FALSE |

Note that the agg_trade_id values are also out of sync - there are two different incrementing value streams: 1900987669, 1900987670, ... and, interspersed with those values, a stream of 1929435985, 1929435986, ...

Please fix and update the November data. And please explain how this could happen - it should be a relatively trivial process to create a file from sequential agg trade events in a month. It is hard to trust any Binance historical data with these kinds of errors, and it also causes (at least for me) suspicions about any Binance data - including real-time prices and price matching (i.e., trades). Thanks.
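The interleaving reported in this comment can be checked mechanically. The following is a minimal sketch, not Binance's tooling: it assumes the file is comma-separated with agg_trade_id in the first column, and that consecutive records should have IDs that increase by exactly one. The function name `find_order_breaks` and the inline `SAMPLE` text (the first four rows quoted in the issue, rewritten with commas) are illustrative.

```python
import csv
import io

def find_order_breaks(rows):
    """Return (row_index, previous_id, current_id) for every record whose
    agg_trade_id is not exactly one more than the previous record's.
    Assumption: IDs are expected to be consecutive within the file."""
    breaks = []
    prev_id = None
    for i, row in enumerate(rows):
        # Skip blank lines and a header row, if the file has one.
        if not row or not row[0].isdigit():
            continue
        cur_id = int(row[0])
        if prev_id is not None and cur_id != prev_id + 1:
            breaks.append((i, prev_id, cur_id))
        prev_id = cur_id
    return breaks

# First four rows quoted in the issue, comma-separated for this sketch.
SAMPLE = """\
1900987669,34651.4,0.258,4244484021,4244484028,1698796805027,FALSE
1929435985,37007.6,0.005,4327092295,4327092295,1701109037427,FALSE
1900987670,34651.3,0.001,4244484029,4244484029,1698796805055,TRUE
1929435986,37007.8,0.008,4327092296,4327092296,1701109037427,FALSE
"""

breaks = find_order_breaks(csv.reader(io.StringIO(SAMPLE)))
for index, prev_id, cur_id in breaks:
    print(f"row {index}: {prev_id} -> {cur_id}")
```

On a correctly ordered file the list of breaks would be empty under these assumptions; on the quoted rows every transition is flagged because the two ID streams alternate.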

CaymanTurtleBeach commented 10 months ago

After a more thorough review of the data (i.e., looking at the last records as well as the first), the problem appears to be that records which should be at the end of the file were intermittently written to the beginning of the file. To wit: the agg_trade_id of the LAST record in the current November BTCUSDT file is 1929435984, and its transact_time is 1701109037427. The agg_trade_id of the SECOND record in the current November file is 1929435985, also with a transact_time of 1701109037427, indicating that it (the now-second record) should actually follow what is now the last record. This pattern appears to hold throughout the file, though the agg_trade_id of the actual last trade in November is buried somewhere in the middle of the file - the (current) last X records in the file that I viewed have sequential agg_trade_ids, as expected. I would guess this issue is also the cause of the mis-ordered October data reported by @alaky2.
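If the diagnosis above is right (the records are intact but misplaced), sorting by agg_trade_id should restore a file in which transact_time never decreases. The sketch below tests that hypothesis on (agg_trade_id, transact_time) pairs taken from the rows quoted in the first comment. The helper name `restore_order` is hypothetical, and the in-memory sort is an assumption for illustration; a full monthly file may need an external or chunked sort.

```python
def restore_order(records):
    """Sort (agg_trade_id, transact_time) records by agg_trade_id and report
    whether transact_time is non-decreasing in the sorted result."""
    ordered = sorted(records, key=lambda r: r[0])
    times_ok = all(a[1] <= b[1] for a, b in zip(ordered, ordered[1:]))
    return ordered, times_ok

# (agg_trade_id, transact_time) pairs from the first six rows in the issue.
records = [
    (1900987669, 1698796805027),
    (1929435985, 1701109037427),
    (1900987670, 1698796805055),
    (1929435986, 1701109037427),
    (1900987671, 1698796805056),
    (1929435987, 1701109037427),
]

ordered, times_ok = restore_order(records)
print([r[0] for r in ordered])
print("timestamps consistent after sort:", times_ok)
```

If `times_ok` were False after sorting, the problem would be more than misplacement, since IDs and timestamps would disagree with each other.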

This situation is disturbing. It indicates a fundamental bug has existed for at least two months in what should be a straightforward process of archiving real trades. I respectfully suggest that Binance contact the folks at Chronicle Queue to learn how to archive data at scale and retrieve it as necessary. A Chronicle Queue can efficiently archive the raw data in the order trades are completed; then re-running the queue at the end of the month to aggregate trades per the definition of 'aggregate trades' that Binance publishes should be 'easy' - especially if the types of trades that should be excluded are archived in other Chronicle Queues. Merging two data streams is a classic computer science 101 exercise - someone at Binance should learn how to do it.

Jack Corporate User
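For reference, the "merging two data streams" exercise mentioned in this comment can be sketched with Python's standard-library `heapq.merge`, which lazily merges inputs that are each already sorted. This is only an illustration of the textbook technique, not a description of how Binance or Chronicle Queue builds its files; the stream names `early` and `late` and the helper `merge_by_id` are made up for the example.

```python
import heapq

def merge_by_id(stream_a, stream_b):
    """Lazily merge two streams of (agg_trade_id, transact_time) records.
    Assumption: each input stream is already sorted by agg_trade_id."""
    return heapq.merge(stream_a, stream_b, key=lambda r: r[0])

# Two sorted streams built from records quoted earlier in the thread.
early = [(1900987669, 1698796805027), (1900987670, 1698796805055)]
late = [(1929435985, 1701109037427), (1929435986, 1701109037427)]

# Argument order does not matter; the output is ordered by agg_trade_id.
merged = list(merge_by_id(late, early))
print(merged)
```

Because `heapq.merge` returns an iterator, the same approach works on streams too large to hold in memory, as long as each input is sorted.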

darienhuang commented 9 months ago

We are reviewing this and will fix it.

darienhuang commented 9 months ago

This was due to a software version issue. The Nov data on BTCUSDT is fixed.

CaymanTurtleBeach commented 9 months ago

Thanks. Are historical files for OCT / NOV fixed (to me it looked like OCT had the same problem) or is the problem fixed just for future monthly files?

UPDATE: Just checked dates for OCT / NOV files and looks like both are updated.

UPDATE-2: NOV data at first glance looks OK. However, OCT data is still scrambled - early data is interspersed with data that should be at the end of the file.