Kismuz / btgym

Scalable, event-driven, deep-learning-friendly backtesting library
https://kismuz.github.io/btgym/
GNU Lesser General Public License v3.0

Testing Environment on New Data (Cryptocurrency), getting InvalidIndexError #13

Closed · YazzyYaz closed this issue 6 years ago

YazzyYaz commented 6 years ago

Hi, I want to thank you for all the work you've been doing to support backtrader with a custom OpenAI Gym environment; it has really helped me a lot.

I'm having difficulties adapting the environment to a new dataset for the BTC-USD pair.

MyEnvironment = BTgymEnv(filename='btc_usd.csv')

Here are the first 10 lines of my btc_usd.csv, in which I've kept the same formatting as your example CSV. I've also added volume values to each row:

20170911 194700;4209.48;4209.98;4209.48;4209.98;3.0283
20170911 194600;4204.65;4209.49;4204.64;4204.64;1.4119397600000003
20170911 194500;4204.64;4204.65;4204.64;4204.65;7.74892999
20170911 194400;4215.01;4215.01;4203.04;4204.64;11.687842009999995
20170911 194300;4215;4215.01;4215;4215.01;1.6720000000000002
20170911 194200;4215.01;4215.01;4215.01;4215.01;6.270305769999999
20170911 194100;4215.01;4215.01;4215.01;4215.01;0.25924983
20170911 194000;4215.01;4215.01;4215.01;4215.01;1.8750531499999998
20170911 193900;4215.01;4215.01;4215.01;4215.01;0.0867
20170911 193800;4215.21;4215.21;4215;4215.01;5.09584227
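For reference, a file in this layout can be parsed into a datetime-indexed pandas DataFrame along these lines (a minimal sketch; the file name, column names, and timestamp format are assumed from the sample above):

import pandas as pd

# Assumed layout: datetime;open;high;low;close;volume, no header row.
df = pd.read_csv(
    'btc_usd.csv',
    sep=';',
    header=None,
    names=['datetime', 'open', 'high', 'low', 'close', 'volume'],
)
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y%m%d %H%M%S')
df = df.set_index('datetime')
print(df.head())
print(df.index.dtype)  # should be datetime64[ns] if parsing succeeded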

The error I'm getting is Pandas related:

Traceback (most recent call last):
  File "/Users/yazkhoury/anaconda3/lib/python3.6/multiprocessing/process.py", line 249, in _bootstrap
    self.run()
  File "/Users/yazkhoury/Desktop/ml-trading/btgym/btgym/dataserver.py", line 89, in run
    episode_dataset = self.dataset.sample_random()
  File "/Users/yazkhoury/Desktop/ml-trading/btgym/btgym/datafeed.py", line 253, in sample_random
    first_row = self.data.index.get_loc(adj_timedate, method='nearest')
  File "/Users/yazkhoury/anaconda3/lib/python3.6/site-packages/pandas/tseries/index.py", line 1411, in get_loc
    return Index.get_loc(self, key, method, tolerance)
  File "/Users/yazkhoury/anaconda3/lib/python3.6/site-packages/pandas/indexes/base.py", line 2138, in get_loc
    indexer = self.get_indexer([key], method=method, tolerance=tolerance)
  File "/Users/yazkhoury/anaconda3/lib/python3.6/site-packages/pandas/indexes/base.py", line 2262, in get_indexer
    tolerance=tolerance)
  File "/Users/yazkhoury/anaconda3/lib/python3.6/site-packages/pandas/indexes/base.py", line 2271, in get_indexer
    raise InvalidIndexError('Reindexing only valid with uniquely'
pandas.indexes.base.InvalidIndexError: Reindexing only valid with uniquely valued Index objects
[2017-09-14 11:00:03,575] Data_server unreachable with status: <receive_failed>.
Data_server unreachable with status: <receive_failed>.

I've attached a WeTransfer url to the CSV file here: https://we.tl/jWPBGGwyzT

I'm just curious what the problem could be on the pandas end. I verified that the date-time indexes aren't duplicated, and I've tested the environment on one of your example datasets, where it works perfectly.

I think it might be a formatting issue in my CSV, but I'm really not sure. Each row of my CSV is saved as: datetime, open, high, low, close, volume.

Kismuz commented 6 years ago

@YazzyYaz, At first glance it could be related to the fact that your file lists date-time entries in time-descending order (most recent first), while pandas' default data.index.get_loc() expects the opposite ordering:

20170102 020000;...
20170102 020100;...
20170102 020200;...
....
20170102 021000;...
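Since the sample is strictly time-descending, simply reversing the raw line order yields an ascending file; a minimal sketch with assumed file names (a pandas sort_index() pass would work equally well if the rows are not strictly reversed):

# Reverse the raw line order: strictly time-descending becomes time-ascending.
with open('btc_usd.csv') as src:
    lines = src.read().splitlines()
with open('btc_usd_ascending.csv', 'w') as dst:
    dst.write('\n'.join(reversed(lines)) + '\n')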

YazzyYaz commented 6 years ago

@Kismuz I tried it that way and I was still getting the same error. What was strange, however, is that it started working when I removed the method='nearest' argument from the .get_loc call. Logging adj_timedate gives me output like "2017-09-13 06:02:00", so I'm guessing that because my adj_timedate is very specific, finding its nearest neighbour in the dataset might fail? I'm really not sure why it was behaving this way.
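For what it's worth, the failure mode in the traceback can be reproduced with a tiny made-up index: nearest-neighbour lookups require a uniquely valued index, which is also why the error vanishes once method='nearest' is dropped (a sketch only; the timestamps below are invented, and get_indexer is what get_loc(..., method='nearest') calls internally per the traceback):

import pandas as pd

# A made-up index with one duplicated timestamp.
idx = pd.DatetimeIndex(['2017-09-13 06:01:00',
                        '2017-09-13 06:02:00',
                        '2017-09-13 06:02:00'])

# With a duplicate present, a nearest-neighbour lookup raises
# InvalidIndexError: "Reindexing only valid with uniquely valued Index objects".
try:
    idx.get_indexer([pd.Timestamp('2017-09-13 06:02:30')], method='nearest')
except Exception as err:
    print(type(err).__name__, ':', err)

# After dropping the duplicate, the same nearest lookup succeeds.
unique_idx = idx[~idx.duplicated(keep='first')]
print(unique_idx.get_indexer([pd.Timestamp('2017-09-13 06:02:30')], method='nearest'))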

I've attached a WeTransfer link in case you'd like to check out the dataset (It's been modified to the correct time-ascending order you pointed out): https://we.tl/Ewks40JBXL

Kismuz commented 6 years ago

@YazzyYaz, Your source file contains duplicate date_time indexes, and I hadn't included any checks to filter them out. This is easy to see with the version of BTgym you currently have:

from btgym import BTgymEnv, BTgymDataset

MyDataset = BTgymDataset(
    filename='./data/dat_ASCCII_BTCUSD_M1_1.csv',  # -- your renamed file
)
MyDataset.read_csv()
print(MyDataset.data.index.get_duplicates())  # -- shows 19 duplicates in your case

I have included a duplicates check-and-remove feature in BTgym; you can upgrade to the latest version and proceed with your files. Setting verbose=1 will print out the relevant info:

env = BTgymEnv(filename='./data/dat_ASCCII_BTCUSD_M1_1.csv', verbose=1)
env.close()
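For anyone curious, a duplicate-removal pass over such a file can be sketched in a few lines of pandas (an illustration only, not BTgym's actual implementation; the file names and timestamp format are assumed from earlier in the thread):

import pandas as pd

# Load the headerless, semicolon-delimited file and parse the timestamps.
df = pd.read_csv(
    './data/dat_ASCCII_BTCUSD_M1_1.csv',
    sep=';',
    header=None,
    names=['datetime', 'open', 'high', 'low', 'close', 'volume'],
)
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y%m%d %H%M%S')
df = df.set_index('datetime')

# Drop duplicated timestamps, keeping the first occurrence, and ensure ascending order.
dupes = df.index.duplicated(keep='first')
print('duplicate timestamps dropped:', dupes.sum())
df = df[~dupes].sort_index()

# Write the cleaned file back out in the same semicolon-delimited layout.
df.to_csv('./data/btc_usd_clean.csv', sep=';', header=False, date_format='%Y%m%d %H%M%S')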

Thank you for pointing out this issue!

YazzyYaz commented 6 years ago

@Kismuz Thank you, your new code works! I should have checked for duplicates more thoroughly; it works perfectly now. Thanks for helping me with this!