QuantConnect / Lean

Lean Algorithmic Trading Engine by QuantConnect (Python, C#)
https://lean.io
Apache License 2.0
9.69k stars 3.25k forks source link

Dask serialization issues due to wrapper object for DataFrame #5395

Open alexanderwiller opened 3 years ago

alexanderwiller commented 3 years ago

Expected Behavior

The DataFrame returned by h = qb.History(qb.Securities.Keys, 10) should be a regular data frame or a conversion method should be available.

Actual Behavior

The DataFrame returned by the history call is some special wrapper object (https://github.com/QuantConnect/Lean/blob/4c085ff853b8a7e63aa4fc9dff7f03c354a75e7f/Common/Python/PandasData.cs#L571), causing various parallel processing libraries to fail during pickling. In the following stacktrace, it can bee seen that the module 'remapper' cannot be found, which is defined here: https://github.com/QuantConnect/Lean/blob/master/Common/Python/PandasData.cs

UserWarning: Distributing <class 'pandas.core.frame.DataFrame'> object. This may take some time.
distributed.core - ERROR - Exception while handling op scatter
Traceback (most recent call last):
  File "/opt/miniconda3/lib/python3.6/site-packages/distributed/core.py", line 500, in handle_comm
    result = await result
  File "/opt/miniconda3/lib/python3.6/site-packages/distributed/scheduler.py", line 4773, in scatter
    nthreads, data, rpc=self.rpc, report=False
  File "/opt/miniconda3/lib/python3.6/site-packages/distributed/utils_comm.py", line 149, in scatter_to_workers
    for address, v in d.items()
  File "/opt/miniconda3/lib/python3.6/site-packages/distributed/utils.py", line 229, in All
    result = await tasks.next()
  File "/opt/miniconda3/lib/python3.6/site-packages/distributed/core.py", line 861, in send_recv_from_rpc
    result = await send_recv(comm=comm, op=key, **kwargs)
  File "/opt/miniconda3/lib/python3.6/site-packages/distributed/core.py", line 662, in send_recv
    raise Exception(response["text"])
Exception: No module named 'remapper'

Potential Solution

  1. Allow conversion into a regular panda data frame. I have tried using the regular pandas DataFrame constructor, but the wrapper remains. I think this would be clearly the preferable way, as any downstream issues wrt pandas parallel processing will be avoided.
  2. Document how the PYTHONPATH has to be modified (in the notebook itself, prior to importing modin) when using the Research Docker image so that the Dask workers can locate the remapper module.

Reproducing the Problem

from clr import AddReference
from QuantConnect.Python import *
from QuantConnect.Brokerages import BrokerageName

import modin.pandas as pd
import pandas as pdo

qb = QuantBook()
qb.SetBrokerageModel(BrokerageName.Bitfinex, AccountType.Margin)
qb.SetStartDate(2021,1,18)
qb.AddCrypto("BTCUSD", Resolution.Minute)

h = qb.History(qb.Securities.Keys, 2 * 10)

# This shows that the remapper wrapper object is kinda 'persistent'
h_pandas = pdo.DataFrame(h)
h_modin = pd.DataFrame(h_pandas)

System Information

quantconnect/research:latest docker image, extended by installation of modin[dask]==0.6.3.

Checklist

jaredbroad commented 1 year ago

Remapper is for symbol id access to data frame -- feature could be adding converter to vanilla data frame.