to_weight with mask returns incorrect NaN

temph2020 commented 1 month ago

Thank you very much to the author for contributing this excellent library. I conducted a simple test using the data provided in this repository (see tests/data/daily). The code is as follows:

    import spectre
    from os.path import dirname

    data_dir = dirname(__file__) + '/data/'
    loader = spectre.data.CsvDirLoader(
        data_dir + 'daily/',
        ohlcv=('uOpen', 'uHigh', 'uLow', 'uClose', 'uVolume'),
        prices_index='date',
        parse_dates=True,
    )
    engine = spectre.factors.FactorEngine(loader)
    universe = spectre.factors.OHLCV.volume.top(1)
    engine.set_filter(universe)
    engine.add(spectre.factors.OHLCV.open, 'open')
    engine.add(spectre.factors.OHLCV.open.to_weight(mask=universe), 'weight')
    df = engine.run("2019-01-01", "2019-01-04")
    print(df)

The output is as follows. The weights are all NaN, but in theory they should all be 1, because the universe contains only one stock.

    date                       asset    open  weight
    2019-01-03 00:00:00+00:00  MSFT   103.78     NaN
    2019-01-04 00:00:00+00:00  AAPL   148.84     NaN

I am using torch 2.0.1.

Heerozh commented 1 month ago

You should disable demean: to_weight(mask=universe, demean=False)

When the values have no variance, the result after demeaning is 0, so a division by zero occurs.
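To make the division by zero concrete, here is a minimal sketch in plain Python. It assumes that to_weight optionally subtracts the cross-sectional mean and then divides by the sum of absolute values; this assumption is consistent with the weights posted in this thread, but the function below is a hypothetical stand-in, not spectre's implementation.

```python
import math

def to_weight(values, demean=True):
    """Toy cross-sectional weighting (assumed formula, not spectre's code):
    optionally subtract the mean, then divide by the sum of absolute values."""
    if demean:
        mean = sum(values) / len(values)
        values = [v - mean for v in values]
    total = sum(abs(v) for v in values)
    # Tensor libraries return NaN for 0/0; plain Python raises instead,
    # so the NaN is produced explicitly here.
    if total == 0:
        return [math.nan for _ in values]
    return [v / total for v in values]

# Three assets: the demeaned values have a non-zero absolute sum.
three = to_weight([279.145659, 608.354321, 3337.147931])
# One asset: demeaning leaves 0, so the denominator is 0 as well.
single_demeaned = to_weight([103.78])
# One asset without demeaning: the value is divided by itself.
single_raw = to_weight([103.78], demean=False)

print([round(w, 6) for w in three])
print(single_demeaned, single_raw)
```

With a single asset in the universe, demeaning leaves nothing to normalize, which is why demean=False is the suggested setting for a one-stock universe.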

temph2020 commented 1 month ago

Thanks. I set to_weight(mask=universe, demean=False), and there is still a NaN as follows:

    date                       asset    open  weight
    2019-01-03 00:00:00+00:00  MSFT   103.78     NaN
    2019-01-04 00:00:00+00:00  AAPL   148.84     1.0

If we use engine.add(spectre.factors.OHLCV.open.to_weight(), 'weight') without a mask, the result is correct:

    date                       asset    open  weight
    2019-01-03 00:00:00+00:00  MSFT   103.78    -0.5
    2019-01-04 00:00:00+00:00  AAPL   148.84     0.5

So I think the problem is related to the mask.

Heerozh commented 1 month ago

You should use real data sources instead of the test data, which intentionally contains corrupted data.

temph2020 commented 1 month ago

I discovered the issue on the real data source. To make it easier for you to reproduce the results, I also ran it on your data source. Here are the results from the real data source. An unexpected NaN appeared on January 5th.

    from spectre import factors
    from spectre.data import ArrowLoader

    loader = ArrowLoader('./df_raw.feather')
    start = '2022-1-1'
    end = '2022-1-7'
    engine = factors.FactorEngine(loader)
    engine.to_cpu()

    universe = factors.AverageDollarVolume(win=21).top(3)
    engine.set_filter(universe)
    myopen = factors.OHLCV.open
    engine.add(myopen, 'myopen')
    engine.add(myopen.to_weight(demean=True, mask=universe), 'weight')
    df = engine.run(start, end, delay_factor=True)
    print(df)

Output:

    date                       asset        myopen     weight
    2022-01-04 00:00:00+00:00  000858   279.145659  -0.292667
                               300750   608.354321  -0.207333
                               600519  3337.147931   0.500000
    2022-01-05 00:00:00+00:00  000858   279.346248        NaN
                               300750   568.600474  -0.186907
                               600519  3320.908768   0.500000
    2022-01-06 00:00:00+00:00  300750   540.602121  -0.189343
                               600519  3283.574933   0.500000
                               600941    57.880000  -0.310657
    2022-01-07 00:00:00+00:00  300750   552.136760  -0.186127
                               600519  3207.234630   0.500000
                               600941    57.800000  -0.313873

temph2020 commented 1 month ago

If I change factors.OHLCV.open to factors.OHLCV.close, the result is correct. The problem seems to be related to should_delay=False in factors.OHLCV.open.

temph2020 commented 1 month ago

I spent an entire day analyzing the issue and finally discovered the root cause. The previous description was somewhat confusing, so let me clarify the problem again. As shown in the code and output below, on January 6th the weight of 600905 appeared as an unexpected NaN. The reason is that, although it is in universe_delayed (which is equivalent to universe), it is not in universe_notDelayed. In theory, it should not be affected by universe_notDelayed, and its weight should still be computable. It seems that spectre confuses universe_delayed with universe_notDelayed here, leading to the calculation error.

[Image: 1727000629943]
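The mismatch described above can be reproduced with a toy model in plain Python. Everything here is hypothetical (asset names, prices, and the masked_weights helper); it only illustrates how a row selected by the delayed universe can receive NaN when the weight was computed under the undelayed universe, and it is not spectre's internal logic.

```python
import math

# Hypothetical opening prices on the current bar.
open_today = {"A": 10.0, "B": 20.0, "C": 30.0}

# Universe built from the previous bar: decides which rows are output.
universe_delayed = {"A", "B"}
# Universe built from the current bar: used as the weighting mask.
universe_not_delayed = {"B", "C"}

def masked_weights(values, mask):
    """Toy weighting without demeaning: assets outside the mask get NaN,
    assets inside are divided by the sum of absolute values in the mask."""
    total = sum(abs(v) for k, v in values.items() if k in mask)
    return {
        k: (v / total if k in mask and total else math.nan)
        for k, v in values.items()
    }

weights = masked_weights(open_today, universe_not_delayed)
# Only assets in the delayed universe reach the output table.
output = {k: weights[k] for k in sorted(universe_delayed)}
print(output)
```

Asset "A" is in the output filter but not in the mask, so its row carries NaN, while "C" receives a weight that never appears in the output.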

Another point is that OHLCV.open is also not delayed. If we use OHLCV.close (which is delayed), the weight is correct.

Heerozh commented 1 month ago

By default, open data is not shifted by one bar, but close data is. This is intended for HFT, but it is easy to get confused by. For your daily data, you can try df = engine.run(start, end, delay_factor='all') (do not use set_delay).

With this setting, every result calculated on the current bar uses data from the previous bar. (Otherwise, current-bar data is used as much as possible.)

temph2020 commented 1 month ago

df = engine.run(start, end, delay_factor='all') gives the same result. I know OHLCV.open is not delayed by default and this is exactly what I want. My strategy is to calculate the factor weights based on the opening price of the current day immediately after the market opens each morning, and then place orders at the opening price of the current day. The universe consists of stocks of the top 2 trading volumes from the previous day. Therefore, the logic in my code for calculating weights should achieve my goal, but there is an unexpected NaN on January 6th. Do you think that NaN is consistent with the internal logic of spectre?

Heerozh commented 1 month ago

> I know OHLCV.open is not delayed by default and this is exactly what I want. My strategy is to calculate the factor weights based on the opening price of the current day immediately after the market opens each morning, and then place orders at the opening price of the current day. The universe consists of stocks of the top 2 trading volumes from the previous day.

In this case, it is still recommended to use delay_factor='all' and then manually use open.shift(-1) as your intraday data, which is clearer and more intuitive.

Let me add that the underlying automatic delay mechanism is not designed for hand-written code, but for AI mining and HFT.

> Do you think that NaN is consistent with the internal logic of spectre?

I can't test the code at the moment, but I think the NaN comes from mixing delayed and non-delayed factors.

temph2020 commented 1 month ago

In my code, I use df_factor = engine.run(start, end, delay_factor=True). You suggest delay_factor='all', but I think True and 'all' are equivalent in this function. I dug into the code and tested it, and the results are always the same. Anyway, I set it to 'all' this time and used open.shift(-1) as you suggested, and got the following result. On January 6th, we still get a NaN. By the way, I use daily data.

temph2020 commented 1 month ago

I finally resolved it by using: engine.add(myopen.to_weight(demean=False, mask=universe.shift(1)), 'weight')

This way, I get correct weights and no NaNs. But I still think the mask should be shifted by spectre automatically, not set manually.
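Why shifting the mask helps can be sketched with a toy model in plain Python. The per-day universes, prices, and the shift, masked_weights, and nan_rows helpers are all hypothetical; the sketch assumes the output rows are selected by the previous bar's universe and shows that a mask shifted by one bar lines up with that selection.

```python
import math

dates = ["d1", "d2", "d3"]
# Hypothetical universe computed from each day's own data (undelayed).
universe = {"d1": {"A", "B"}, "d2": {"B", "C"}, "d3": {"B", "C"}}
# Hypothetical opening prices per day.
opens = {d: {"A": 10.0, "B": 20.0, "C": 30.0} for d in dates}

def shift(per_day, n=1):
    """Toy shift: each date takes the value from n bars earlier."""
    return {
        d: (per_day[dates[i - n]] if i - n >= 0 else set())
        for i, d in enumerate(dates)
    }

def masked_weights(values, mask):
    """Toy weighting without demeaning; assets outside the mask get NaN."""
    total = sum(abs(v) for k, v in values.items() if k in mask)
    return {
        k: (v / total if k in mask and total else math.nan)
        for k, v in values.items()
    }

def nan_rows(mask_per_day):
    """Rows kept by the delayed output filter whose weight is NaN."""
    output_filter = shift(universe)  # rows chosen by the previous bar
    rows = []
    for d in dates:
        w = masked_weights(opens[d], mask_per_day[d])
        rows += [(d, a) for a in sorted(output_filter[d]) if math.isnan(w[a])]
    return rows

print(nan_rows(universe))         # mask from the current bar
print(nan_rows(shift(universe)))  # mask shifted to match the output filter
```

With the unshifted mask, asset "A" is output on "d2" but excluded from that day's mask; with the shifted mask, the mask and the output filter select the same assets.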

Heerozh commented 1 month ago

Sorry, it looks like I have not pushed the code related to delay_factor='all' yet. 😀

If I'm not mistaken, the reason for the NaN is probably this: the mask uses today's universe, but the output filter set by set_filter uses yesterday's universe.

> But I still think the mask should be shifted by spectre automatically, not set manually.

Delay is an output-stage process; it is not performed during the factor calculation stage.

temph2020 commented 1 month ago

So, will you push your new code with delay_factor='all'?

Heerozh commented 1 month ago

There is some code related to company work that I need to branch out first, and I am always too lazy to do it.

Heerozh / spectre

to_weight with mask returns incorrect NaN #27