fja05680 / pinkfish

A backtester and spreadsheet library for security analysis.
https://fja05680.github.io/pinkfish
MIT License

Undesirable results from select_tradeperiod() #42

Closed: EcoFin closed this issue 3 years ago

EcoFin commented 3 years ago

I like the pinkfish codebase a lot. As a practical test, I tried to replicate a very simple crossover experiment on data from 1970 to mid-2011. That led me to find several data handling problems. It is probably better to report them as separate issues. Here is the first one:

How come? Because of the dropna() in select_tradeperiod(). Why is that? For some reason, the Yahoo data has no open values from 1971 onward, except for a single day in 1978, and then nothing until 1982. (finalize_timeseries() drops another year on account of NaNs in the indicator calculation.)

The result is a ts dataframe with one record for 1978-07-26 and the next record for 1982-04-20. That is not desirable behaviour. I understand that dropping records with no open might be what you want for intraday crypto trading, but for EOD equities or ETFs it's not what is expected. And it certainly creates a calendaring problem.
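To make the failure mode concrete, here is a minimal pandas sketch. It does not use pinkfish at all: the dataframe, its values, and the variable names are made up for illustration. It only shows how a blanket dropna() turns missing open values into calendar gaps, while a check limited to close does not.

```python
import pandas as pd

nan = float("nan")

# Made-up daily bars: 'open' is missing on two days, 'close' is complete.
ts = pd.DataFrame(
    {
        "open":  [10.0, nan, nan, 10.6, 10.8],
        "close": [10.1, 10.3, 10.5, 10.7, 10.9],
    },
    index=pd.date_range("1978-07-24", periods=5, freq="D"),
)

# A blanket dropna() discards every row that has a NaN in any column,
# which leaves a hole in the calendar.
blanket = ts.dropna()

# Restricting the check to 'close' keeps all five days.
close_only = ts.dropna(subset=["close"])

# Largest distance between consecutive rows: a quick calendar sanity check.
max_gap = blanket.index.to_series().diff().max()

print(len(blanket), len(close_only), max_gap)
```

On real data the same max_gap check would flag a jump like 1978-07-26 to 1982-04-20 immediately.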

This is a pathological situation that I have never seen before. In my experience, Yahoo's data is quite reliable.

Two possible fixes come to mind:

  1. more data checking in select_tradeperiod(), but that is a never-ending struggle
  2. let the user identify priority columns for the dropna()

My own approach is never to dropna() automatically; I've had too many bad experiences.

ay

fja05680 commented 3 years ago

I have implemented a less aggressive NaN strategy that fixes the problem described in this issue. This approach gives the pinkfish user more control over, and more responsibility for, solving problems with their timeseries, while also providing options to fix problems automatically.

  1. fetch_timeseries() no longer removes rows with NaN values.
  2. select_tradeperiod() lets you specify the columns to check for NaN values via the check_fields argument. The default is ['close']. You can specify an empty list if you want.
  3. finalize_timeseries() has a dropna argument that lets you specify whether to drop rows with NaN values. The default is False.
  4. The stock_market_calendar is no longer applied unless force_stock_market_calendar=True is set in select_tradeperiod().
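For readers who want to see what a column-limited NaN check amounts to, here is a rough pandas sketch. It is not the pinkfish implementation: drop_nan_rows() is a hypothetical helper, and the sample data is made up. It only mirrors the check_fields idea described above, including the empty-list case.

```python
import pandas as pd

nan = float("nan")

def drop_nan_rows(ts, check_fields=("close",)):
    """Hypothetical helper: drop rows only where the listed columns are NaN."""
    fields = list(check_fields)
    if not fields:
        # Empty list: skip the NaN check entirely and keep every row.
        return ts
    return ts.dropna(subset=fields)

# Made-up bars: one row is missing 'open', another is missing 'close'.
ts = pd.DataFrame(
    {
        "open":  [10.0, nan, 10.4, 10.6],
        "close": [10.1, 10.3, nan, 10.7],
    },
    index=pd.date_range("1978-07-24", periods=4, freq="D"),
)

default_check = drop_nan_rows(ts)                    # 3 rows: only the NaN close is dropped
strict_check = drop_nan_rows(ts, ["open", "close"])  # 2 rows: either NaN drops the row
no_check = drop_nan_rows(ts, [])                     # 4 rows: nothing is dropped
```

The explicit guard for the empty list is a design choice in this sketch, so the result does not depend on how dropna() treats an empty subset.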

A warning is issued if NaN values are detected within finalize_timeseries(). A function, find_nan_rows(), was added to help track down problematic timeseries. The above changes were also made for portfolio, except that the dropna default is True in portfolio.finalize_timeseries().
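As an illustration of the warn-and-inspect approach, the sketch below shows one way to list the offending rows with plain pandas. The names rows_with_nan() and warn_if_nan() are hypothetical stand-ins, not the pinkfish functions, and the warning text is invented.

```python
import warnings

import pandas as pd

nan = float("nan")

def rows_with_nan(ts):
    """Hypothetical stand-in: return only the rows that contain any NaN."""
    return ts[ts.isna().any(axis=1)]

def warn_if_nan(ts):
    """Warn, rather than drop, when NaN values are present; return the bad rows."""
    bad = rows_with_nan(ts)
    if not bad.empty:
        warnings.warn(f"{len(bad)} row(s) contain NaN values")
    return bad

# Made-up bars with two problematic rows.
ts = pd.DataFrame(
    {
        "open":  [10.0, nan, 10.4, 10.6],
        "close": [10.1, 10.3, nan, 10.7],
    },
    index=pd.date_range("1978-07-24", periods=4, freq="D"),
)

bad = rows_with_nan(ts)
print(bad)
```

Leaving the rows in place and reporting them keeps the decision about what to drop with the user.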

With the default settings, the example above works as you expect.

"Yahoo data downloads from 1971 ... not good enough, but adequate for a quick test." This is or was a problem on Windows platform only. You can see if it's been resolved by setting from_year=1900 in fetch_timeseries(). If it's no longer a problem, please let me know and I'll remove the check.