add lookback to `get_fund_data()`

AlexAxthelm commented 9 months ago

⚠️ Potentially Breaking: Changes logic for pulling FactSet data.

Previous behavior: For certain fields in factset data, filter to exact date of data_timestamp (as-of date for pulling FactSet Data). Mostly applied to end-of month or end-of-period reporting values (ex. share prices)

New behavior: Pull most recent value, if most recent value is after data_timestamp_lookback (which defaults to 1 month before data_timestamp for share price & fund holdings, 1 year before for financing data.

For example, if we enter data_timestamp <- 20230123 (Jan 23, 2023), then filtering on exact date yield no rows for adj_price (Adjusted share price) when pulling financial data.

However, with the lookback of 1 month, we get all the rows for securities which posted a share price as of December 31 2022 (since that is after Dec 23, 2022).

Key decision points:

Is this behavior what we expect
Do we want to implement stricker checking to get only end-of-month values?
- I've only observed end-of-month in actual data, but more checks aren't a bad thing
Should the default lookback be 1 month? Or do we want a different period for inclusion
- Note that a liot fo the FS tables include both active and terminated securities, so filtering to only active securities may be a bit tricky. Right now looking back 1 month is serving as a proxy filter for active securities.

github-actions[bot] commented 9 months ago

Docker image from this PR (849d0096faeeb2ae3e53da0747e08f775e95395e) created

docker pull ghcr.io/rmi-pacta/workflow.factset:pr3

Antoine-Lalechere commented 9 months ago

Not sure to 100% get it here.

If in the previous case, we also get an adj_price for the Jan 10, 2023 , which one would we have at the end of the process, the newest one closer to data_timestamp or still the Dec 31 2023, because there is another filter somewhere else?

@AlexAxthelm

AlexAxthelm commented 9 months ago

If in the previous case, we also get an adj_price for the Jan 10, 2023 , which one would we have at the end of the process, the newest one closer to data_timestamp or still the Dec 31 2023, because there is another filter somewhere else?

In the current process, if we specify data_timestamp to be Jan 10, then we should end with no share prices or fund holdings, since in the actual data, none of those have end-of-reporting-period dates on Jan 10.

Theoretically, if we got a share price reported on Dec 31, and Jan 10, and we specified Jan 23 as our date of interest, we would get the more recent one (Jan 10)

cjyetman commented 9 months ago

If you specify a date, the data should be from that date... otherwise we have a lot of explaining to do.

AlexAxthelm commented 7 months ago

Closing, since there's been no pick up on this thread.

NAYRA-HERRERA commented 7 months ago

@AlexAxthelm so what happened with this ? does it mean it was implemented or not? and what are the consequences? Is a related ADO ticket for these type of decisions?

AlexAxthelm commented 7 months ago

Not implemented, so no changes and no consequences, since there wasn't any further discussion around it. If we want to pick up this thread again we can at any time, but someone other than me should probably sketch out what it would look like.

AlexAxthelm commented 7 months ago

Reviving this thread, since it looks like for 2023Q4 data, we will need to include data timestamps for 2023-12-31 and 2023-12-29 (last Friday of the year).

Open questions:

Do we want to have a date range, or a discrete list of dates? (2023-12-24 through 2023-12-31 vs c(2023-12-29, 2023-12-31)) I would lean towards the latter for clarity in what's included, but the former would reduce the burden for the future, since we could consistently use "X days lookback" as part of our methodology.

What should be the lookback period? (only relevant if we choose the "range" option above). 1 week back, or 1 month back seem like reasonable options.

~Should we use this lookback filter anywhere we use date_timestamp? See @cjyetman's comment above: https://github.com/RMI-PACTA/workflow.factset/pull/3#issuecomment-1862642755. I believe that for consistency, whatever filter we're using should be applied to all our data. This would also make messaging easier, since we don't have to worry about communicating nuances like "fund holdings as of these dates, but share price and volume from these")~ This is not a major issue with actual data, so non-fund data sets were removed in f511fe7

~Where in the processing pipeline should we use this filter? Currently, these are applied early on in the process. One possible concern here is that the report_date from different tables may be different for a given fund. (report data for holdings may not sync with report date for filing history). FactSet does not explicitly identify a connection between these two field across these tables.~ Question not relevant in real data. See https://github.com/RMI-PACTA/workflow.factset/pull/3#issuecomment-1970923192

@cjyetman @Antoine-Lalechere @NickPACTA @Nicky-Halterman @NAYRA-HERRERA cc @hodie @jdhoffa

cjyetman commented 7 months ago

Since different metrics/values have different levels of volatility, significantly so in some cases, I don't think there's a justifiable way to apply the same date range to all data.

AlexAxthelm commented 7 months ago

Where in the processing pipeline should we use this filter? Currently, these are applied early on in the process. One possible concern here is that the report_date from different tables may be different for a given fund. (report data for holdings may not sync with report date for filing history). FactSet does not explicitly identify a connection between these two field across these tables.

Reviewing actual data shows this is incredibly uncommon (1 entry of 4.4M). Will resolve with join condition that holding report date must be equal to filing report date. Simplifies filtering options significantly.

AlexAxthelm commented 7 months ago

I've reached out to FactSet support for confirmation on methodolgy for pulling fund composition as-of a given date.

AlexAxthelm commented 7 months ago

We've heard back from FactSet Support:

Regarding the query 'for determining the current composition, should we extract the most recent reporting date for each fund - our Ownership team confirmed ,YES!

So we're on the right track, in that getting a funds (known) composition involves getting the data from its most recent report_date

RMI-PACTA / workflow.factset

add lookback to `get_fund_data()` #3