cjyetman closed 7 months ago
It would be important to consider whether the implementation forces the factset-export-manifest.json file to be a mandatory input, or if it only does this if the file is available.
The only time that the file would be missing is when the factset files were generated piecemeal (not through export_pacta_files()). I would be suspicious of using any such files for data prep (beyond development work), so overall I'm not seeing a problem with making it mandatory, but definitely your call.
Also, that manifest file has a lot of info in it that isn't secret, but probably not relevant once we've moved things through data prep (session and envvars being the obvious ones). Maybe we only add some of the keys to the dataprep manifest?
The only time that the file would be missing is when the factset files were generated piecemeal (not through export_pacta_files()). I would be suspicious of using any such files for data prep (beyond development work), so overall I'm not seeing a problem with making it mandatory, but definitely your call.
I was considering the situation where one downloads a set of FactSet files and copies in the ones they think they need into their inputs directory where their AI inputs also exist (this is typically what I do locally). I get that workflow.factset always exports all of these files to a single directory, but the directory that one inputs to workflow.data.preparation does not necessarily have all of those files. I realize that a user probably "should not" manually populate their inputs directory, but there's really no way that I can think of that workflow.data.preparation can guarantee that a user did not do that.
So the decision would be, do we add factset-export-manifest.json to the README as a mandatory input file and add tests in the script to ensure that the file exists and throw an error if it does not... or do we copy it over if it exists, but not make it strictly necessary for the process to run?
Also, that manifest file has a lot of info in it that isn't secret, but probably not relevant once we've moved things through data prep (session and envvars being the obvious ones). Maybe we only add some of the keys to the dataprep manifest?
Happy to have that discussion here, because frankly I don't know what is or is not appropriate and/or potentially secret.
I was considering the situation where one downloads a set of FactSet files and copies in the ones they think they need into their inputs directory where their AI inputs also exist (this is typically what I do locally).
I see. That's an entirely reasonable workflow during development. I would say that the happy compromise here is to emit a warning that the file isn't present, and cannot be included. If the user did things manually, then they shouldn't be surprised by it, but if it's in an automated process, then that's a good sign that something is wrong.
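The warn-and-continue compromise could look roughly like the sketch below. It is only an illustration in base R: the function name copy_factset_manifest() and its arguments are hypothetical and are not part of workflow.data.preparation.

```r
# Hypothetical helper (not an existing function in workflow.data.preparation):
# copy the FactSet manifest to the outputs if it exists, otherwise warn.
copy_factset_manifest <- function(input_dir, output_dir,
                                  filename = "factset-export-manifest.json") {
  src <- file.path(input_dir, filename)

  if (!file.exists(src)) {
    # Do not stop the run: a manually populated inputs directory is a valid
    # development workflow, but an automated run should surface this.
    warning(
      filename, " not found in ", input_dir,
      "; it will not be included in the outputs.",
      call. = FALSE
    )
    return(invisible(FALSE))
  }

  # file.copy() returns TRUE on success and FALSE on failure.
  copied <- file.copy(src, file.path(output_dir, filename), overwrite = TRUE)
  invisible(copied)
}
```

Making the file mandatory instead would amount to replacing warning() with stop() in the missing-file branch.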
because frankly I don't know what is or is not appropriate and/or potentially secret.
I'm of the mind that nothing secret should be in a manifest file. But also, I'm of the mind that very few things should be actually secret (pretty much only keys/passwords/etc. Obscurity isn't security).
So the main question I see here is one of relevance.
The top-level keys (and if they would be useful downstream as diagnostic information):
files: file metadata for exported files
data_timestamp: data timestamp used for data pull.
start_time: also part of the unique pull ID.
export_dir: directory to which files were exported. Probably not useful since files would be moved/copied as part of data prep
envvars: Settings for Factset pull. Mostly defining connection to the (ephemeral) DB
session: info about R session in which workflow.factset was run.
metadata_creation_time_date: when the metadata file was created. Effectively a timestamp on when workflow.factset finished (or close to it). Different from start_time, probably by a few hours, but not actually used anywhere.
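If only some of those keys get carried into the dataprep manifest, the selection could be a plain list subset. The sketch below is an assumption-laden illustration: which keys to keep is exactly the open question, so keep_keys is a hypothetical choice, subset_factset_manifest() is a made-up name, and the example list uses placeholder values instead of a real manifest (in practice the list would come from parsing the JSON file, for example with jsonlite::fromJSON()).

```r
# Hypothetical choice of keys to carry forward; which keys are actually
# relevant downstream is still undecided in this thread.
keep_keys <- c(
  "files", "data_timestamp", "start_time", "metadata_creation_time_date"
)

# Hypothetical helper: keep only the requested top-level keys that are present.
subset_factset_manifest <- function(manifest, keys = keep_keys) {
  manifest[intersect(keys, names(manifest))]
}

# Stand-in for a parsed manifest; all values are placeholders, not real data.
example_manifest <- list(
  files = list(),
  data_timestamp = "<timestamp>",
  start_time = "<timestamp>",
  export_dir = "<path>",
  envvars = list(),
  session = list(),
  metadata_creation_time_date = "<timestamp>"
)

# Show which top-level keys survive the subset.
print(names(subset_factset_manifest(example_manifest)))
```

Using intersect() means a manifest that lacks one of the requested keys is passed through without error, which fits the case where the files were assembled by hand.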
workflow.factset exports a factset-export-manifest.json with its standard outputs that details everything one might want to know about the FactSet data files in that export (thanks @AlexAxthelm!)

We should consider whether we should copy the factset-export-manifest.json file to the output directory and/or include the data in the factset-export-manifest.json in the manifest.json that workflow.data.preparation adds to its output directory

⚠️ It would be important to consider whether the implementation forces the factset-export-manifest.json file to be a mandatory input, or if it only does this if the file is available.

AB#10385