consider copying `factset-export-manifest.json` to outputs and/or copying info to `manifest.json`

cjyetman commented 8 months ago

workflow.factset exports a factset-export-manifest.json with its standard outputs that details everything one might want to know about the FactSet data files in that export (thanks @AlexAxthelm!)

We should consider whether we should copy the factset-export-manifest.json file to the output directory and/or include the data in the factset-export-manifest.json in the manifest.json that workflow.data.preparation adds to its output directory

⚠️ It would be important to consider whether the implementation forces the factset-export-manifest.json file to be a mandatory input, or if it only does this if the file is available.

AB#10385

AlexAxthelm commented 8 months ago

It would be important to consider whether the implementation forces the factset-export-manifest.json file to be a mandatory input, or if it only does this if the file is available.

The only time that the file would be missing is when the factset files were generated piecemeal (not through export_pacta_files()). I would be suspect about using any such files for data prep (beyond development work), so overall I'm not seeing a problem with making it mandatory, but definitely your call.

AlexAxthelm commented 8 months ago

Also, that manifest file has a lot of info in it that isn't secret, but probably not relevant once we've moved things through data prep (session and envvars being the obvious ones). Maybe we only add some of the keys to the dataprep manifest?

cjyetman commented 8 months ago

The only time that the file would be missing is when the factset files were generated piecemeal (not through export_pacta_files()). I would be suspect about using any such files for data prep (beyond development work), so overall I'm not seeing a problem with making it mandatory, but definitely your call.

I was considering the situation where one downloads a set of FactSet files and copies in the ones they think they need into their inputs directory where their AI inputs also exist (this is typically what I do locally). I get that workflow.factset always exports all of these files to a single directory, but the directory that one inputs to workflow.data.preparation does not necessarily have all of those files. I realize that a user probably "should not" manually populate their inputs directory, but there's really no way that I can think of that workflow.data.preparation can guarantee that a user did not do that.

So the decision would be, do we add factset-export-manifest.json to the README as a mandatory input file and add tests in the script to ensure that the file exists and throw an error if it does not... or do we copy it over if it exists, but not make it strictly necessary for the process to run?

cjyetman commented 8 months ago

Also, that manifest file has a lot of info in it that isn't secret, but probably not relevant once we've moved things through data prep (session and envvars being the obvious ones). Maybe we only add some of the keys to the dataprep manifest?

Happy to have that discussion here, because frankly I don't know what is or is not appropriate and/or potentially secret.

AlexAxthelm commented 8 months ago

I was considering the situation where one downloads a set of FactSet files and copies in the ones they think they need into their inputs directory where their AI inputs also exist (this is typically what I do locally).

I see. That's an entirely reasonable workflow during development. I would say that the happy compromise here is to emit a warning that the file isn't present, and cannot be included. If the user did things manually, then they shouldn't be surprised by it, but if it's in an automated process, then that's a good sign that something is wrong.

AlexAxthelm commented 8 months ago

because frankly I don't know what is or is not appropriate and/or potentially secret.

I'm of the mind that nothing secret should be in a manifest file. But also, I'm of the mind that very few things should be actually secret (pretty much only keys/passwords/etc. Obscurity isn't security).

So the main question I see here is one of relevance.

The top-level keys (and if they would be useful downstream as diagnostic information):

✅ files: file metadata for exported files
✅ data_timestamp: data timestamp used for data pull.
✅ start_time: also part of the unique pull ID.
🚫 export_dir: directory to which files were exported. Probably not useful since files would be moved/copied as part of data prep
🚫 envvars: Settings for Factset pull. Mostly defining connection to the (ephemeral) DB
🚫 session: info about R session in which workflow.factset was run.
🤷 metadata_creation_time_date: when the metadata file was created. effectively a timestamp on when workflow.factset finished (or close to it). Different from start_time, probably by a few hours, but not actually used anywhere.

RMI-PACTA / workflow.data.preparation

consider copying `factset-export-manifest.json` to outputs and/or copying info to `manifest.json` #132