RMI-PACTA / workflow.prepare.pacta.indices

This repository is used to run indices through PACTA, and prepare them for transition monitor.

`run_pacta_data_preparation` should export iShares scraping info in a manifest file #26

Open jdhoffa opened 1 year ago

jdhoffa commented 1 year ago

Supersedes https://github.com/RMI-PACTA/pacta.data.preparation/issues/165

Full context copied manually: @jdhoffa:

Potentially useful information:

filename, file_extension, filesize, url, download_time, base_url, archive_url (once the URLs are archived)

@cjyetman can you validate which of these fields you think are or aren't useful to output (or if some are missing)? And also, to be clear, this manifest should relate only to the raw URL, correct?

Relates to https://github.com/RMI-PACTA/pacta.data.preparation/pull/162


I guess all of these are relevant... maybe file_extension is a bit overkill.

filename is critical so you know which file you're talking about

filesize is good to have so that you can verify the file you're looking at is actually the same one being described, because the file could have been modified and you wouldn't be able to tell. Maybe a checksum would be better, but that's a bit more difficult to verify for an average user.

url is the precise location the file was downloaded from. I think this is pretty fundamental to recording the provenance of the data/file

download_time is the precise time that the file was downloaded. This is important because files found at URLs are not necessarily stable, and often change over time, so the URL is not really enough to precisely record the provenance of the data/file

base_url was originally included here because we're capturing a JSON file that technically is not intended for anyone to access directly, and is not linked to or findable by any "normal" web browsing. Instead, the JSON file is used to feed a table on the page found at the "base_url". I have been in situations before where someone else, or a future version of myself, asked "where did you get this from? I can't find it anywhere on that site?", and base_url was the answer.

archive_url, if the page is getting archived (on archive.org), is also a convenience for anyone in the future trying to find this file or update this process, especially if the file has moved or completely disappeared from the site. One would be able to download the file again from this URL, exactly as it was at the time the archive was made. It's also a good indicator the file WAS archived, which is good to know.

These are all things to precisely record the provenance of the file, and facilitate someone in the future trying to understand something about where it came from, what it means, how to find a new version that's equivalent, etc.
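For illustration only, here is a minimal Python sketch of what one manifest entry with the fields discussed above could look like. The function name `build_manifest_entry`, the extra `sha256` field (the checksum idea mentioned under filesize), and the example.com URLs are all assumptions made for this sketch; none of them come from `run_pacta_data_preparation` or any PACTA package.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import PurePosixPath
from urllib.parse import urlparse


def build_manifest_entry(content, url, base_url, download_time,
                         expected_extension, archive_url=None):
    """Describe one downloaded file (hypothetical helper, not a PACTA API)."""
    return {
        # Last path segment of the URL, so readers know which file is meant.
        "filename": PurePosixPath(urlparse(url).path).name,
        # What the developer expected the file to be, not what the URL says.
        "file_extension": expected_extension,
        # Size in bytes: an easy check that the file is the one described.
        "filesize": len(content),
        # Assumed extra field: a stricter check than filesize alone.
        "sha256": hashlib.sha256(content).hexdigest(),
        # Exact location the file was downloaded from.
        "url": url,
        # UTC timestamp, because content at a URL can change over time.
        "download_time": download_time.astimezone(timezone.utc).isoformat(),
        # Human-facing page that the raw file feeds.
        "base_url": base_url,
        # Stays None until the URL has actually been archived.
        "archive_url": archive_url,
    }


# Placeholder values; the URLs below are not real iShares endpoints.
entry = build_manifest_entry(
    content=b'{"aaData": []}',
    url="https://example.com/funds/1234/holdings.ajax",
    base_url="https://example.com/funds/1234",
    download_time=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    expected_extension="json",
)
print(json.dumps(entry, indent=2))
```

Keeping `archive_url` as an explicit null until archiving happens preserves the "was this archived?" signal described above.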


Also... I think I had file_extension because sometimes JSON files like this don't even have an extension, because they come from an AJAX request or something... so it's convenient to know what type of file the original developer of the code/archiver expected the file to be, especially if the filename/URL is some random string of characters with no discernible meaning.
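A small Python sketch of that idea, assuming the manifest prefers the extension the developer expected and only falls back to whatever the URL path happens to carry. The helper name `resolve_file_extension` and the example.com URLs are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def resolve_file_extension(url, expected_extension=None):
    """Pick the extension to record in the manifest (hypothetical helper)."""
    # Suffix of the URL path, without the dot; empty when the path has none.
    suffix = PurePosixPath(urlparse(url).path).suffix.lstrip(".").lower()
    # The developer's expectation wins; the URL suffix is only a fallback.
    return expected_extension or suffix or None


# An AJAX-style URL with no extension: only the expectation is informative.
ajax_case = resolve_file_extension("https://example.com/a1b2c3?x=1", "json")
# No expectation given: fall back to the URL path suffix.
csv_case = resolve_file_extension("https://example.com/data.CSV")
# Neither source available: record nothing rather than guess.
unknown_case = resolve_file_extension("https://example.com/a1b2c3")
```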


cjyetman commented 1 year ago

related RMI-PACTA/workflow.data.preparation#25