EwoutH / shipping-data

A public collection of shipping data from South Africa to The Netherlands
GNU General Public License v3.0

Incorporate Timeliness of data #39

Open imvs95 opened 1 year ago

imvs95 commented 1 year ago

For the web scrapers, older data is less up-to-date than newer data.

EwoutH commented 1 year ago

I think this could best be done by adding a "date collected" column containing the date on which each data row was collected. The advantage is that we keep all the raw data this way and can track how old the data is. Then we could modify the scripts to either keep both the old and the new rows or use only the newest data available when combining data.

Now that I think of it, all the raw data files already contain the date in their name, so a separate column in the raw data isn't needed. Maybe the combined data could contain "First detected" and "Last updated" date columns.

This also depends on what our criteria are for two routes to be the same, and of course this could change over time.
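
To make the idea concrete, here is a rough sketch of how the combining step might derive "First detected" and "Last updated" from the dates in the file names. It is not taken from the repository's scripts. The file-name pattern (`routes_YYYY-MM-DD.csv`), the function names, the column names, and the sample rows are all made up for illustration, and "Last updated" is assumed to mean the date of the newest file in which a route appears. The `key_columns` argument stands in for whatever criteria end up deciding that two routes are the same.

```python
import re
from datetime import date

# Hypothetical naming scheme: an ISO date somewhere in the file name,
# e.g. "routes_2022-03-01.csv". The real raw files may be named differently.
DATE_IN_NAME = re.compile(r"(\d{4})-(\d{2})-(\d{2})")


def date_from_filename(filename):
    """Extract the collection date embedded in a raw data file name."""
    match = DATE_IN_NAME.search(filename)
    if match is None:
        raise ValueError(f"no date found in file name: {filename!r}")
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)


def combine(files, key_columns):
    """Merge rows from several dated files into one record per route.

    files: mapping of file name -> list of row dicts.
    key_columns: columns that decide whether two rows describe the same
    route (an assumption; the actual criteria are still undecided).
    """
    combined = {}
    # Walk the files from oldest to newest so newer rows overwrite older ones.
    for filename in sorted(files, key=date_from_filename):
        collected = date_from_filename(filename)
        for row in files[filename]:
            key = tuple(row[column] for column in key_columns)
            if key in combined:
                first_detected = combined[key]["first_detected"]
            else:
                first_detected = collected
            combined[key] = {
                **row,
                "first_detected": first_detected,
                "last_updated": collected,
            }
    return list(combined.values())


# Made-up sample data, only to show the shape of the input and output.
sample_files = {
    "routes_2022-03-01.csv": [
        {"origin": "Cape Town", "destination": "Rotterdam", "days": 21},
    ],
    "routes_2022-01-15.csv": [
        {"origin": "Cape Town", "destination": "Rotterdam", "days": 23},
        {"origin": "Durban", "destination": "Rotterdam", "days": 25},
    ],
}
sample_result = combine(sample_files, ["origin", "destination"])
```

In this sketch the newest row wins for each route, while the raw files stay untouched, so switching to "keep both" later would only require changing `combine`. Changing the route criteria amounts to passing a different `key_columns` list.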

For now I see no immediate action, since all collected data is already date-stamped in the file name. So I agree with a low priority on this.