falcontechnologies / wex-2024

Work experience 2024

Identify Data Set Change Rate - Bolu #10

Open psftc opened 2 weeks ago

psftc commented 2 weeks ago

Identify the rate of change of the data set. The purpose of this task is to determine which strategy makes the most sense for keeping a database up to date as the data set changes.

Purpose: To identify the rate of change of the various files in the Rebrickable set so that we can select a strategy for updating a database table with changes (in a future issue).

The task should:

  1. Collect several samples of each file over several days (it would probably be over several weeks in a normal project).
  2. Take a difference between two files with the same name from different dates. This means you'll need to figure out how to keep multiple copies over time and how to identify differences. You should already have a small program that you can use to get the data set from Rebrickable (a sketch of one way to organise dated snapshots follows this list).
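
One way to keep multiple copies over time is one directory per download date. Below is a minimal sketch; the CDN URL pattern and the file list are assumptions, so match them to the download program you already wrote:

```bash
#!/usr/bin/env bash
set -euo pipefail

# Assumed CDN path and file list -- adjust to match your existing program.
base_url="https://cdn.rebrickable.com/media/downloads"
files=(themes.csv.gz colors.csv.gz parts.csv.gz sets.csv.gz)

# One directory per day, so snapshots from different dates can be diffed.
snapdir="snapshots/$(date +%Y-%m-%d)"
mkdir -p "$snapdir"

for f in "${files[@]}"; do
    curl -fsSL "$base_url/$f" -o "$snapdir/$f"
    gunzip -kf "$snapdir/$f"   # keep the .gz, write the unzipped copy alongside it
done
```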

You can choose to write a small program to perform the difference, use the 'diff' command-line program under a Unix-type operating system (macOS or Windows WSL), or use a difference application or plug-in. The Microsoft VS Code IDE (Integrated Development Environment) has one built in, and your IDE may have one too.
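
For the command-line route, a minimal sketch of the comparison, assuming the dated snapshots/ layout from the sketch above (the dates shown are illustrative):

```bash
# Plain diff of one file across two snapshot dates;
# exit status 0 means identical, 1 means the files differ.
diff snapshots/2024-06-07/parts.csv snapshots/2024-06-10/parts.csv

# Or count only the changed lines: plain diff prefixes removed lines
# with '<' and added lines with '>'.
diff snapshots/2024-06-07/parts.csv snapshots/2024-06-10/parts.csv | grep -c '^[<>]'
```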

When making the comparison, be aware that processes can sometimes reorder rows in the CSV file. Obviously, you should make the comparison against the unzipped file and not the zipped file. Some tools will tell you only that the files differ; here you want to know how many rows change between days.
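
Because of the possible reordering, one approach is to sort both copies before diffing, so that rows which merely moved are not reported as changes (paths again illustrative):

```bash
# Process substitution feeds diff the sorted copies without temporary files.
# The header row gets sorted into the body, which is harmless for counting.
diff <(sort snapshots/2024-06-07/parts.csv) \
     <(sort snapshots/2024-06-10/parts.csv)
```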

You should already have at least one set of these files; use that as your baseline. Take a second set today and another on Monday. Compare the baseline against today's set, file by file, and then compare today's set against Monday's set.

It is highly likely that comparing the sizes of the files in zipped (or unzipped) form will be enough to determine that nothing has changed; however, prove your hypothesis by doing the diff at least once for each pair of files, then make a statement (e.g. same-size files plus a manual examination of the differences should allow a reasonable guess that a file of unchanged size probably hasn't changed). Be careful with the Parts file: while it is fairly certain that the Themes and Colors files don't change very often, I have no confidence that this is true for Parts and Sets.

Add a comment to this issue with the rate of change as a percentage, e.g. baseline to Friday: Colors.csv.gz: 0%, Parts.csv.gz: 0.5% (200 rows), etc. Alphabetic order is sufficient.
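
A rough way to turn a diff into that percentage, under the same illustrative path assumptions as above (note that a row whose content changed is counted twice here, once as removed and once as added):

```bash
old=snapshots/2024-06-07/parts.csv   # illustrative paths
new=snapshots/2024-06-10/parts.csv

# Sorted comparison so reordered-but-identical rows stay out of the count.
changed=$(diff <(sort "$old") <(sort "$new") | grep -c '^[<>]')
total=$(wc -l < "$old")

# Change rate = changed rows / baseline rows.
awk -v c="$changed" -v t="$total" 'BEGIN { printf "%.2f%% (%d rows)\n", 100 * c / t, c }'
```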

  3. Add a second comment to this issue with a recommendation on the frequency of updates. In a business environment, this data may change daily, weekly, monthly, monthly following a delay to calculate some derived values, quarterly, or on some other schedule. We want to have some confidence that we can capture changes in a timely manner without wasting computing resources. While this is not really an issue with this data set, if we were looking at gigabytes and hundreds of thousands or millions of rows, the time and cost could be non-trivial.
  4. Finally, add a third comment to this issue about how your difference method could be automated to test the difference between any two sets and produce a list of files that need further processing to ingest changed data. You do not have to implement this automation. If you used a tool, either investigate how you could drive that tool in an automated way (hint: there is probably a command-line version of the tool, so find the reference for that) or make a very crude estimate of how long it would take, and what you would write, in your favourite language to achieve this; in this case, a very rough estimate is fine. If you used a command-line tool, suggest a simple algorithm to achieve this in a shell language: include the name and version of the tool, the shell you used (and its version), and a simple algorithm that might be usable for the shell script (a sketch of one follows this list). If you wrote your own tool, document the algorithm, either in your README.md or a separate .md file, and include in your comment on this issue what additional error checks might be needed (e.g. missing file, missing directory or directory pair, etc.)
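
For the command-line route, a shell algorithm for that automated test could look like the sketch below. It assumes the dated snapshots/ layout from the earlier sketches, and the missing-file branch is one example of the error checks this issue asks about:

```bash
#!/usr/bin/env bash
# Compare every unzipped CSV in two snapshot directories and print the
# names of the files that differ, i.e. the files needing further processing.
old_dir=${1:?usage: compare-snapshots.sh OLD_DIR NEW_DIR}
new_dir=${2:?usage: compare-snapshots.sh OLD_DIR NEW_DIR}

for old in "$old_dir"/*.csv; do
    name=$(basename "$old")
    new="$new_dir/$name"
    if [[ ! -f "$new" ]]; then
        # Error check: the file exists in one snapshot but not the other.
        echo "missing in $new_dir: $name" >&2
        continue
    fi
    # -q asks diff only for its exit status; sorting ignores row reordering.
    if ! diff -q <(sort "$old") <(sort "$new") >/dev/null; then
        echo "$name"
    fi
done
```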