AlexanderSenf / ucla_data_processing

An example for data processing

Address theoretical questions #4

Open AlexanderSenf opened 3 years ago

AlexanderSenf commented 3 years ago

The issue with corrupt data is not trivial. Within the limitations of the problem statement, the scope of data corruption is that at most one letter in a 4-letter product code may change randomly. This makes it possible to find the closest existing product code (for example, BENG is closest to BEVG, so it can automatically be corrected). Logically, this means comparing an unknown code against all existing product codes in the database. This is initially implemented using a Levenshtein-based similarity ratio, which produces a score of 75 for a product code with one changed letter. Problems arise when there is ambiguity, i.e. when more than one product code scores 75. In that case additional information is required; the questions below explore several possibilities.
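As a rough illustration of the closest-match search described above (a minimal sketch: the code list is hypothetical, and `difflib` from the standard library stands in for the Levenshtein-based ratio):

```python
from difflib import SequenceMatcher

# Hypothetical set of known product codes; only BEVG appears in the text above.
KNOWN_CODES = {"BEVG", "FOOD", "HLTH", "ELEC"}


def similarity(a: str, b: str) -> int:
    """0-100 similarity score; one changed letter in a 4-letter code scores 75."""
    return round(SequenceMatcher(None, a, b).ratio() * 100)


def closest_codes(unknown: str, threshold: int = 75) -> list[str]:
    """Return every known code that ties for the best score at or above the threshold."""
    scores = {code: similarity(unknown, code) for code in KNOWN_CODES}
    best = max(scores.values(), default=0)
    if best < threshold:
        return []
    return [code for code, score in scores.items() if score == best]


matches = closest_codes("BENG")
if len(matches) == 1:
    corrected = matches[0]  # unambiguous: correct automatically (BENG -> BEVG)
elif matches:
    print(f"Ambiguous code, additional information needed: {matches}")
else:
    print("No plausible product code found")
```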

  • If only some of the files in the customer purchase database are corrupt, how would you address this problem going forward?

The approach would remain the same as before, with exception handling procedures for cases where a product code does not have a match in the database. This is how the answer to question 3 is currently implemented. Having correct as well as corrupted data available allows for some statistical analysis of the existing data, to enable automated correction of erroneous product codes - for example by associating known mistakes with the correct product code, so that an error in a product code triggers a fast lookup instead of a full search.
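A sketch of that fast-lookup idea, reusing `KNOWN_CODES` and `closest_codes` from the sketch above (the cache contents are purely illustrative):

```python
# Cache of previously corrected product codes, learned from past corruptions.
KNOWN_MISTAKES: dict[str, str] = {"BENG": "BEVG"}  # illustrative entry


def correct_code(code: str) -> str:
    """Validate or correct a product code, preferring a cached correction."""
    if code in KNOWN_CODES:
        return code                        # already a valid code
    if code in KNOWN_MISTAKES:
        return KNOWN_MISTAKES[code]        # fast lookup, no search required
    matches = closest_codes(code)          # fall back to the full search
    if len(matches) == 1:
        KNOWN_MISTAKES[code] = matches[0]  # remember this mistake for next time
        return matches[0]
    raise ValueError(f"Cannot unambiguously correct product code {code!r}")
```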

  • What if the database was extremely large?

The key would be to pre-process as much as possible, so that ongoing processing of purchases can use existing index structures and/or ML algorithms to correct errors. The key to these approaches is to allow the index/algorithm to be updated as new data is encountered, rather than rebuilt from scratch every time. Whenever large amounts of data are involved it may also be useful to create a hierarchy (e.g. by product code letters) or to break the whole database into pieces/shards, distribute the processing among multiple parallel resources, and then combine the results. One such pre-processing step is sketched below.

I think this probably makes more sense to visualize in the context of large, petabyte-sized genomics databases.
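One way to picture the pre-processing idea is a precomputed index of every string that is one substitution away from a valid code: building it is a one-off cost (which could be done per shard and merged), lookups are O(1), and new codes can be indexed incrementally. A minimal sketch, assuming uppercase 4-letter codes and an illustrative code list:

```python
import string
from collections import defaultdict

VariantIndex = dict[str, set[str]]


def add_code(index: VariantIndex, code: str) -> None:
    """Incrementally index a single code; no rebuild from scratch required."""
    for i in range(len(code)):
        for letter in string.ascii_uppercase:
            variant = code[:i] + letter + code[i + 1:]
            index[variant].add(code)


def build_index(codes: set[str]) -> VariantIndex:
    """One-off pre-processing step over the whole (possibly sharded) code list."""
    index: VariantIndex = defaultdict(set)
    for code in codes:
        add_code(index, code)
    return index


index = build_index({"BEVG", "FOOD", "HLTH"})
print(index["BENG"])  # {'BEVG'} -> unambiguous correction via a single lookup
```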

  • How do you prepare for future data corruptions?

Does this mean additional types of corruption (e.g. in the subtype)? Or an expanded scale of corruption (e.g. two letters wrong in a product code)? Or corruption of data already stored in the database (e.g. bit rot)?

I would say that, ideally, any approach to dealing with data corruption should be able to scale to future incidents of corrupt data. Anticipating new types of corruption requires the whole system to be designed with those incidents in mind. In general, all incoming data has to be validated; no assumptions can be made, otherwise corrupt data leads to undefined system responses. At a minimum:
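As one hypothetical illustration of that minimum (the 30-character rule comes from the barcode format described in the summary below; the function names and the character-set check are my own assumptions):

```python
def validate_line(line: str) -> str | None:
    """Return None if a raw barcode line is acceptable, otherwise a reason string."""
    if len(line) != 30:
        return f"unexpected length {len(line)} (expected 30)"
    if not line.isalnum():          # assumed character set for illustration
        return "unexpected characters in barcode"
    return None


def process_file(lines: list[str]) -> tuple[list[str], list[tuple[str, str]]]:
    """Split input into accepted lines and a quarantine of (line, reason) pairs,
    so that corrupt records never reach downstream code in an undefined state."""
    accepted, quarantined = [], []
    for raw in lines:
        line = raw.strip()
        reason = validate_line(line)
        if reason is None:
            accepted.append(line)
        else:
            quarantined.append((raw, reason))
    return accepted, quarantined
```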

  • Write a brief summary of your approach

The approach I took for question 3 is much less complex; it does not take into account new kinds of errors or very large databases. The processing script performs some basic input data validation as the data is processed. This happens directly in the functions designed to handle each part of the information.

In my case, there is one function that takes a barcode line and splits it into its three components (product code, subtype, unique id). The length of each barcode is 30 characters. This is validated by the function reading each line from the file, so the barcode splitting function is not even called unless the string length is correct, and the length does not need to be validated again inside that function. When the barcode is split into its three components, the only part that should match a given list of elements is the product code, so a test is performed to validate that the product code from the file matches a known product code. This is where exception handling is added: if the product code lookup fails, there is code to look for any product code within a 1-character difference. If exactly one is found, that is the product code returned; otherwise an error is raised. This works well as long as there are no ambiguities, and it runs fast enough on the small datasets provided.

For future expansion I would add a lookup table storing all previously encountered corrupt product codes, so that if the same error happens again, a simple lookup is sufficient, rather than a search against all product codes.
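A sketch of how those pieces could fit together, reusing `correct_code` from the earlier sketch; the 4/6/20 split of the 30-character barcode into product code, subtype and unique id is an assumption for illustration only, not the actual layout:

```python
from typing import NamedTuple


class Barcode(NamedTuple):
    product_code: str
    subtype: str
    unique_id: str


# Assumed field widths (4 + 6 + 20 = 30); the real layout may differ.
PRODUCT_LEN, SUBTYPE_LEN, UID_LEN = 4, 6, 20


def split_barcode(line: str) -> Barcode:
    """Split a pre-validated 30-character line into its three components.

    The length check has already happened in the line-reading code, so it is
    not repeated here; only the product code has to match a known value, and
    correct_code() handles the 1-character repair or raises on ambiguity.
    """
    product = correct_code(line[:PRODUCT_LEN])
    subtype = line[PRODUCT_LEN:PRODUCT_LEN + SUBTYPE_LEN]
    unique_id = line[PRODUCT_LEN + SUBTYPE_LEN:]
    return Barcode(product, subtype, unique_id)
```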