edyeo / first_app

The first app for the Ruby on Rails Tutorial

test #1

Open edyeo opened 1 year ago

edyeo commented 1 year ago

Hello, @JinwoongJung

I appreciate your detailed feedback.

(1) Reproducible with small fixes. I apologize for taking your time on the check. Originally the code was supposed to eliminate the header by row index, but I missed it.

(3) Why did you name this function “upsert”? (pipeline/dao.py#L23) => This name comes from the initial DB schema design plan, which was to set a unique key for the replay case.

(4) Official Python Docker image introduction. You're right. The only reason was that an M1 MacBook needs to build for a different platform (linux/amd64) to test.

(5) Implementation a. With higher data volume, how could you change that part?

Worker2 can use threads to ingest data from GCS into the DB, so the ingestion performance can be scaled.

However, it also needs a parameter to adjust the number of threads, to account for overload on the instance and the DB. Temporary I/O bottlenecks are intentionally avoided for this kind of situation.
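As a minimal sketch of this threaded ingestion (the `download_object` and `insert_rows` helpers below are hypothetical stand-ins for the real GCS and DB calls, not part of the repo):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Placeholder stubs standing in for the real GCS download and DB write;
# the actual implementations would live in the pipeline code.
def download_object(object_name: str) -> list:
    return [object_name]

def insert_rows(rows: list) -> None:
    pass

def ingest_object(object_name: str) -> str:
    rows = download_object(object_name)  # fetch the candidate object from GCS
    insert_rows(rows)                    # write the rows into the DB
    return object_name

def ingest_all(object_names: list, max_threads: int = 4) -> None:
    # max_threads is the tuning parameter mentioned above: it bounds
    # concurrency so the instance and the DB are not overloaded.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = [pool.submit(ingest_object, n) for n in object_names]
        for future in as_completed(futures):
            print(f"ingested {future.result()}")
```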

b. How would you improve your solution to handle a failure of Worker1?

To do that, the pipeline needs its own operational tables to store its job status.

  1. a job status table, columns = [job_name, stage, status]
  2. a job operation table, columns = [index_file_name, candidate_object_name, status]

It assumes that a unique key can be set up on the storage table in the database.
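A minimal sketch of those operational tables and the unique key, using sqlite3 for illustration (the column types and the storage columns are assumptions):

```python
import sqlite3

# Illustrative DDL; the real pipeline would use its own database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- job status table: one row per pipeline stage
CREATE TABLE job_status (
    job_name TEXT,
    stage    TEXT,  -- e.g. stage1, stage2
    status   TEXT,  -- START / COMPLETE / FAILED
    PRIMARY KEY (job_name, stage)
);

-- job operation table: one row per candidate object
CREATE TABLE job_operation (
    index_file_name       TEXT,
    candidate_object_name TEXT,
    status                TEXT,  -- START / COMPLETED / FAILED
    PRIMARY KEY (index_file_name, candidate_object_name)
);

-- storage table with the unique key that makes replays idempotent
CREATE TABLE storage (
    id INTEGER, name TEXT, age INTEGER,
    institution TEXT, activity TEXT, comment TEXT,
    UNIQUE (id, name, age)
);
""")
```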

Here is the whole scenario:

Job status: START, COMPLETE, FAILED

Task status:

stage1: Worker1 gets the index files

stage2: Worker1 gets the candidate object list from each index file, upserts the names into the table, and sets the status to ‘START’

When Worker2 completes its task, it updates the task status to ‘COMPLETED’ or ‘FAILED’.

Now, when you restart Worker1:

  1. It checks the job status to see whether all index files are loaded; if not, it retries.
  2. If it has passed stage1, Worker1 puts the uncompleted candidate objects into the queue as tasks.
  3. Worker2 consumes the queue and restarts the ingestion. Since there is a unique key, duplicate records do not occur (a sketch follows below).
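A minimal sketch of that restart logic (the helper functions are hypothetical stand-ins for the operational-table queries and the queue publisher):

```python
# Placeholder stubs; the real versions would query job_status /
# job_operation and publish to the message queue (e.g. Kafka).
def stage1_complete(job_name: str) -> bool:
    return True

def load_index_files(job_name: str) -> None:
    pass

def uncompleted_objects(job_name: str) -> list:
    return []

def enqueue(object_name: str) -> None:
    print(f"re-queued {object_name}")

def restart_worker1(job_name: str) -> None:
    # 1. If stage1 did not finish, retry loading the index files.
    if not stage1_complete(job_name):
        load_index_files(job_name)
    # 2. Put every uncompleted candidate object back on the queue.
    #    The unique key on the storage table makes the replay idempotent.
    for object_name in uncompleted_objects(job_name):
        enqueue(object_name)
```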

Do you have a reason to split the task and introduce two workers?

The purpose was scalability.

I assumed there are Worker1 (a container), a queue (e.g. Kafka), and Worker2 (a container subscribing to the queue), deployed separately on a cloud service.

With this architecture, the number of Worker2 containers can be adjusted according to the situation.

Worker2 can also use multiple threads when there is no restriction from the data provider side and no DB write overhead.

c. How can your DB schema be normalized (in a 1NF, 2NF, 3NF manner)? The DB schema has id, name, age, institution, activity, comment.

I would consider the primary key to be [id, name, age] here.

1NF

The relation is in 1NF: each column contains a single value.

2NF

Age is dependent on name (a partial dependency on the key), so we can create another table named person:

record: [id, institution, activity, comments]

person: [id, name, age]

3NF

I considered the relation between institution and activity, but confirmed that each institution has every type of activity, so there is no transitive dependency to separate.
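A minimal sketch of the resulting split, using sqlite3 for illustration (column types are assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- person holds the attributes that depend on the person alone,
-- removing the partial dependency of age on name (2NF)
CREATE TABLE person (
    id   INTEGER PRIMARY KEY,
    name TEXT,
    age  INTEGER
);

-- record keeps the per-visit facts; no further split is needed for
-- institution/activity, since each institution has every activity type (3NF)
CREATE TABLE record (
    id          INTEGER REFERENCES person(id),
    institution TEXT,
    activity    TEXT,
    comments    TEXT
);
""")
```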

edyeo commented 1 year ago

6. Data a. The two arrays below show the ten most common names and their frequencies in our data, with and without titleize.

It’s very difficult to understand the context of this question, but I’ll add my thoughts on the problem.

Firstly, we need to check whether the name field is a valuable column for our data consumers.

If it is important, accidentally duplicated names should not be provided.

If it is not important, this is a field to eliminate during ingestion.

We can also see that, among the names counted multiple times, the name None alone has over 100 rows.
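As a minimal sketch of how the two frequency arrays could be produced (the sample names below are illustrative, not the real data):

```python
from collections import Counter

# Illustrative sample only; the real data has ~10k name records.
names = ["The The", "THE THE", "None", "None", "John Smith"]

raw_counts = Counter(names)                    # case-sensitive ("w/o titleize")
titleized = Counter(n.title() for n in names)  # case-folded ("w/ titleize")

print(raw_counts.most_common(10))  # 'THE THE' and 'The The' stay separate
print(titleized.most_common(10))   # 'THE THE' folds into 'The The'
```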

a-1. How can we deal with the name None? Other names consist of a two-word concatenation. Only 118 out of 10k name records are None.

May I ask what kind of issue is related to the two-word concatenation?

a-2. How can we deal with the names The The and THE THE? Should we distinguish the two names? Considering only the given data, explain your opinion.

Here are the facts:

  1. Normally one name has one record.
  2. There are several names with multiple records.

Here is an assumption:

There can be a person who has visited multiple institutions.

However, the same name has too wide an age range to be considered the same person.

So those names should be considered indistinguishable individuals.

THE THE and The The are in the group above.

In conclusion, we don’t need to distinguish the two names, since they are already in the group that is too unreliable to speculate about.
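A minimal pandas sketch of that age-range check on illustrative records (the 20-year threshold is an assumption):

```python
import pandas as pd

# Illustrative records; the real check would run on the full table.
df = pd.DataFrame({
    "name": ["The The", "The The", "THE THE", "John Smith"],
    "age":  [12, 74, 45, 30],
})

# Age span per name; an implausibly wide span marks the name as a
# group of indistinguishable individuals rather than one person.
age_span = df.groupby("name")["age"].agg(lambda s: s.max() - s.min())
unreliable = age_span[age_span > 20].index.tolist()
print(unreliable)  # ['The The']
```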

a-3. If the name tHe thE comes in, how can you deal with it?

Capital letters are treated as a distinguisher, so it’s better to keep the name as it is.

b. You dropped data which has null in the AGE field. Can you explain why you took that policy? If all records need to be stored, what approach could you take to meet the requirement?


=> This is a bug.

The policy is no dropping. In the README, I've already mentioned: "The ingestion logic ensures that records with partly missing data are not dropped."

To meet the intention, raising an exception (pipeline/utils.py#L19) should be replaced with logging.error() instead.
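A minimal sketch of the intended change (the surrounding validation function is illustrative, not the actual pipeline/utils.py):

```python
import logging

logger = logging.getLogger(__name__)

def validate_age(record: dict) -> dict:
    if record.get("age") is None:
        # Before (the bug): a raised exception dropped the whole record.
        # After: log the problem and keep the record, matching the
        # "no dropping" policy stated in the README.
        logger.error("AGE is null for record %s", record.get("id"))
    return record
```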