Open edyeo opened 1 year ago
6.Data a. Below two arrays show the ten most common names and its frequencies in our data w/wo titlelize.
It’s very difficult to understand the context of this question. But I add my idea about this problem.
Firstly, we need to check the name field is a valuable column for our data consumers.
If it is important the accidentally duplicated name should not be provided.
If it is not important this is the field to eliminate during ingestion.
We can also see that the name None has over 100 rows solely among other multiple counted names.
a-1. How can we deal with name None? Other names consist of two words concatenation. Only 118 names are None out of 10k name records.
Can I know what kind of issues is related to two words concatenation?
a-2. How can we deal with name The The and THE THE? Should we distinguish two names separately? With only consider the given data, explain your opinion.
There are facts
Here is an assumption
There can be a person who has visited multiple institutions.
However the same name has a too wide age ranges to consider as the same person.
So that those names should be considered undistinguishable individuals.
THE THE and The The are in the group above.
In conclusion we don’t need to distinguish both names since they are already in the unreliable group to speculate.
a-3. If the name tHe thE comes, how can you deal with?
Capital letter is treated as distinguisher. It’s better to keep itself.
b. You dropped data which has null on AGE field. Can you explain why did you take that policy? If all records need to be stored, what approach could you take to meet the requirement?
=> This is a bug.
The policy is no dropping. On README, I've already mentioned "The ingestion logic ensures that records with partly missing data are not dropped."
To meet the intention,
raising exception(pipeline/utils.py#L19) should be replaced as logging.error() instead.
Hello, @JinwoongJung
I appreciate your detailed feedback.
(1) reproducible with small fixes. I aplogize your time for the check. Originally the code eliminates header by row index but I missed it.
(3) Why did you name this function “upsert”? (pipeline/dao.py#L23) => This name comes from initial plan to DB schema design to set unique key for the case of replay.
(4) Official python docker image Introduction. You're right. The only reason was M1 macbook needs to build on the different platform ('linux/amd64) to test.
(5) Implementation a. with higher data volume? How could you change that part?
Worker2 can use threads to ingest data from GCS to DB. The ingestion performance can be scaled.
However it also needs an parameter to adjust the number of thread to consider overload of the instance and DB. Temporal I/O action is Intentionally avoided for this kind of situation.
b. How would you improve your solution to handle failure of Worker1?
To do that the pipeline needs its own operational table to store its job status.
It assumes that storage table’s unique can be setup
Here is a whole scenario
Job status: START, COMPLETE, FAILED
Task status:
stage1: Worker1 gets index files
stage2: Worker1 gets candidate object list from each index file & upsert the name into the table and set status ‘START’
When Worker2 completes its task it updates task status ‘COMPLETED’ or ‘FAILED’
Now when you restart Worker1
Do you have reason to split the task and introduce two workers?
The purpose was scalability.
I assumed there are worker1(container), queue(e.g. Kafka), worker2(container, subscribing the queue) separately on a cloud service.
With this architecture, the number of worker2 container can be adjusted according to the situation.
Worker2 can also use multi thread when there is no restriction from a data provider side and DB write overhead.
How can your DB schema be normalized? (As 1NF, 2NF, 3NF manner) DB schema has id, name, age, institution, activity, comment
c. How can your DB schema be normalized? (As 1NF, 2NF, 3NF manner) DB schema has id, name, age, institution, activity, comment
I would consider primary key is [id, name, age] here.
1NF
A relation is in 1NF. Each column contains an single value.
2NF
Age 는 name에 종속,
we can create another table named person
record: [id, institution, activity, comments]
person: [id, name, age]
3NF
institution and activity의 relation을 고려했으나,
각 institution이 모든 타입의 activity를 갖는 것을 확인