AndhikaWB closed this issue 9 months ago
The problem with option 1 is that `INSERT INTO` doesn't support a `WHERE` clause (so a complex sub-query would be needed, if that's even possible). The problem with option 2 is that we need to make a separate script, where the steps should be like this:
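For reference, the usual workaround for the missing `WHERE` is `INSERT INTO ... SELECT`, where the filtering happens on the `SELECT` side. A minimal sqlite3 sketch (the `staging`/`jobs` tables and their columns are made up for illustration, not the project's actual schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Hypothetical tables, for illustration only
cur.execute("CREATE TABLE staging (company_id INT, time_recorded TEXT)")
cur.execute("CREATE TABLE jobs (company_id INT, time_recorded TEXT)")

cur.executemany("INSERT INTO staging VALUES (?, ?)",
                [(1, "2023-01-01"), (1, "2023-01-01"), (2, "2023-01-02")])

# INSERT itself has no WHERE, but the SELECT feeding it can filter,
# e.g. skip rows that already exist in the target table
cur.execute("""
    INSERT INTO jobs (company_id, time_recorded)
    SELECT DISTINCT company_id, time_recorded FROM staging AS s
    WHERE NOT EXISTS (
        SELECT 1 FROM jobs AS j
        WHERE j.company_id = s.company_id
          AND j.time_recorded = s.time_recorded
    )
""")
con.commit()
print(cur.execute("SELECT * FROM jobs").fetchall())
```

This is the "complex sub-query" route; whether it fits the real schema is another question.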
Or we can also use milliseconds/microseconds as the time and make `time_recorded` the primary key, but we still need to do the above steps to fix the previous values in the table. This solution will also break once quantum computers are released lol.
I think it's easier to fix these problems using Pandas rather than SQL itself.
Yeah, for future values I think turning time_recorded and company_id into the primary key would be sufficient. For the data already uploaded, I'll just create an SQL query to remove rows with the same company_id and time. That should work, right?
I'll update that tomorrow
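One way to write that cleanup query in SQLite is to keep only the lowest `rowid` per (company_id, time_recorded) group and delete the rest (table name is a guess for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE jobs (company_id INT, time_recorded TEXT)")
cur.executemany("INSERT INTO jobs VALUES (?, ?)",
                [(1, "2023-01-01"), (1, "2023-01-01"), (2, "2023-01-02")])

# Delete every row except the first (lowest rowid) in each duplicate group
cur.execute("""
    DELETE FROM jobs
    WHERE rowid NOT IN (
        SELECT MIN(rowid) FROM jobs
        GROUP BY company_id, time_recorded
    )
""")
con.commit()
print(cur.execute("SELECT * FROM jobs").fetchall())
```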
> Yeah, for future values I think turning the time_recorded and company_id as the primary key would be sufficient.
Isn't a primary key supposed to be unique? I've never used 2 primary keys before, but I think each primary key represents its own column (not grouped together if there are multiple primary keys). In your case, making `time_recorded` a primary key will make `time_recorded` unique, which may not allow duplicated `time_recorded` values like the screenshot below. Or have I been learning something wrong this whole time?
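That intuition can be checked directly: with `time_recorded` alone as the primary key, a second row with the same time is rejected even for a different company (a small sqlite3 experiment, not the project's actual schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Single-column primary key: time_recorded must be unique on its own
cur.execute("CREATE TABLE t (time_recorded TEXT PRIMARY KEY, company_id INT)")
cur.execute("INSERT INTO t VALUES ('2023-01-01', 1)")
try:
    cur.execute("INSERT INTO t VALUES ('2023-01-01', 2)")  # same time, new company
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```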
Also, making `company_id` unique may cause loss of information if you intend to make it time-series data in the future. People can measure company growth, mass layoffs, or turnover rate. Though I guess it's a company-specific thing, so it will rarely be needed (e.g. only when applying to that company).
It's up to you I guess, I myself rarely need that data.
I just changed it so time_recorded and company_id are a composite ID. This means there shouldn't be any duplicate rows with the same time and company id. Multiple rows can have the same time or the same id, just not both at once. This should prevent the problem you initially described, with multiple consecutive jobs by the same company filling up unnecessary space.
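A quick sketch of what that composite key allows and rejects (illustrative schema, not the exact one in create_db.py):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""
    CREATE TABLE jobs (
        company_id INT,
        time_recorded TEXT,
        PRIMARY KEY (company_id, time_recorded)
    )
""")
cur.execute("INSERT INTO jobs VALUES (1, '2023-01-01')")
cur.execute("INSERT INTO jobs VALUES (2, '2023-01-01')")  # same time, OK
cur.execute("INSERT INTO jobs VALUES (1, '2023-01-02')")  # same company, OK
try:
    cur.execute("INSERT INTO jobs VALUES (1, '2023-01-01')")  # both repeat: rejected
except sqlite3.IntegrityError:
    print("duplicate (company_id, time_recorded) rejected")
```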
Oh right, a composite primary key. In that case you should undo the changes I made in the previous commit, see 0ed9cb1 (create_db.py).
When there are too many job postings from the same company at the time of the job search, that company will be added multiple times when executing the details retriever.
See screenshot below:

![image](https://github.com/ArshKA/LinkedIn-Job-Scraper/assets/17119394/fc430f0d-d1a8-47af-bcfa-9452011606a2)
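If the companies end up in their own table keyed by `company_id`, one fix would be `INSERT OR IGNORE`, which silently skips rows whose primary key already exists, so re-running the details retriever can't add a company twice (a sketch with assumed table/column names):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE companies (company_id INT PRIMARY KEY, name TEXT)")

# The same company appears in many job postings; OR IGNORE skips
# rows whose primary key already exists instead of raising an error
for cid, name in [(1, "Acme"), (1, "Acme"), (2, "Globex")]:
    cur.execute("INSERT OR IGNORE INTO companies VALUES (?, ?)", (cid, name))
con.commit()

print(cur.execute("SELECT COUNT(*) FROM companies").fetchone()[0])  # 2
```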