ArshKA / LinkedIn-Job-Scraper

LinkedIn scraper to retrieve and store a live stream of job postings

Prevent adding company to employee_counts when last recorded time is less than X days #4

Closed AndhikaWB closed 9 months ago

AndhikaWB commented 9 months ago

When there are too many job postings from the same company at the time of the job search, that company will be added multiple times when the details retriever is executed.

  1. There should be a check that prevents adding the company again if the time since its last record is too short, or
  2. We could use a separate script to remove duplicated data and reorder the primary key, with a time threshold (e.g. in days) as the parameter. This should only be executed after everything is done, since it would be expensive to run every time new data is received.

See the screenshot below: [screenshot of the same company added multiple times]

AndhikaWB commented 9 months ago

The problem with option 1 is that INSERT INTO doesn't support a WHERE clause directly (so a complex sub-query would be needed, if that's even possible; see the sketch after the list below). The problem with option 2 is that we would need a separate script, with steps roughly like this:

  1. DELETE the record_id column, or SELECT all columns except record_id
  2. Remove duplicated rows using DISTINCT
  3. Remove rows where the gap between recorded times is too short, using a WHERE clause? (optional, may be pretty hard using SQL alone)
  4. Re-add the primary key afterwards (not sure whether SQLite supports this; ALTER TABLE can't add a PRIMARY KEY column, so the table may need to be rebuilt)
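
That said, SQLite does let you guard an insert with a sub-query via the `INSERT INTO ... SELECT ... WHERE NOT EXISTS` form, so option 1 may be workable after all. A minimal sketch, assuming a table `employee_counts` with columns `company_id`, `employee_count`, and `time_recorded` stored as a Unix timestamp (the column names, database path, and 7-day threshold are all assumptions, not the project's actual schema):

```python
import sqlite3
import time

conn = sqlite3.connect("linkedin.db")  # path is an assumption

MIN_AGE_SECONDS = 7 * 24 * 3600  # the "X days" threshold; 7 is arbitrary

def insert_company_count(company_id: int, employee_count: int) -> None:
    """Insert a row only if this company has no record newer than
    MIN_AGE_SECONDS ago (column names are assumptions)."""
    now = int(time.time())
    conn.execute(
        """
        INSERT INTO employee_counts (company_id, employee_count, time_recorded)
        SELECT ?, ?, ?
        WHERE NOT EXISTS (
            SELECT 1 FROM employee_counts
            WHERE company_id = ? AND time_recorded > ?
        )
        """,
        (company_id, employee_count, now, company_id, now - MIN_AGE_SECONDS),
    )
    conn.commit()
```
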
AndhikaWB commented 9 months ago

Or we could also use milliseconds/microseconds for the time and make time_recorded the primary key, but we would still need the steps above to fix the existing values in the table. This solution will also break once quantum computers are released lol.

I think it's easier to fix these problems using Pandas rather than SQL itself.
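
For example, a rough Pandas version of steps 1-4 might look like this (a sketch only; record_id, company_id, time_recorded, the database path, and the 7-day gap are the same assumptions as above):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("linkedin.db")  # path is an assumption

df = pd.read_sql("SELECT * FROM employee_counts", conn)

# Steps 1-2: drop the old surrogate key, then drop exact duplicates
df = df.drop(columns=["record_id"]).drop_duplicates()

# Step 3: per company, keep a row only if it comes at least X days
# after the previously kept row (assumes time_recorded is Unix seconds)
df["ts"] = pd.to_datetime(df["time_recorded"], unit="s")
df = df.sort_values(["company_id", "ts"])

def keep_spaced(group, min_gap=pd.Timedelta(days=7)):
    kept, last = [], None
    for idx, ts in group["ts"].items():
        if last is None or ts - last >= min_gap:
            kept.append(idx)
            last = ts
    return group.loc[kept]

df = df.groupby("company_id", group_keys=False).apply(keep_spaced)

# Step 4: write the cleaned rows back; to_sql recreates the table,
# so a fresh primary key would have to be re-declared separately
df.drop(columns=["ts"]).to_sql(
    "employee_counts", conn, if_exists="replace", index=False
)
```
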

ArshKA commented 9 months ago

Yeah, for future values I think making time_recorded and company_id the primary key would be sufficient. For the data already uploaded, I'll just create a SQL query to remove rows with the same company_id and time. That should work, right?
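
Something along these lines, maybe (a sketch, relying on SQLite's implicit rowid column; table and column names assumed as before). It keeps the lowest rowid per (company_id, time_recorded) pair and deletes the rest:

```python
import sqlite3

conn = sqlite3.connect("linkedin.db")  # path is an assumption

# Keep one row per (company_id, time_recorded) pair; MIN(rowid)
# picks a stable survivor for each group
conn.execute(
    """
    DELETE FROM employee_counts
    WHERE rowid NOT IN (
        SELECT MIN(rowid)
        FROM employee_counts
        GROUP BY company_id, time_recorded
    )
    """
)
conn.commit()
```
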

ArshKA commented 9 months ago

I'll update that tomorrow

AndhikaWB commented 9 months ago

> Yeah, for future values I think making time_recorded and company_id the primary key would be sufficient.

Isn't a primary key supposed to be unique? I've never used two primary keys before, but I think each primary key represents its own column (not grouped together if there are multiple primary keys). In your case, making time_recorded a primary key would force time_recorded to be unique, which may not allow duplicated time_recorded values like in the screenshot below. Or have I been learning this wrong the whole time? [screenshot]

Also, making company_id unique may cause loss of information if you intend to make this time-series data in the future. People could use it to measure company growth, mass layoffs, or turnover rate. Though I guess it's a company-specific thing, so it will rarely be needed (e.g. only when applying to that company).

It's up to you, I guess; I myself rarely need that data.

ArshKA commented 9 months ago

I just changed it so time_recorded and company_id form a composite primary key. This means there shouldn't be any duplicate rows with the same time and company id. Multiple rows can share the same time or the same id, just not both at once. This should prevent the problem you initially described, with multiple consecutive jobs from the same company filling up unnecessary space.
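
For reference, SQLite declares a composite key as a table-level constraint, and `INSERT OR IGNORE` then skips rows that collide on the pair. A minimal sketch (the non-key columns and the example values are assumptions):

```python
import sqlite3

conn = sqlite3.connect("linkedin.db")  # path is an assumption

conn.execute(
    """
    CREATE TABLE IF NOT EXISTS employee_counts (
        company_id     INTEGER,
        employee_count INTEGER,
        time_recorded  INTEGER,
        PRIMARY KEY (company_id, time_recorded)
    )
    """
)

# A duplicate (company_id, time_recorded) pair is silently skipped
# instead of raising sqlite3.IntegrityError
conn.execute(
    "INSERT OR IGNORE INTO employee_counts VALUES (?, ?, ?)",
    (1337, 4200, 1700000000),
)
conn.commit()
```
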

AndhikaWB commented 9 months ago

Oh right, a composite primary key. In that case you should undo the changes I made in the previous commit; see 0ed9cb1 (create_db.py).