AptlyOrg / jobs


Greedy Indeed Search #5

Open AdminAtAptly opened 6 years ago

AdminAtAptly commented 6 years ago

We are having an issue with a greedy search against the Indeed API. I suspect this issue applies to all extracts, but it's most glaring in the extracts created by "jenkins-jobs-by-company".

Although the search params include the specific params used by their own internal apps for delivering jobs to their company sites, other companies still land in the results. It looks like they are using wildcards with a full-text search.

Although there may be an easy hack to the API to avoid this condition, I think for now we just need a standalone job that can be appended to each of the Jenkins jobs. It should take a pass against all ./archive/*.json files and ensure that every job has a "company" value that is in aptly-ho.cfg and a "state" value that is in aptly-st.cfg. If a job passes both of these tests, keep it; if not, drop it.

A couple of recurring failures have been found -- "Mercantil Commerce Bank" and "The Commerce Bank" -- and each would fail both of these tests.

Worth noting: all filenames at the completion of the run must end up the same as they are now, since the UI is driven by the specific filename scheme. Also, we need all temp files cleaned up to avoid production cruft.
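As a rough starting point, the standalone pass described above could be sketched like this in Python. Several things here are assumptions rather than facts about this repo: that each ./archive/*.json file holds a top-level JSON array of job objects, that aptly-ho.cfg and aptly-st.cfg list one value per line, and that matching should be case-insensitive. The function names (`load_allowed`, `keep_job`, `filter_archive`) are hypothetical. Each file is rewritten in place through a temp file in the same directory, so the filenames stay the same and no temp files are left behind.

```python
import json
import os
import tempfile
from pathlib import Path


def load_allowed(cfg_path):
    """Read allowed values, one per line (ASSUMED format; adjust to the real cfg layout)."""
    allowed = set()
    for line in Path(cfg_path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            allowed.add(line.casefold())
    return allowed


def keep_job(job, companies, states):
    """A job survives only if both its "company" and its "state" are allowed."""
    company = str(job.get("company", "")).strip().casefold()
    state = str(job.get("state", "")).strip().casefold()
    return company in companies and state in states


def filter_archive(archive_dir, companies, states):
    """Rewrite every *.json file in place and return the number of jobs dropped."""
    removed = 0
    for path in sorted(Path(archive_dir).glob("*.json")):
        # ASSUMPTION: the file is a JSON array of job objects.
        jobs = json.loads(path.read_text(encoding="utf-8"))
        kept = [job for job in jobs if keep_job(job, companies, states)]
        removed += len(jobs) - len(kept)

        # Write to a temp file next to the target, then swap it into place,
        # so the original filename is preserved.
        fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
        try:
            with os.fdopen(fd, "w", encoding="utf-8") as tmp:
                json.dump(kept, tmp)
            os.replace(tmp_name, path)
        finally:
            # Only reached with the temp file still present if something failed.
            if os.path.exists(tmp_name):
                os.remove(tmp_name)
    return removed
```

A Jenkins step could then call something like `filter_archive("./archive", load_allowed("aptly-ho.cfg"), load_allowed("aptly-st.cfg"))` after each extract; the config paths there are placeholders until their real location is confirmed.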

Feel free to commit back to the repo any test artifacts that can be used for future unit testing and I will wire them up to our testing tools, but unit testing is not required here.

AdminAtAptly commented 6 years ago

@Traizen after thinking it over a bit, before you begin any work on this one, or on future projects, you might get in the habit of forking the repo on GitHub first, then keeping your working branches in your own forked repository. This does a few things:

  1. It gives you a safe place to commit your changes often. If you run into a dead end you can just fork from Aptly's dev branch again and head down another path. The main point is that you can (and should) experiment with things and save your work often.
  2. If you save your work every day that you program you will never lose work, but more importantly I will be able to see the work you are doing and offer suggestions along the way.
    I can even commit work to your branch in some cases. All of this can be done without mucking up the project history. Your experiments tend to get really ugly for others on the team to wade through. (Eventually we will discuss how to squash commits so that you send only the most relevant commits back to the repo for everyone to see, but for now I think forking is just cleaner.)
  3. Lastly, by doing things this way each and every commit will be captured on your contribution map, which ultimately we'll turn into your public record of the work that you have done. If done properly it will show a strong track record of your work and is, imo, one of the strongest demonstrations of your abilities and work habits to future employers. It also gives you instant credibility if you decide to work on public projects and you want them to accept your code.

When you get code to a place where you are ready to push it into production, you can go back to your fork on GitHub and open a pull request back to dev, which will basically send a request to me or Joe to merge your changes back into our production workflow. We will either accept and merge your changes in, or make suggestions back to you that are necessary before we can merge.

AdminAtAptly commented 6 years ago

I was spot checking JSON records last night and stumbled across a field on the job record, called "source". I think this might be useful for this issue.

At first blush, it might be the key to distinguishing legitimate records for each of the aptly-ho.cfg entries. All the records I noticed as bad (e.g. "Mercantil Commerce Bank") had a "source" of either "Indeed" or something other than the aptly-ho.cfg entry. If true, that could enable resolution through a simple modification to the Indeed API call from jobs.sh instead of a new script and an additional pass.

Although there is little impact to an additional pass right now, at scale it could become heavy so if a separate pass can be avoided it is probably a good thing.
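If the "source" observation holds up, the check itself is small. The sketch below is only an illustration in Python of the comparison, not a change to jobs.sh: it assumes each record is a dict with a "source" key and that a legitimate record's "source" equals one of the aptly-ho.cfg entries (compared case-insensitively). `filter_by_source` is a hypothetical name. Applied to each batch of API results as it arrives, it would avoid a second pass over the archive.

```python
def filter_by_source(results, companies):
    """Keep only records whose "source" is one of the configured company names.

    ASSUMPTION: legitimate records carry the company's own name in "source",
    while bad ones carry "Indeed" or some other value.
    """
    allowed = {name.strip().casefold() for name in companies}
    return [
        job for job in results
        if str(job.get("source", "")).strip().casefold() in allowed
    ]
```

Records with a missing "source" are dropped here, which may or may not be the right call once more records have been checked.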

Traizen commented 6 years ago

I'm a little bit lost, but I'm getting closer. I just don't understand the config entries. Where can I find the config files aptly-ho.cfg and aptly-st.cfg?