Open tansaku opened 5 years ago
note that I get the blue screen and error when trying to create a new account on staging - could this be related to some connection between the rails and cruncher apps?
on staging an attempt to match job seekers generates the following error:
looking at the code on line 291:
def match_job_seekers
  authorize @job
  Pusher.trigger('pusher_control',  # <-- line 291
                 'spinner_start',
                 user_id: pets_user.user.id,
                 target: '.table.table-bordered')
  # Get job match scores for all job seekers
  result = ResumeCruncher.match_resumes(@job.id)
it looks like this pusher operation is failing before we even get to contacting the cruncher
working with Chet, the issue was occurring when clicking "match all job seekers" on an individual job - it's now getting further along and we are seeing things like this:
Chet says this should return multiple matches that we can click through on - but this "not available" is weird - it does appear to be working for "match your job seekers against job" when individual job seekers are selected
seeing this behaviour locally:
similar to that on production - is this a zero match?
locally we get this back from the cruncher:
{"resultCode"=>"SUCCESS", "message"=>"Success", "stars"=>{"NaiveBayes"=>0.0, "ExpressionCruncher"=>0.0}}
added some debugging:
@tansaku it looks like in your last test you are matching a specific Job Seeker against a specific Job, and the result is that the Job Seeker is a 0 star match for the job
thanks @joaopapereira - so as we discussed it seems like at the moment everyone gets zero star matches as far as we can tell, but as we looked at together you were saying that the problem was with the presence or absence of particular job categories.
We looked at the staging and production crunchers and saw that the settings look okay from:
db.settings.find()
we also looked at the jobs
db.job.find()
and saw data like the following:
{ "_id" : ObjectId("5bfd88c246e0fb000b646d37"), "_class" : "org.metplus.curriculum.database.domain.Job", "title" : "Family / Pediatric Practice Nurse Practitioner ", "jobId" : "3", "description" : "to provide primary health care services, including health promotion, disease prevention, and interdisciplinary collaboration. Seeking licensed, professional expertise in history taking, physical examinations, immunizations, non-invasive diagnostic tests, and pharmacology, within a behavioral health agency. May include home visitation. ", "titleMetaData" : { "_id" : null, "metaData" : { "NaiveBayes" : { "_id" : null, "bestMatchCategory" : "secreatary", "totalProbability" : 0.000552928599063307, "fields" : { "sales manager" : { "data" : 0.000125838938402012 }, "administrative" : { "data" : 0.0000038133011912577786 }, "line cook" : { "data" : 0.00022651006293017417 }, "administrative assistent" : { "data" : 0.0001715985417831689 }, "software developer" : { "data" : 0.0004474273300729692 }, "secreatary" : { "data" : 0.000552928599063307 } }, "_class" : "org.metplus.curriculum.cruncher.naivebayes.NaiveBayesMetaData" }, "ExpressionCruncher" : { "_id" : null, "mostReferedExpression" : "practice", "fields" : { "practice" : { "data" : 1 }, "practitioner" : { "data" : 1 }, "nurse" : { "data" : 1 }, "family" : { "data" : 1 }, "/" : { "data" : 1 }, "pediatric" : { "data" : 1 } }, "_class" : "org.metplus.curriculum.cruncher.expressionCruncher.ExpressionCruncherMetaData" } } }, "descriptionMetaData" : { "_id" : null, "metaData" : { "NaiveBayes" : { "_id" : null, "bestMatchCategory" : "sales manager", "totalProbability" : 7.810418738521035e-19, "fields" : { "sales manager" : { "data" : 7.810418738521035e-19 }, "administrative" : { "data" : 1.3730280890768393e-30 }, "line cook" : { "data" : 5.847479126716874e-28 }, "administrative assistent" : { "data" : 2.2606622988101377e-22 }, "software developer" : { "data" : 4.0588718682211377e-19 }, "secreatary" : { "data" : 1.7454550937515912e-27 
} }, "_class" : "org.metplus.curriculum.cruncher.naivebayes.NaiveBayesMetaData" }, "ExpressionCruncher" : { "_id" : null, "mostReferedExpression" : "health", "fields" : { "taking," : { "data" : 1 }, "visitation" : { "data" : 1 }, "expertise" : { "data" : 1 }, "professional" : { "data" : 1 }, "non-invasive" : { "data" : 1 }, "behavioral" : { "data" : 1 }, "prevention," : { "data" : 1 }, "diagnostic" : { "data" : 1 }, "physical" : { "data" : 1 }, "tests," : { "data" : 1 }, "immunizations," : { "data" : 1 }, "include" : { "data" : 1 }, "services," : { "data" : 1 }, "including" : { "data" : 1 }, "disease" : { "data" : 1 }, "agency" : { "data" : 1 }, "may" : { "data" : 1 }, "in" : { "data" : 1 }, "within" : { "data" : 1 }, "pharmacology," : { "data" : 1 }, "health" : { "data" : 3 }, "licensed," : { "data" : 1 }, "promotion," : { "data" : 1 }, "history" : { "data" : 1 }, "examinations," : { "data" : 1 }, "seeking" : { "data" : 1 }, "interdisciplinary" : { "data" : 1 }, "home" : { "data" : 1 }, "provide" : { "data" : 1 }, "collaboration" : { "data" : 1 }, "to" : { "data" : 1 }, "primary" : { "data" : 1 }, "care" : { "data" : 1 } }, "_class" : "org.metplus.curriculum.cruncher.expressionCruncher.ExpressionCruncherMetaData" } } } }
where the categories of jobs are things like:
"secreatary"
"sales manager"
"administrative"
"line cook"
"software developer"
and that resumes are put into categories - so if there isn't a good category for a resume then it won't get matched at all.
I was just looking at the resumes in the db. There's one that contains the term "pediatric", but it's categorised as follows:
{ "_id" : ObjectId("5beedae546e0fb000b646d1a"), "_class" : "org.metplus.curriculum.database.domain.Resume", "filename" : "Nursing resume NB.docx", "fileType" : "docx", "userId" : "15", "metaData" : { "NaiveBayes" : { "_id" : null, "bestMatchCategory" : "administrative assistent", "totalProbability" : 0, "fields" : { }, "_class" : "org.metplus.curriculum.cruncher.naivebayes.NaiveBayesMetaData" }, "ExpressionCruncher" : { "_id" : null, "mostReferedExpression" : "·", "fields" : { "epic," : { "
i.e. an administrative assistent with zero probability so I guess it would never get any more than a zero match ...
would it be a good idea to have a back up matching that just matched common words rather than relying on categories?
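To make the backup idea concrete, here is a minimal sketch of a word-overlap score, assuming plain-text input for both the resume and the job. It is not how the cruncher works today; the ignore list, function names and sample resume text are all hypothetical.

```python
import re

# Hypothetical ignore list; the real cruncher keeps its own IgnoreList in its settings.
IGNORE_WORDS = {"a", "an", "and", "or", "of", "to", "in", "the", "with"}

def tokenize(text):
    """Lowercase the text and return its set of words, minus ignored words."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in IGNORE_WORDS}

def overlap_score(resume_text, job_text):
    """Jaccard similarity of the two word sets: shared words / all words, in [0, 1]."""
    resume_words = tokenize(resume_text)
    job_words = tokenize(job_text)
    union = resume_words | job_words
    if not union:
        return 0.0  # nothing to compare
    return len(resume_words & job_words) / len(union)

# Made-up resume text, compared against the job title quoted earlier in this thread.
resume = "Pediatric nurse with experience in immunizations"
job = "Family / Pediatric Practice Nurse Practitioner"
score = overlap_score(resume, job)
print(score)  # 2 shared words ("pediatric", "nurse") out of 7 distinct words
```

A score like this ignores categories entirely, so a nursing resume and a nursing job would get a non-zero result even when neither fits a trained category.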
From what we just talked about on the hangout I believe 2 problems can be happening:
How to check option 1:
db.settings.find({})
This will retrieve all the settings; look for cruncherSettings['NaiveBayes'] to see if it has data.
How to check option 2:
db.job.find({"jobId": "XXXX"})
with XXXX being the job identifier. Check titleMetaData['NaiveBayes'] and descriptionMetaData['NaiveBayes'] to make sure they have data in them.
db.resume.find({"userId": "XXXX"})
with XXXX being the user identifier. Check metaData['NaiveBayes'] to make sure it has data in it.
How to check option 3:
Possibly the categories that make sense for the current Job or Resume do not exist as valid categories in the cruncher. At this point in time the categories that we have are:
| Category name |
|---|
| software developer |
| line cook |
| secretary |
| sales manager |
| administrative assistant |
| administrative |
Note: The fact that a resume or job does not match a category 100% does not mean it has no probability of belonging to one of the categories. But let's imagine this case: a Nurse resume might have a higher probability of being administrative than of being a line cook, but the reverse is also possible. Meanwhile a job description for a Nurse might have a higher probability of being matched with Sales Manager. In that case the probability of the Resume matching the Job is much smaller, or even 0, depending on the probabilities.
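The Nurse case in the note can be illustrated with a toy calculation. This is only a sketch under loud assumptions: the agreement formula (sum over categories of resume probability times job probability) is invented for illustration and is not the cruncher's actual scoring, and all the numbers are made up.

```python
def normalize(scores):
    """Scale category scores so they sum to 1; all-zero input stays all zero."""
    total = sum(scores.values())
    if total == 0:
        return {category: 0.0 for category in scores}
    return {category: value / total for category, value in scores.items()}

def category_agreement(resume_scores, job_scores):
    """Hypothetical agreement measure: sum of p_resume(c) * p_job(c) over categories."""
    resume_p = normalize(resume_scores)
    job_p = normalize(job_scores)
    return sum(p * job_p.get(category, 0.0) for category, p in resume_p.items())

# Made-up numbers: the nurse resume leans "administrative", the nurse job leans "sales manager".
nurse_resume = {"administrative": 0.7, "line cook": 0.2, "sales manager": 0.1}
nurse_job = {"administrative": 0.1, "line cook": 0.1, "sales manager": 0.8}
agreement = category_agreement(nurse_resume, nurse_job)

# A resume with no category data at all, like one with empty "fields" and totalProbability 0.
empty_resume = {"administrative": 0.0, "line cook": 0.0, "sales manager": 0.0}
no_agreement = category_agreement(empty_resume, nurse_job)

print(agreement, no_agreement)
```

Even though both documents are about nursing, they land in different categories, so the agreement is low; a resume with no category data cannot score above zero under any measure of this kind.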
Solve problem 1:
db.settings.remove({})
Solve problem 2:
/api/v2/job/reindex
GET request to /api/v2/resume/reindex
Solve problem 3:
NOTE: The fact that one word or another exists in both the Resume and the Job information does not mean that they will even be a 1 ⭐️ match. This is all a statistical analysis based on the information that we first provided to the cruncher (Brain).
thanks @joaopapereira - that's very helpful
on staging and production we can see all the brain information.
I can't see it locally, and the technique you describe (option 1) is not causing it to be pulled in when we restart the app after deleting the settings. @sherspock and I were examining the code in NaiveBayesCruncher.java:
private void load() throws CruncherSettingsNotFound {
    LOG.info("Loading settings");
    cruncherImpl.resetMemory();
    try {
        CruncherSettings settings;
        try {
            LOG.info("Get settings, I mean really!!!!");
            settings = repository.findAll().iterator().next().getCruncherSettings(CruncherImpl.CRUNCHER_NAME);
            LOG.info("Got settings");
        } catch(NoSuchElementException e) {
            LOG.warn("Could not find cruncher");
            settings = new CruncherSettings(CruncherImpl.CRUNCHER_NAME);
            Settings globalSettings = repository.findAll().iterator().next();
            settings.addSetting(new Setting<>(LEARN_DATABASE, learnDatabase));
            settings.addSetting(new Setting<>(CLEAN_EXPRESSIONS, cleanExpressions));
            globalSettings.addCruncherSettings(CruncherImpl.CRUNCHER_NAME, settings);
            repository.save(globalSettings);
            LOG.info("saved global settings");
        }
        LOG.info("Database settings: " + settings);
        LOG.info("Local settings learn database: " + learnDatabase);
        LOG.info("Local settings clean expressions: " + cleanExpressions);
and trying to work out how to force a reload of all the resume data, and came up with this (although it's not quite doing what we want as it deletes the entire cruncherSettings element):
> db.settings.update({ _id: ObjectId("5c2f64bbb5615fc80189651f") }, { $unset : { "cruncherSettings" : { "NaiveBayes" : 1} }})
> db.settings.find()
{ "_id" : ObjectId("5c2f64bbb5615fc80189651f"), "_class" : "settings", "CRUNCHER_SETTINGS_NAME" : "CRUNCHER_SETTINGS_NAME", "appSettings" : { "_id" : null, "settings" : { "test" : { "name" : "test", "data" : "haha" } }, "mandatory" : [ "test" ] } }
then when restarting we saw all the resume data being pulled in on the console:
2019-01-04 14:02:31.713 INFO 51716 --- [ main] o.m.c.c.naivebayes.NaiveBayesCruncher : Get settings, I mean really!!!!
2019-01-04 14:02:31.726 INFO 51716 --- [ main] o.m.c.c.naivebayes.NaiveBayesCruncher : Got settings
2019-01-04 14:02:31.726 INFO 51716 --- [ main] o.m.c.c.naivebayes.NaiveBayesCruncher : Database settings: SettingsList: {settings: {LearnDatabase: org.metplus.curriculum.database.domain.Setting@7a4d582c,Name: org.metplus.curriculum.database.domain.Setting@5626d18c,}, mandatory: [Name,]
2019-01-04 14:02:31.728 INFO 51716 --- [ main] o.m.c.c.naivebayes.NaiveBayesCruncher : Local settings learn database: {software developer=[Sr. Angular UI Developer Developer : Experience with streaming aps, experience with trading applications very helpful.Description:The Active Trader Client Applications team is responsible for the Active Trader StreetSmart family of products. As part of our continuous investment in the StreetSmart pla
but we still don't see it in the mongodb when we run db.settings.find, which persists in displaying the following:
[tansaku@Samuels-MBP:~/Documents/Github/AgileVentures/resumeCruncher (master)]$
→ mongo
MongoDB shell version: 3.2.9
connecting to: test
> use resumeCruncher
switched to db resumeCruncher
> db.settings.find()
{ "_id" : ObjectId("5c2f64bbb5615fc80189651f"), "_class" : "settings", "CRUNCHER_SETTINGS_NAME" : "CRUNCHER_SETTINGS_NAME", "cruncherSettings" : { "bamm" : { "_id" : null, "NAME_SETTING" : "Name", "settings" : { "Name" : { "name" : "Name", "data" : "New cruncher" } }, "mandatory" : [ "Name" ] }, "NaiveBayes" : { "_id" : null, "NAME_SETTING" : "Name", "settings" : { "LearnDatabase" : { "name" : "LearnDatabase" }, "Name" : { "name" : "Name", "data" : "NaiveBayes" } }, "mandatory" : [ "Name" ] }, "ExpressionCruncher" : { "_id" : null, "NAME_SETTING" : "Name", "settings" : { "CaseSensitive" : { "name" : "CaseSensitive", "data" : false }, "IgnoreList" : { "name" : "IgnoreList", "data" : [ "a", "or", "and", "then", "must", "least", "i", "am", "of", "but", "our", "mine", "very", "worked", "decided", "each", "an", "as", "at", "on" ] }, "IgnoreListWordSearch" : { "name" : "IgnoreListWordSearch", "data" : true }, "Name" : { "name" : "Name", "data" : "ExpressionCruncher" }, "MergeList" : { "name" : "MergeList", "data" : { "cook" : [ "cook", "line cook" ], "software@@@@@development" : [ "software development", "software development lifecycle" ] } } }, "mandatory" : [ "Name" ] } }, "appSettings" : { "_id" : null, "settings" : { "test" : { "name" : "test", "data" : "haha" } }, "mandatory" : [ "test" ] } }
so we're unclear whether our local development system is in a state where we can accurately diagnose the bug.
In production at least we think the resume and job have been crunched in that they appear in the mongodb.
If we want to add additional categories how do we do that?
just updating, that if we want to precisely remove the naiveBayes element from the settings we can do that with:
db.settings.update({ _id: ObjectId("5c2f64bbb5615fc80189651f") }, { $unset : { "cruncherSettings.NaiveBayes" : 1} })
however re-starting the cruncher after doing this did not lead to any resume being shown in the main output log, and also nothing in the mongodb, which was back to having NaiveBayes like this:
"NaiveBayes" : {
    "_id" : null,
    "NAME_SETTING" : "Name",
    "settings" : {
        "LearnDatabase" : {
            "name" : "LearnDatabase"
        },
        "Name" : {
            "name" : "Name",
            "data" : "NaiveBayes"
        }
    },
    "mandatory" : [
        "Name"
    ]
},
although restarting again they were shown being loaded in on the main console output, but again, nothing in the mongodb itself ...
@tansaku that is strange... There was a bug, solved maybe 4 weeks ago, where information was not being stored in or read from the mongo database. Do you have the latest commit? 219cc009a
Nevertheless the behavior is strange. I prefer to just remove the full settings because it ensures a clean slate, and the rest of the info has no way to be changed for now.
To add new categories we need to have a batch of Resumes and Jobs that match a specific category and then add them to https://github.com/AgileVentures/MetPlus_resumeCruncher/blob/development/app/src/main/resources/application.yml#L45
Each file needs to be converted into a single string, with every newline converted into \n
the yaml file looks like this:
naive-bayes:
  learn-database:
    "new category that we want to add":
      - "This is the first resume \n as \n a string"
      - "This is a job description that we have\n for this new category"
    .....
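The conversion step (file contents into one string with \n for each newline) could be scripted roughly like this. It is a sketch that assumes the resume or job text has already been extracted to plain text, for example from a .docx; the function name is hypothetical.

```python
import json

def to_yaml_entry(text):
    """Return one YAML list item holding the text as a single-line double-quoted string."""
    # Normalise Windows and old Mac line endings so every line break becomes "\n".
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    # json.dumps escapes newlines, quotes and backslashes; for ordinary text the
    # resulting JSON string is also a valid YAML double-quoted scalar.
    return "- " + json.dumps(normalized, ensure_ascii=False)

entry = to_yaml_entry('This is the first resume\nas\na "string"')
print(entry)
```

The printed line can be pasted under the category key in the learn-database section. Any private details would still need to be redacted before the text goes into a public repository.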
@joaopapereira understood - but does that mean adding people's potentially private resumes to a public git repository?
and I have updated to the latest cruncher and the correct data is now showing up in the mongodb locally ...
What would be great would be if the seed data had at least one user with resume that matched at least one job so we could see the possibility of it working locally ... maybe one of the existing jobs does match tom seeker?
@tansaku the ones we have there were picked up from examples on the web and some heavily redacted ones that Chet made available
right @joaopapereira but for the new pediatric nurse ones in the production system they are real ones that chet is uploading, so if we wanted to use those we'd have to get him to redact, or approve our redacted versions of them, if we wanted to add them into application.yml, no?
yes @tansaku. Eventually when we are in a world where we are more stable we can feed this information into the database and no longer use the application.yml