andymeneely / chromium-history

Scripts and data related Chromium's history
11 stars 4 forks source link

What are the most interesting queries for the NLP data? #226

Closed andymeneely closed 9 years ago

andymeneely commented 9 years ago

Let's explore the ways we can use this NLP data. Come up with a set of interesting queries and post them here. (Rails console is better). Things to get you started:

Also, take a look at the context of some of these, especially the popular technical terms. For example, we talked about how "decide" is used in context.

By Feb 18

@bspates you might be interested in watching this.

andymeneely commented 9 years ago
Message.joins(:technical_words).limit(500).select(:code_review_id,:word).order(:code_review_id)
andymeneely commented 9 years ago

Joining Messages to Bug labels looks like this:

Message.joins(:technical_words, code_review: {commit: {commit_bugs: {bug: :labels}}}).limit(50).select(:word, :label)
Message.joins(:technical_words, code_review: {commit: {commit_bugs: {bug: :labels}}}).select(:label, 'count(word)').group(:label).limit(50)
bspates commented 9 years ago

@rikal I have added indexes to the the linking tables. I should mention that they won't provide any performance improvement on joins unless the join is made with a subset of the table.

rikal commented 9 years ago

Top technical words in messages:

TechnicalWord.joins(:messages).group("technical_words.id").order("COUNT(technical_words.id) DESC").limit(10).select(:word, "technical_words.id", "COUNT(technical_words.id)")

Technical words associated with too many labels:

TechnicalWord.joins(messages: {code_review: {commit: {commit_bugs: {bug: :labels}}}}).group("technical_words.id").order("COUNT(labels.label) DESC").limit(50).select("technical_words.id",:word,"COUNT(labels.label)")

Top words and associated labels:

ids = TechnicalWord.joins(:messages).group("technical_words.id").order("COUNT(technical_words.id) DESC").limit(10).pluck("technical_words.id")

TechnicalWord.joins(messages: {code_review: {commit: {commit_bugs: {bug: :labels}}}}).where(id: ids).limit(10)
TechnicalWord.joins(messages: {code_review: {commit: {commit_bugs: {bug: :labels}}}}).group("technical_words.id","labels.label").order("COUNT(*) DESC").limit(10).select(:word,"labels.label","COUNT(*)")

Messages with the most technical words :

Message.joins(:technical_words).group(:message_id).order("COUNT(*) DESC").limit(50).select(:message_id, "COUNT(*)")

Messages with the least technical words:

Message.joins(:technical_words).group(:message_id).order("COUNT(*) ASC").limit(50).select(:message_id, "COUNT(*)")

Average tech words per message_id

rikal commented 9 years ago

Identify strong correlation between labels and technical words Identify technical word, message, bug label correlation over time

rikal commented 9 years ago

Avg. tech words per review with no bug association

((CodeReview.joins(messages: :technical_words).where.not(commit_hash: CommitBug.pluck('DISTINCT commit_hash')).pluck(:word) +
CodeReview.joins(patch_sets: {patch_set_files: {comments: :technical_words}}).where.not(commit_hash: CommitBug.pluck('DISTINCT commit_hash')).pluck(:word)).count).to_f/CodeReview.where.not(commit_hash: CommitBug.pluck('DISTINCT commit_hash')).count

Avg. tech words per review associated with bugs

((CodeReview.joins(messages: :technical_words).where(commit_hash: CommitBug.pluck('DISTINCT commit_hash')).pluck(:word) +
CodeReview.joins(patch_sets: {patch_set_files:{comments: :technical_words}}).where(commit_hash: CommitBug.pluck('DISTINCT commit_hash')).pluck(:word)).count).to_f/CodeReview.where(commit_hash: CommitBug.pluck('DISTINCT commit_hash')).count

Tech words only in msgs for reviews with bugs associated

((CodeReview.joins(messages: :technical_words).where(commit_hash: CommitBug.pluck('DISTINCT commit_hash')).pluck('DISTINCT word') + CodeReview.joins(patch_sets: {patch_set_files {comments: :technical_words}}).where(commit_hash: CommitBug.pluck('DISTINCT commit_hash')).pluck(:word)) - (CodeReview.joins(messages: :technical_words).where.not(commit_hash: CommitBug.pluck('DISTINCT commit_hash')).pluck('DISTINCT word') + CodeReview.joins(patch_sets: {patch_set_files {comments: :technical_words}}).where.not(commit_hash: CommitBug.pluck('DISTINCT commit_hash')).pluck(:word)))

Tech words only in msgs for reviews without bugs associated

((CodeReview.joins(messages: :technical_words).where.not(commit_hash: CommitBug.pluck('DISTINCT commit_hash')).pluck('DISTINCT word') + CodeReview.joins(patch_sets: (patch_set_files (comments: :technical_words))).where.not(commit_hash: CommitBug.pluck('DISTINCT commit_hash')).pluck(:word)) - (CodeReview.joins(messages: :technical_words).where(commit_hash: CommitBug.pluck('DISTINCT commit_hash')).pluck('DISTINCT word') + CodeReview.joins(patch_sets: {patch_set_files: {comments: :technical_words}}).where(commit_hash: CommitBug.pluck('DISTINCT commit_hash')).pluck(:word)))
rikal commented 9 years ago

Developers who were a part of a review in a release given release (release range)

rel_range = Release.where(name: '5.0').pluck(:date)..Release.where(name: '11.0').pluck(:date)
ids = Participant.where(review_date: rel_range).pluck(:dev_id) + Reviewer.joins(:code_review).where(code_reviews: {created: rel_range}).pluck(:dev_id)

Unique technical_words by developers in a given release

uniq = Message.joins(:technical_words).where(date: rel_range, sender_id: ids).pluck('distinct word') + PatchSet.joins(patch_set_files: {comments: :technical_words}).where(comments: {date: rel_range, author_id: ids}).pluck('distinct word')

Avg. tech word use per review post in a release

(Message.joins(:technical_words).where(date: rel_range, sender_id: ids).count + PatchSet.joins(patch_set_files: {comments: :technical_words}).where(comments: {date: rel_range, author_id: ids}).count).to_f / (Message.where(date: rel_range, sender_id: ids).count + PatchSet.joins(patch_set_files: :comments).where(comments: {date: rel_range, author_id: ids}).count)

Increased use of words from previous release

new average use - old average use
andymeneely commented 9 years ago

Which ones of these work and which don't?

rikal commented 9 years ago

All the new ones work when I limit them to a few records. I'll add them to build for full results. let me know how they look though.

bspates commented 9 years ago

@andymeneely and @rikal you will have to drop any use of the comments in relation to the technical word table as I have removed that relation due to the data duplication in the messages table.

rikal commented 9 years ago

New words by devs in their vocab in a new release

old_words = Message.joins(:technical_words).where(date: old_range, sender_id: ids).pluck('distinct word') + PatchSet.joins(patch_set_files: {comments: :technical_words}).where(comments: {date: old_range, author_id: ids}).pluck('distinct word')
old_usage = Message.joins(:technical_words).where(date: old_range, sender_id: ids).group(:sender_id, :word).pluck(:sender_id, :word) + PatchSet.joins(patch_set_files: {comments: :technical_words}).where(comments: {date: old_range, author_id: ids}).group(:author_id, :word).pluck(:author_id, :word)
new_usage = Message.joins(:technical_words).where(date: new_range, sender_id: ids, word: old_words).group(:sender_id, :word).pluck(:sender_id, :word) + PatchSet.joins(patch_set_files: {comments: :technical_words}).where(comments: {date: new_range, author_id: ids}, technical_words: {word: old_words}).group(:author_id, :word).pluck(:author_id, :word)
new_usage - old_usage
bspates commented 9 years ago

Stuff: