georgetown-cset / unicorn-topics

0 stars 0 forks source link

Code Review 2020-11-02 #12

Closed jmelot closed 3 years ago

jmelot commented 3 years ago

This code is nice! I really like how you've got this organized.

I skipped preprocessor since the beam code came from me and I already looked at the preprocessing code I adapted from you, but glad to see you added some tests. Also skipped obsolete because obsolete.

I don't think there are any bugs - I put checkboxes in front of stuff that might possibly be a bug.

Nice work!!

In this case it actually is! Because I added the spaces myself in the preprocessing step; they're artificial :). This is probably not the best way to have done this (I should maybe have incorporated the bigrams into the preprocessor) but the preprocessor was written for the old version of the code and the bigram library is part of the new version, so. Ah well.

It is, yes, because I named the organizations in a previous query :).

Hmm, I don't think I'm particularly worried about the order! We didn't end up using them (at least in this brief, we still could) but if we did my plan was to split them again, so order wouldn't matter.

That's a good question that I have a less good answer for than I'd like! The basic answer is:

  1. This is the suggested parameter set in gensim, so it's what I started with.
  2. I tried a bunch of other options. Filtering less appeared to increase the incoherence of my topic models (full of extremely common words that meant nothing). Filtering almost any more gave me meaningless topic models (there weren't many words of interest left to build a model with, I guess). The threshold seemed somewhat flexible, though, (.5 was not hugely different from 0.6, for example) so I went with the suggested parameter, since it worked well. I can explain that in a comment, I guess? It's just not a very satisfactory answer.

This wasn't something that could be tested in an automated way as well as the hyperparameters, since it leads to a whole different vocabulary which requires a lot more additional model pre-computation to handle.

This is an interesting question, because it feels like a case where DRY conflicts with making code easily unit-testable! I'm not calling them in the init because it allows me to unit test each of those methods separately, as well as each init method. If you think the alternative is better (or if there's a way I'm missing to still be able to unit test each one separately!) definitely let me know.

Ah, I guess it would be if you didn't already have a data/intermediate directory, which won't happen until after running the model. Good point, will switch from link to code quote.

Yes, because some of the IBM orgs are like IBM (Australia) or something, and I just want them all labeled IBM.

In Dimensions they're all either IBM or IBM (some place). Which this captures. I looked very closely at the GRID values for each of the big 6 labs to make sure I was capturing them fully!

Realistically, I could probably have just said 'select "Facebook"' the way I did for Google, I guess? So I suppose it wasn't necessary; I think I just hadn't decided at that point I didn't want to try to pull the name from the the actual org value. If the question is -- "why not just use the org_name" the answer is "because Facebook is a part of the org_name always but it's not always the exact org_name."

rggelles commented 3 years ago

Code review updates done! Thanks so much.

jmelot commented 3 years ago

Yay! Close it if you feel you're done, then.

Btw I happened to look back at my comment "would it make sense to put calls to these methods in the init of the superclass so you don't have to repeat them everywhere" and realized it wasn't at all clear about the thing I was really concerned about actually. I can see that you might not want to call these methods in init for one reason or another, fine. The thing I was more concerned about (and somehow completely failed at explaining in my comment) is the fact that in all the subclasses you seem to be overriding the super class's method for preprocess_data and get_data, but then just calling the parent method anyway. E.g. this sort of thing:

    def preprocess_data(self):
        """
        Run the preprocessing functions. If the preprocessing functions
        have already been run, load preprocessed data from pickle.
        :return:
        """
        super(TuneHyperParameters, self).preprocess_data()

All you're doing is calling the parent class's method, so why override at all?

No need to address this if you don't want to, I just found it redundant (possible I'm missing something here though).

rggelles commented 3 years ago

Oh, that's totally fair. I can skip overriding them. I think in my head I was preparing for the possibility and never actually did and then never deleted it. Whoops.