Code Review 2020-11-02 - Githubissues

jmelot commented 3 years ago

This code is nice! I really like how you've got this organized.

I skipped preprocessor since the beam code came from me and I already looked at the preprocessing code I adapted from you, but glad to see you added some tests. Also skipped obsolete because obsolete.

I don't think there are any bugs - I put checkboxes in front of stuff that might possibly be a bug.

Nice work!!

[x] https://github.com/georgetown-cset/unicorn-topics/blob/master/generic_topic_model.py#L13 - is it safe to split on only spaces and not other whitespace (e.g. \n)

In this case it actually is! Because I added the spaces myself in the preprocessing step; they're artificial :). This is probably not the best way to have done this (I should maybe have incorporated the bigrams into the preprocessor) but the preprocessor was written for the old version of the code and the bigram library is part of the new version, so. Ah well.

[x] https://github.com/georgetown-cset/unicorn-topics/blob/master/generic_topic_model.py#L54 - safe to keep this in the tmp dataset?
[x] https://github.com/georgetown-cset/unicorn-topics/blob/master/sql/creating_top_organizations.sql#L36 - safe to assume normal capitalization vs matching against lowercase?

It is, yes, because I named the organizations in a previous query :).

[x] https://github.com/georgetown-cset/unicorn-topics/blob/master/sql/creating_coauthors_publication_table.sql#L13 - worth ordering these before concatenating, since you're selecting distinct? Documentation says ordering may be nondeterministic if not specified

Hmm, I don't think I'm particularly worried about the order! We didn't end up using them (at least in this brief, we still could) but if we did my plan was to split them again, so order wouldn't matter.

[x] https://github.com/georgetown-cset/unicorn-topics/blob/master/topic_model_gensim.py#L54 - why these params?

That's a good question that I have a less good answer for than I'd like! The basic answer is:

This is the suggested parameter set in gensim, so it's what I started with.
I tried a bunch of other options. Filtering less appeared to increase the incoherence of my topic models (full of extremely common words that meant nothing). Filtering almost any more gave me meaningless topic models (there weren't many words of interest left to build a model with, I guess). The threshold seemed somewhat flexible, though, (.5 was not hugely different from 0.6, for example) so I went with the suggested parameter, since it worked well. I can explain that in a comment, I guess? It's just not a very satisfactory answer.

This wasn't something that could be tested in an automated way as well as the hyperparameters, since it leads to a whole different vocabulary which requires a lot more additional model pre-computation to handle.

[x] https://github.com/georgetown-cset/unicorn-topics/blob/master/generic_topic_model.py#L35 - you reference these paths in several places, including in a different file here: https://github.com/georgetown-cset/unicorn-topics/blob/master/topic_model_gensim.py#L66 it would be nice to put the parent directory in a variable
[x] https://github.com/georgetown-cset/unicorn-topics/blob/master/topic_model_gensim.py#L102 - comment explaining what's going on here would be nice
[x] https://github.com/georgetown-cset/unicorn-topics/blob/master/topic_model_gensim.py#L103 - I can see that's what you're doing, but why are you doing it? :)
[x] https://github.com/georgetown-cset/unicorn-topics/blob/master/topic_model_gensim.py#L105 - I am confused about this code. If you're only doing something with the first element of the list, why are you enumerating?
https://github.com/georgetown-cset/unicorn-topics/blob/master/topic_model_gensim.py#L142 - good comment
[x] https://github.com/georgetown-cset/unicorn-topics/blob/master/tune_passes.py#L58 - might be nice to have a comment about how you chose these values to test (why these ranges and not another, etc - possibly this would be obvious to any potential users of this code, though)
[x] https://github.com/georgetown-cset/unicorn-topics/blob/master/tune_hyperparameters.py#L27 - would it make sense to put calls to these methods in the init of the superclass so you don't have to repeat them everywhere?

This is an interesting question, because it feels like a case where DRY conflicts with making code easily unit-testable! I'm not calling them in the init because it allows me to unit test each of those methods separately, as well as each init method. If you think the alternative is better (or if there's a way I'm missing to still be able to unit test each one separately!) definitely let me know.

[x] https://github.com/georgetown-cset/unicorn-topics/README.md - (3) https://github.com/georgetown-cset/unicorn-topics/blob/master/plot_topics_by_year.py is a broken link, you meant plot_topic_by_year.py
[x] https://github.com/georgetown-cset/unicorn-topics/blob/master/technical_documentation.md data/intermediate under calculate_perplexity.py is a link that goes nowhere

Ah, I guess it would be if you didn't already have a data/intermediate directory, which won't happen until after running the model. Good point, will switch from link to code quote.

[x] https://github.com/georgetown-cset/unicorn-topics/blob/master/sql/creating_top_organizations.sql#L42 - substr needed?

Yes, because some of the IBM orgs are like IBM (Australia) or something, and I just want them all labeled IBM.

[x] https://github.com/georgetown-cset/unicorn-topics/blob/master/sql/creating_top_organizations.sql#L48 - only appeared as ibm?

In Dimensions they're all either IBM or IBM (some place). Which this captures. I looked very closely at the GRID values for each of the big 6 labs to make sure I was capturing them fully!

[x] https://github.com/georgetown-cset/unicorn-topics/blob/master/sql/creating_top_organizations.sql#L66 - why use regexp_extract?

Realistically, I could probably have just said 'select "Facebook"' the way I did for Google, I guess? So I suppose it wasn't necessary; I think I just hadn't decided at that point I didn't want to try to pull the name from the the actual org value. If the question is -- "why not just use the org_name" the answer is "because Facebook is a part of the org_name always but it's not always the exact org_name."

[x] https://github.com/georgetown-cset/unicorn-topics/blob/master/technical_documentation.md#building-dataset - would be good to say where each table was written out to

rggelles commented 3 years ago

Code review updates done! Thanks so much.

jmelot commented 3 years ago

Yay! Close it if you feel you're done, then.

Btw I happened to look back at my comment "would it make sense to put calls to these methods in the init of the superclass so you don't have to repeat them everywhere" and realized it wasn't at all clear about the thing I was really concerned about actually. I can see that you might not want to call these methods in init for one reason or another, fine. The thing I was more concerned about (and somehow completely failed at explaining in my comment) is the fact that in all the subclasses you seem to be overriding the super class's method for preprocess_data and get_data, but then just calling the parent method anyway. E.g. this sort of thing:

    def preprocess_data(self):
        """
        Run the preprocessing functions. If the preprocessing functions
        have already been run, load preprocessed data from pickle.
        :return:
        """
        super(TuneHyperParameters, self).preprocess_data()

All you're doing is calling the parent class's method, so why override at all?

No need to address this if you don't want to, I just found it redundant (possible I'm missing something here though).

rggelles commented 3 years ago

Oh, that's totally fair. I can skip overriding them. I think in my head I was preparing for the possibility and never actually did and then never deleted it. Whoops.

georgetown-cset / unicorn-topics

Code Review 2020-11-02 #12