allenai / deep_qa

A deep NLP library, based on Keras / tf, focused on question answering (but useful for other NLP too)
Apache License 2.0

quality of life fixes for data functions #254

Closed nelson-liu closed 7 years ago

nelson-liu commented 7 years ago

Since I have nothing better to do on a Sunday night, I figured I might as well look into why it takes so long to process data before we can train (inspired by a conversation with Mark on Friday afternoon).

To start off, I refactored the pad_sequence_to_length function and made it faster by using list.extend instead of repeated list.append. While the amortized runtime of list.append is O(1), you lose a lot of efficiency by hopping back and forth between the Python and C levels on every call. list.extend, on the other hand, does all of the iteration in C, which makes it a lot faster.
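
To make the difference concrete, here's a toy illustration of the two idioms (illustrative code, not the library's actual implementation):

desired_length = 700  # the padded length used in the benchmarks below

# Repeated append: one Python-level method call (and one hop across the
# Python/C boundary) per element.
padded = []
for _ in range(desired_length):
    padded.append(0)

# A single extend over a pre-built list: the per-element work happens in C.
padded = []
padded.extend([0] * desired_length)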

I'll also look into refactoring how we make word + character paddings, because that's also torturously slow right now.

Benchmarks will come in a separate comment.

nelson-liu commented 7 years ago
In [1]: from typing import Any, Callable, Dict, List

In [2]:     def old_pad_sequence_to_length(sequence: List,
   ...:                                desired_length: int,
   ...:                                default_value: Callable[[], Any]=lambda: 0,
   ...:                                truncate_from_right: bool=True) -> List:
   ...:
   ...:         padded_sequence = []
   ...:         for _ in range(desired_length):
   ...:             padded_sequence.append(default_value())
   ...:         sequence_length = min(len(sequence), desired_length)
   ...:         if sequence_length != 0:
   ...:             if truncate_from_right:
   ...:                 padded_sequence[-sequence_length:] = sequence[-sequence_length:]
   ...:             else:
   ...:                 padded_sequence[:sequence_length] = sequence[:sequence_length]
   ...:         return padded_sequence
   ...:

In [3]:     def new_pad_sequence_to_length(sequence: List,
   ...:                                    desired_length: int,
   ...:                                    default_value: Callable[[], Any]=lambda: 0,
   ...:                                    truncate_from_right: bool=True) -> List:
   ...:         if truncate_from_right:
   ...:             truncated_or_extended = sequence[:desired_length]
   ...:         else:
   ...:             truncated_or_extended = sequence[-desired_length:]
   ...:         if len(truncated_or_extended) < desired_length:
   ...:             truncated_or_extended.extend([default_value()] *
   ...:                                          (desired_length -
   ...:                                           len(truncated_or_extended)))
   ...:         return truncated_or_extended
   ...:

In [4]: %timeit old_pad_sequence_to_length([0]*50, 700)
1000 loops, best of 3: 239 µs per loop

In [5]: %timeit new_pad_sequence_to_length([0]*50, 700)
100000 loops, best of 3: 7.66 µs per loop

In [6]: %timeit old_pad_sequence_to_length([0]*300, 700)
1000 loops, best of 3: 241 µs per loop

In [7]: %timeit new_pad_sequence_to_length([0]*300, 700)
100000 loops, best of 3: 8.88 µs per loop

So, it looks like it's ~30 times faster to do it this way... I ran this on a p2.xlarge instance, since that's what we're running the models on anyway.

nelson-liu commented 7 years ago

Hmm, perhaps the method doesn't work the way I thought it did. It's really odd to me that setting the truncation mode also determines which end the padding is added to. Is this intended? (This is the current behavior):

In [4]: pad_sequence_to_length([1,2,3,4,5], 2)
Out[4]: [4, 5]

In [5]: pad_sequence_to_length([1,2,3,4,5], 8)
Out[5]: [0, 0, 0, 1, 2, 3, 4, 5]

In [6]: pad_sequence_to_length([1,2,3,4,5], 8, truncate_from_right=False)
Out[6]: [1, 2, 3, 4, 5, 0, 0, 0]

In [7]: pad_sequence_to_length([1,2,3,4,5], 2, truncate_from_right=False)
Out[7]: [1, 2]

I think it would make a lot more sense to always pad on the left or always pad on the right, and you could probably make the choice a constructor param. Thoughts? If there's a deliberate reason behind the current coupling, I'd be interested in hearing it.

In addition, I would have thought that truncate_from_right=True would mean that elements get chopped off starting from the right end of the array, when in fact the opposite is true...

Regardless, I've modified this PR to keep the current behavior (a sketch of what such a version could look like is below). Running the benchmarks again on the new code yields results that are almost the same as before.
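
For reference, a minimal sketch of a fast version that preserves the current semantics (padding side tied to truncate_from_right, matching the Out[4]-Out[7] examples above); this is an illustration, not necessarily the exact code in the PR:

from typing import Any, Callable, List

def pad_sequence_to_length(sequence: List,
                           desired_length: int,
                           default_value: Callable[[], Any] = lambda: 0,
                           truncate_from_right: bool = True) -> List:
    if truncate_from_right:
        # Keep the tail of the sequence and pad on the left.
        if len(sequence) > desired_length:
            kept = sequence[len(sequence) - desired_length:]
        else:
            kept = list(sequence)
        return [default_value()] * (desired_length - len(kept)) + kept
    # Keep the head of the sequence and pad on the right.
    kept = sequence[:desired_length]
    return kept + [default_value()] * (desired_length - len(kept))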

nelson-liu commented 7 years ago

Also, is there a reason why we wrote our own tokenization / string-processing functions? They're also a major bottleneck... why not use something like NLTK?

nelson-liu commented 7 years ago

Oh interesting, I just saw this: https://github.com/allenai/deep_qa/blob/master/deep_qa/data/tokenizers/word_splitter.py#L20

"Does really simple tokenization. NLTK was too slow, so we wrote our own simple tokenizer instead."

nelson-liu commented 7 years ago

Looks like spaCy blows them both out of the water (this is on a 569-token string). Perhaps it would be worth trying out a spaCy tokenizer? It should be fairly trivial to implement; something like the sketch after the benchmarks below.

In [15]: %timeit [token for token in en_nlp.tokenizer(lorem.lower())]
1000 loops, best of 3: 522 µs per loop

In [16]: %timeit SimpleWordSplitter().split_words(lorem.lower())
100 loops, best of 3: 5.63 ms per loop

In [17]: %timeit NltkWordSplitter().split_words(lorem.lower())
100 loops, best of 3: 12.7 ms per loop
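
Something like this could work (a rough sketch; the class name and the use of spacy.blank are illustrative, mirroring the split_words interface of the existing splitters, not the code in this PR):

import spacy
from typing import List

class SpacyWordSplitter:
    """Tokenizes with spaCy's rule-based tokenizer."""
    def __init__(self):
        # A blank English pipeline gives us just the tokenizer, which is all we need here.
        self.nlp = spacy.blank("en")

    def split_words(self, sentence: str) -> List[str]:
        return [token.text for token in self.nlp.tokenizer(sentence)]
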
matt-gardner commented 7 years ago

It's been on my TODO for a long time to look into this. Thanks for doing it. I haven't actually looked at the code yet, because it doesn't look like you want a review yet; just responding to your PR comments.

It's actually pretty important to keep the option to pad from either side. Sometimes you expect the most important content to come at the end of a sequence, so if you have to shorten it, you should throw away the beginning (this is true of question text). Other times you want to throw out the end (this is true of news articles in WDW).

And, yeah, feel free to add the spaCy tokenizer; looks like it's a faster, better option than what I wrote. I also thought I did a little better than just 2x NLTK, but I guess not... =)

matt-gardner commented 7 years ago

I also think the location of the 0s (padding on the left vs. the right) might affect Keras' LSTM, though I'm not sure about that; Mark might know more there.

nelson-liu commented 7 years ago

This is ready for a review. SpaCy's tokenizer is a lot faster, but it handles lowercased text differently: tokenizing Mr. John Smith gives [Mr., John, Smith] as expected, but tokenizing mr. john smith gives [mr, ., john, smith]. This means the max sequence lengths can end up quite a bit larger than with the other tokenizers. I mentioned this on their repo, hoping to get a response. I think it's a worthy addition anyway, but I didn't set it as the default; if it behaved the same as NLTK or something, I would.

matt-gardner commented 7 years ago

What if you tokenize, then lower? Does it work alright?

nelson-liu commented 7 years ago

Yeah, that would work, but I'm not sure whether we'd lose the speed benefit (which is the reason we're using it in the first place) by going back through and lowercasing everything afterwards. I suppose I'll run some benchmarks.

nelson-liu commented 7 years ago

Surprisingly, it turns out to be almost the same speed, so I suppose I'll do that (something along the lines of the sketch below).
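
A rough sketch of the tokenize-then-lowercase idea (illustrative names, not the actual PR code):

import spacy

nlp = spacy.blank("en")  # tokenizer-only pipeline, as in the sketch above

def tokenize_then_lowercase(sentence: str):
    # Tokenize the original casing so spaCy sees "Mr." etc., then lowercase
    # each token afterwards instead of lowercasing the whole string up front.
    return [token.text.lower() for token in nlp.tokenizer(sentence)]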

nelson-liu commented 7 years ago

On WDW, the max passage length is 3090 with NLTK / SimpleWordTokenizer and 3539 with spaCy. Hmm...

matt-gardner commented 7 years ago

You might try actually inspecting that passage; it could be some rare case that we don't really care about anyway, and we should just be truncating.