Closed nelson-liu closed 7 years ago
In [1]: from typing import Any, Callable, Dict, List
In [2]: def old_pad_sequence_to_length(sequence: List,
...: desired_length: int,
...: default_value: Callable[[], Any]=lambda: 0,
...: truncate_from_right: bool=True) -> List:
...:
...: padded_sequence = []
...: for _ in range(desired_length):
...: padded_sequence.append(default_value())
...: sequence_length = min(len(sequence), desired_length)
...: if sequence_length != 0:
...: if truncate_from_right:
...: padded_sequence[-sequence_length:] = sequence[-sequence_length:]
...: else:
...: padded_sequence[:sequence_length] = sequence[:sequence_length]
...: return padded_sequence
...:
In [3]: def new_pad_sequence_to_length(sequence: List,
...: desired_length: int,
...: default_value: Callable[[], Any]=lambda: 0,
...: truncate_from_right: bool=True) -> List:
...: if truncate_from_right:
...: truncated_or_extended = sequence[:desired_length]
...: else:
...: truncated_or_extended = sequence[-desired_length:]
...: if len(truncated_or_extended) < desired_length:
...: truncated_or_extended.extend([default_value()] *
...: (desired_length -
...: len(truncated_or_extended)))
...: return truncated_or_extended
...:
In [4]: %timeit old_pad_sequence_to_length([0]*50, 700)
1000 loops, best of 3: 239 µs per loop
In [5]: %timeit new_pad_sequence_to_length([0]*50, 700)
100000 loops, best of 3: 7.66 µs per loop
In [6]: %timeit old_pad_sequence_to_length([0]*300, 700)
1000 loops, best of 3: 241 µs per loop
In [7]: %timeit new_pad_sequence_to_length([0]*300, 700)
100000 loops, best of 3: 8.88 µs per loop
so, looks like it's ~30 times faster to do it this way. I ran this on a p2.xlarge instance, since that's what we're running the models on anyway.
hmm, perhaps the method doesn't work as I thought it did. It's really odd to me that setting the truncation mode also determines which end the padding is added to --- is this intended? (this is the current behavior):
In [4]: pad_sequence_to_length([1,2,3,4,5], 2)
Out[4]: [4, 5]
In [5]: pad_sequence_to_length([1,2,3,4,5], 8)
Out[5]: [0, 0, 0, 1, 2, 3, 4, 5]
In [6]: pad_sequence_to_length([1,2,3,4,5], 8, truncate_from_right=False)
Out[6]: [1, 2, 3, 4, 5, 0, 0, 0]
In [7]: pad_sequence_to_length([1,2,3,4,5], 2, truncate_from_right=False)
Out[7]: [1, 2]
I think it would make a lot more sense to always pad on one side (left or right), and you could probably make this a constructor param. Thoughts? If there's a deliberate reason behind this, I'd be interested in hearing it.
In addition, I would have thought that `truncate_from_right=True` meant that you start chopping off the array from the right, when in fact the opposite is true...
Regardless, I've modified this PR to keep the current behavior. Running the benchmarks again on the new code yields results that are almost identical to those above.
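For reference, a minimal sketch of a version that keeps the current semantics (last elements kept and left-padding when `truncate_from_right=True`) while still doing the bulk work with slices and a single list concatenation — the function name here is hypothetical, not the PR's actual code:

```python
from typing import Any, Callable, List


def fast_pad_sequence_to_length(sequence: List,
                                desired_length: int,
                                default_value: Callable[[], Any] = lambda: 0,
                                truncate_from_right: bool = True) -> List:
    # Sketch only. Matches the current behavior shown above:
    # truncate_from_right=True keeps the *last* elements and pads on the left;
    # False keeps the *first* elements and pads on the right.
    if truncate_from_right:
        # Guard desired_length == 0: sequence[-0:] would return the whole list.
        kept = sequence[-desired_length:] if desired_length > 0 else []
        return [default_value() for _ in range(desired_length - len(kept))] + kept
    kept = sequence[:desired_length]
    return kept + [default_value() for _ in range(desired_length - len(kept))]
```

This reproduces all four outputs from the session above while avoiding the per-element `append` loop.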
also, is there a reason why we wrote our own tokenization / string processing functions? Those are also a major bottleneck...why not use something like NLTK?
oh interesting, just saw this https://github.com/allenai/deep_qa/blob/master/deep_qa/data/tokenizers/word_splitter.py#L20
Does really simple tokenization. NLTK was too slow, so we wrote our own simple tokenizer instead.
looks like spaCy blows them both out of the water (this is on a 569-token string). Perhaps it would be worth trying out a spaCy tokenizer? Should be fairly trivial to implement.
In [15]: %timeit [token for token in en_nlp.tokenizer(lorem.lower())]
1000 loops, best of 3: 522 µs per loop
In [16]: %timeit SimpleWordSplitter().split_words(lorem.lower())
100 loops, best of 3: 5.63 ms per loop
In [17]: %timeit NltkWordSplitter().split_words(lorem.lower())
100 loops, best of 3: 12.7 ms per loop
It's been on my TODO for a long time to look into this. Thanks for doing it. I haven't actually looked at the code yet, because it doesn't look like you want a review yet; just responding to your PR comments.
It's actually pretty important to keep the option to pad from either side. Sometimes you expect the most important content to come at the end of a sequence, so if you have to shorten it, you should throw away the beginning (this is true of question text). Other times you want to throw out the end (this is true of news articles in WDW).
And, yeah, feel free to add the spacy tokenizer; looks like it's a faster, better option than what I wrote. I also thought I did a little better than just 2x NLTK, but I guess not... =)
I also think the location of the 0's might impact Keras' LSTM, though I'm not sure about that - Mark might know something more there.
This is ready for a review. spaCy's tokenizer is a lot faster, but it's also very conservative. E.g. tokenizing `Mr. John Smith` gives `[Mr., John, Smith]` as expected, but tokenizing `mr. john smith` gives `[mr, ., john, smith]`. This means that the max lengths can be quite a bit larger than usual. I mentioned this on their repo, hoping to get a response. I think it's a worthy addition anyway, but I didn't set it as the default; if it had the same behavior as NLTK or something, I would.
What if you tokenize, then lower? Does it work alright?
yeah, that would work, but I'm not sure we'd keep the speed benefit (which is the reason we're using it in the first place) if we go through and lower everything again. I suppose I'll run some benchmarks.
surprisingly, it turns out to be almost the same. I suppose I'll do that.
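The tokenize-then-lower variant would look something like this (sketch only, assuming spaCy is installed; this is not the PR's actual code):

```python
import spacy

nlp = spacy.blank("en")


def tokenize_then_lower(text: str) -> list:
    # Tokenize the original casing first (so "Mr." stays a single token
    # via spaCy's tokenizer exceptions), then lowercase each token string.
    return [token.text.lower() for token in nlp.tokenizer(text)]
```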
on WDW, the max passage length is 3090 with NLTK / SimpleWordTokenizer and 3539 with spaCy. hmm...
You might try actually inspecting that passage - could be some rare case that we don't really care about anyway, and we should just be truncating.
Since I have nothing better to do on a Sunday night, I figured I might as well look into why it takes so long to process data before we can train (inspired by a conversation with Mark on Friday afternoon).
To start off, I refactored the `pad_sequence_to_length` function and made it faster by using `list.extend` instead of repeated `list.append`. While the amortized runtime of `list.append` is O(1), you lose a ton of efficiency by going back and forth between the Python and C levels. On the other hand, `list.extend` does all of the iteration in C, which makes it a lot faster.
I'll also look into refactoring how we make word + character paddings, because that's also torturously slow right now.
Benchmarks will come in a separate comment.
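The append-vs-extend claim is easy to check with `timeit` (sketch; the function names are made up and absolute numbers will vary by machine):

```python
import timeit


def build_with_append(n: int) -> list:
    # One Python-level method dispatch per element.
    out = []
    for _ in range(n):
        out.append(0)
    return out


def build_with_extend(n: int) -> list:
    # The padding list is materialized once and copied inside C code.
    out = []
    out.extend([0] * n)
    return out


if __name__ == "__main__":
    for fn in (build_with_append, build_with_extend):
        t = timeit.timeit(lambda: fn(700), number=1000)
        print(f"{fn.__name__}: {t * 1e6 / 1000:.1f} µs per call")
```

Both produce identical lists; only the number of Python-level iterations differs.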