Closed tonyhqanguyen closed 5 years ago
Thanks for the interest! This is expected. From the Reddit README:
Further back contexts, from the comment's parent's parent etc., are stored as extra context features. Their texts are trimmed to be at most 128 characters in length, without splitting apart words. This helps to bound the size of an individual example.
The response and context features are filtered to guarantee a maximum character length, whereas the further-back contexts are trimmed to it. This means the context and response are always full comments, but the further-back ones might not be.
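To make the difference between filtering and trimming concrete, here is a minimal sketch. It is not the repository's actual code: the function names `passes_length_filter` and `trim_to_words` are hypothetical, and splitting on spaces is an assumption about what "without splitting apart words" means.

```python
def passes_length_filter(text, max_length=128):
    """Filtering: a comment is either kept whole or dropped entirely."""
    return len(text) <= max_length


def trim_to_words(text, max_length=128):
    """Trimming: shorten text to at most max_length characters, cutting at a space.

    Hypothetical helper for illustration; the real dataset code may pick
    word boundaries differently.
    """
    if len(text) <= max_length:
        return text
    # Look one character past the limit so that a word ending exactly at the
    # limit is kept, then drop everything from the last space onwards.
    window = text[:max_length + 1]
    head, sep, _ = window.rpartition(" ")
    # If there is no space at all, fall back to a hard cut.
    return head if sep else text[:max_length]


if __name__ == "__main__":
    comment = "it works fine (though to be honest I have not tested it much)"
    print(passes_length_filter(comment, max_length=25))  # too long, so it would be dropped
    print(trim_to_words(comment, max_length=25))         # shortened at a word boundary
```

Under this sketch a trimmed extra context can end mid-sentence, which matches the kind of abrupt ending described in the question, while a context or response that is too long would never appear at all.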
You can tweak this using the max_length flag, which also controls the filtering for the response and context features. You could also add a flag to disable trimming. Note, though, that any change to the defaults would make comparisons to the benchmarks invalid.
Also, the train/test split is the same for everyone, though the sharding isn't.
I see. I'm just curious: were the extra contexts used to get the benchmarks then?
From BENCHMARKS.md:
All the results are for models using only the context feature to select the correct response. Models using extra contexts are not reported here (yet).
I have finished downloading the Reddit data and I can view it pretty nicely using the provided tfrutil.py module. The problem I'm encountering is that a lot of the comments seem to be cut off abruptly. For example:
As you can see, the comments just seem to be cut off, like "(though to", ". Survival", etc. Since I'm using this data for its intended purpose, conversations, this could be pretty detrimental to the performance of the model, since the continuation doesn't make sense. Any thoughts on this?
Thanks
edit: These examples are pulled from test-00001-of-00100.tfrecords, if it helps. I don't know if your train/test splitting is random for everyone.