PolyAI-LDN / conversational-datasets

Large datasets for conversational AI

Reddit dataset seems to not be showing the full comment #45

Closed: tonyhqanguyen closed this issue 5 years ago

tonyhqanguyen commented 5 years ago

I finished downloading the Reddit data, and I can view it quite nicely using the provided tfrutil.py module. The problem I'm encountering is that a lot of the comments seem to be cut off abruptly. For example:

Example 3

[Context]: Quite a bit according to reddits own gold statistics.
[Response]: I fucking clicked that without even checking the URL first. Goddamn it. You win the day fine sir.

Extra Contexts:
[context/0]: I would like to know how much was spent on Reddit gold for people posting a Rick Roll. I bet it's way more than $12 (though to

Example 6

[Context]: Rather you're rewarded for planning ahead because you don't have a crutch "save me" button.
[Response]: Not like DBM will tell you when things are coming..

Extra Contexts:
[context/7]: Warriors on suicide watch \n \n (I don't mind the gcd change on hunters actually. And the two ranged specs seem decent. Survival
[context/6]: The Disengage on GCD is worse than any of the warriors spells.\n \n People made up stupid builds to show how you could go at
[context/5]: I dunno, I got used to the disengage change pretty quick. It was annoying at first, mostly because you feel forced to stop
[context/4]: Defensive abilities should be reactive and putting them locked because you're doing something else hurts the gameplay.\n \n
[context/3]: To be fair Disengage isn't a purely defensive spell. I use disengage offensively all the time as a survival hunter, even as MM
[context/2]: But you also use it to dodge problematic stuff while you're DPSing.
[context/1]: Which is why it has reduced gcd
[context/0]: But you're still penalized for playing to the best of your class.

As you can see, the extra contexts are cut off abruptly, like "(though to" and ". Survival". Since I'm using this data for its intended purpose, conversations, this could be pretty detrimental to model performance, because the truncated text doesn't read as a sensible continuation. Any thoughts on this?

Thanks

edit: These examples are pulled from test-00001-of-00100.tfrecords, if that helps. I don't know whether the train/test splitting is random for everyone.
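For reproduction, this is roughly how I'm peeking at the raw serialized records outside of tfrutil.py (a minimal sketch assuming TensorFlow 2 eager execution):

```python
import tensorflow as tf

# Print the first serialized example from the shard where I saw the
# truncated contexts.
dataset = tf.data.TFRecordDataset("test-00001-of-00100.tfrecords")
for raw_record in dataset.take(1):
    example = tf.train.Example.FromString(raw_record.numpy())
    print(example)  # shows the context, response, and context/N features
```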

matthen commented 5 years ago

Thanks for the interest! This is expected. From the Reddit README:

Further back contexts, from the comment's parent's parent etc., are stored as extra context features. Their texts are trimmed to be at most 128 characters in length, without splitting apart words. This helps to bound the size of an individual example.

The response and context features are filtered to guarantee a maximum character length (examples that exceed it are dropped entirely), whereas the further-back contexts are trimmed to that length. This means the context and response are always full comments, but the further-back contexts might not be.
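For intuition, the trimming behaves roughly like this (a minimal sketch of the behaviour quoted above, not our exact implementation):

```python
def trim(text, max_length=128):
    """Trim text to at most max_length characters, without splitting
    apart words (a sketch of the behaviour described in the README)."""
    if len(text) <= max_length:
        return text
    # Keep one extra character so we can tell whether the cut point
    # lands inside a word.
    trimmed = text[:max_length + 1]
    if " " in trimmed:
        # Cut back to the last space, dropping any partial word.
        trimmed = trimmed[:trimmed.rindex(" ")]
    else:
        trimmed = trimmed[:max_length]
    return trimmed
```

That's why the extra contexts end cleanly on a word boundary, as in "(though to" and ". Survival" above.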

You can tweak this using the max_length flag, which also controls the filtering of the response and context features. You could add a flag that disables trimming altogether, but note that any changes to the defaults would make comparisons to the benchmarks invalid.

And the train/test split is the same for everyone, though the sharding isn't.
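The determinism comes from deriving the split from the data itself rather than from a random seed; here's the general idea (illustrative only, not the repo's actual splitting code, and the ID and fraction are placeholders):

```python
import hashlib

def in_test_set(example_id, test_fraction=0.1):
    """Assign an example to train or test deterministically: the same
    ID hashes the same way on every machine and every run.
    (Illustrative sketch; not the repo's actual splitting logic.)"""
    digest = hashlib.sha1(example_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < test_fraction * 100
```

Sharding, on the other hand, depends on how a particular run writes out its files, so the same example can land in a differently numbered shard for different people.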

tonyhqanguyen commented 5 years ago

I see. Just out of curiosity: were the extra contexts used to produce the benchmarks, then?

matthen commented 5 years ago

From BENCHMARKS.md:

All the results are for models using only the context feature to select the correct response. Models using extra contexts are not reported here (yet).
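In input-pipeline terms, the benchmark models only ever parse those two features; roughly like this (a sketch assuming TF 2's tf.io API and the feature names tfrutil prints):

```python
import tensorflow as tf

# Only the two features the reported benchmarks consume.
feature_spec = {
    "context": tf.io.FixedLenFeature([], tf.string),
    "response": tf.io.FixedLenFeature([], tf.string),
}

def parse(serialized_example):
    return tf.io.parse_single_example(serialized_example, feature_spec)

dataset = tf.data.TFRecordDataset("test-00001-of-00100.tfrecords").map(parse)
```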