allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

Add robust median to gopher filter #98

Closed KennethEnevoldsen closed 7 months ago

KennethEnevoldsen commented 8 months ago

This pull request adds a robust median function to the gopher filter, which would otherwise fail on empty docs such as "" or " ".
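For readers following along, here is a minimal sketch of the idea, not the code merged in this PR. It assumes the failure comes from taking a median over an empty list of word lengths; the names `robust_median` and `default` are hypothetical and chosen for illustration only.

```python
import statistics
from typing import Sequence


def robust_median(values: Sequence[float], default: float = 0.0) -> float:
    """Return the median of `values`, or `default` when `values` is empty.

    Hypothetical helper: statistics.median raises StatisticsError on
    empty input, so the empty case is handled explicitly here.
    """
    if not values:
        return default
    return statistics.median(values)


# An empty or whitespace-only document splits into zero words,
# so its list of word lengths is empty.
for doc in ["", " ", "a quick test"]:
    word_lengths = [len(word) for word in doc.split()]
    print(repr(doc), robust_median(word_lengths))
```

Returning a default rather than raising lets the filter keep processing a batch that contains blank documents; whether 0.0 is the right default depends on how downstream rules interpret the statistic.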

Sorry for the multiple commits; I wanted to add @TTTTao725 as a co-author, as he found the bug.

KennethEnevoldsen commented 8 months ago

Updated to address the failing tests.

soldni commented 8 months ago

tnx!! added a comment, looks good otherwise.

soldni commented 8 months ago

For the Python style errors, you should be able to run make style to fix any issues :)

KennethEnevoldsen commented 7 months ago

Hi @soldni fixed the style errors!

KennethEnevoldsen commented 7 months ago

@soldni it seems like this is awaiting an approval for the tests to be run

soldni commented 7 months ago

Hey @KennethEnevoldsen! I've made a few more changes:

I've approved this PR, lmk if it looks good to you too before I merge!

KennethEnevoldsen commented 7 months ago

Thanks for taking care of that. Everything looks good to me.