allenai / dolma

Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0

Only the attributes written by the last tagger in the tagger list get written in version 1.0.0 #113

Closed by peterbjorgensen 7 months ago

peterbjorgensen commented 7 months ago

Since upgrading dolma to version 1.0.0, I only get the attributes from the last tagger in the list. I think the problem is here: https://github.com/allenai/dolma/blob/a74b78ac531e06adb61bf70986c8d2a3ef38e9d7/python/dolma/core/runtime.py#L198-L200

tagger_output.path is the same for all the taggers in the list, but attributes_by_stream[tagger_output.path] is reset to an empty dictionary on every iteration of the loop over the taggers. Each tagger therefore discards what the previous ones wrote, leaving only the attributes from the last tagger in the list.

This bug is not present in version 0.9.4. I would submit a pull request, but I am not sure what these three lines are supposed to fix.
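To illustrate the pattern I mean, here is a minimal sketch. It is not dolma's actual code: `TaggerOutput`, `collect_buggy`, `collect_fixed`, the two toy taggers, and the example path are all hypothetical names made up for this reproduction. It only assumes what is described above, namely that every tagger shares the same output path and that the dictionary entry for that path is reassigned inside the loop.

```python
from collections import namedtuple

# Hypothetical stand-in for a tagger's output location; not dolma's real class.
TaggerOutput = namedtuple("TaggerOutput", ["path"])


def collect_buggy(taggers, outputs, doc):
    attributes_by_stream = {}
    for name, tagger in taggers.items():
        out = outputs[name]
        # Reassigning the entry on every iteration throws away what
        # earlier taggers stored under the same path.
        attributes_by_stream[out.path] = {}
        attributes_by_stream[out.path].update(tagger(doc))
    return attributes_by_stream


def collect_fixed(taggers, outputs, doc):
    attributes_by_stream = {}
    for name, tagger in taggers.items():
        out = outputs[name]
        # Create the entry only when it is missing, so taggers that
        # share a path accumulate their attributes.
        attributes_by_stream.setdefault(out.path, {}).update(tagger(doc))
    return attributes_by_stream


# Two toy taggers writing to the same (made-up) output path.
taggers = {
    "length": lambda doc: {"length": len(doc)},
    "upper": lambda doc: {"is_upper": doc.isupper()},
}
shared = TaggerOutput(path="attributes/example.jsonl.gz")
outputs = {name: shared for name in taggers}

buggy = collect_buggy(taggers, outputs, "hello")
fixed = collect_fixed(taggers, outputs, "hello")
print(buggy)
print(fixed)
```

With the reset inside the loop, only the `upper` tagger's attribute survives. With `setdefault`, both attributes end up under the shared path. Whether this is the right fix for runtime.py depends on what the reset was originally meant to handle, which I do not know.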