KennethEnevoldsen opened this issue 11 months ago
@KennethEnevoldsen @peterbjorgensen
These are taggers that might be relevant for cleaning:
- Dataset: a subset of DAGW consisting only of documents from 'Wiki & Books'
- 1 process
# | Dolma Tagger | Description | Relation to Filtering Criteria | Process Time |
---|---|---|---|---|
1 | char_length_v1 | Computes document length in characters | Relates to character count | 30s |
2 | char_length_with_paragraphs_v1 | Computes document and paragraph length in characters | Relates to character count | 1m27s |
3 | cld2_en_doc_v2 | Detects document language using cld2 | Indirectly relates to language and possibly stopwords | 1m44s |
4 | olmo_pretokenizer_v1 | Counts number of tokens using OLMo v1 pre-tokenizer | Relates to token count | 8m08s |
5 | olmo_pretokenizer_with_paragraphs_v1 | Counts tokens in document and paragraphs using OLMo v1 pre-tokenizer | Relates to token count | 8m39s |
6 | whitespace_tokenizer_v1 | Counts whitespace-separated tokens in document | Relates to token count | 1m49s |
7 | whitespace_tokenizer_with_paragraphs_v1 | Counts whitespace-separated tokens in document and paragraphs | Relates to token count | 2m03s |
8 | random_number_v1 | Assigns a random number to each document | Facilitates dataset splitting | 24s |
9 | ft_lang_id_en_doc_v2 | Uses fastText to detect the language of the document | Indirectly relates to language and possibly stopwords | 3m20s |
10 | ft_lang_id_en_paragraph_v2 | Uses fastText to detect the language of each paragraph | Indirectly relates to language and possibly stopwords | 6m38s |
11 | ft_lang_id_en_paragraph_with_doc_score_v2 | Uses fastText to detect the language of each paragraph and assigns a score based on the fraction of English paragraphs | Indirectly relates to language and possibly stopwords | 5m32s |
Seems like we are missing the gopher filters, PII, and c4. Will also add this table as a PR with a brief introduction on how to use the dolma taggers (it can just be a reference to their documentation; a minimal invocation is sketched below).
I would also like to check which existing taggers were ignored (e.g. we discussed stopwords).
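For reference, running a single tagger is a one-liner with the dolma CLI. A minimal sketch; the documents glob and experiment name are placeholders, not our actual layout:

```bash
# Minimal sketch of running one dolma tagger over a document set.
# The glob path and experiment name are placeholders.
dolma tag \
    --documents "data/dagw_wiki_books/documents/*.jsonl.gz" \
    --taggers char_length_v1 \
    --experiment timing \
    --processes 1
```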
Will you also add the taggers implemented in our GitHub repo (see the codebase)?
Yes, we have these taggers implemented: https://github.com/centre-for-humanities-computing/danish-foundation-models/tree/main/src/dfm/common/data_cleaning/dolma_taggers
Maybe we should also include the remaining Germanic languages (nl, de)
@TTTTao725 How did you measure the processing times of the taggers in the table? Did you create a script to do this?
@peterbjorgensen No, they have a built-in timer, so you can check it right after one execution of a tagger. I'll make a PR these days; I have tested more taggers, including ones for the Scandinavian languages :)
@TTTTao725 Cool, I don't see that anywhere? If I do `dolma tag --profile.enable ...` I just get the regular cProfile-format stats. I am trying to run as many taggers as possible on the hplt dataset, but it's very slow, so I'm trying to pinpoint the slowest ones. It looks like the regex-based repetition taggers are quite slow. They are not in your list.
I believe if you just run the default you will get a time at the end. However, this seems like a valid reason to use `time`:

```bash
# note: `time` writes its report to stderr, so redirect stderr to capture it
{ time {command} ; } 2> {file to save results}.txt

# for example:
{ time sleep 2 ; } 2> time.txt
```
Yes, as Kenneth said, a default run prints the time at the end. And in case you need stats for more taggers:
# | Dolma Tagger | Description | Process Time (total, speed) |
---|---|---|---|
1 | char_length_v1 | Computes document length in characters | 16s, 16.2kd/s |
2 | char_length_with_paragraphs_v1 | Computes document and paragraph length in characters | 49s, 5.40kd/s |
3 | cld2_en_doc_v2 | Detects document language using cld2 | 56s, 4.76kd/s |
4 | olmo_pretokenizer_v1 | Counts number of tokens using OLMo v1 pre-tokenizer | 6m57s, 645d/s |
5 | olmo_pretokenizer_with_paragraphs_v1 | Counts tokens in document and paragraphs using OLMo v1 pre-tokenizer | 7m02s, 636d/s |
6 | whitespace_tokenizer_v1 | Counts whitespace-separated tokens in document | 1m00s, 4.47kd/s |
7 | whitespace_tokenizer_with_paragraphs_v1 | Counts whitespace-separated tokens in document and paragraphs | 1m39s, 2.70kd/s |
8 | random_number_v1 | Assigns a random number to each document | 17s, 15.6kd/s |
9 | ft_lang_id_en_doc_v2 | Uses fastText to detect the language of the document | 2m28s, 1.82kd/s |
10 | ft_lang_id_en_paragraph_v2 | Uses fastText to detect the language of each paragraph | 6m21s, 705d/s |
11 | ft_lang_id_en_paragraph_with_doc_score_v2 | Uses fastText to detect the language of each paragraph and assigns a score based on the fraction of English paragraphs | 6m16s, 715d/s |
12 | gopher_v1 | Tags spans of documents matching DeepMind's Gopher removal rules | 15m49s, 283d/s |
13 | c4_v1 | Implements taggers used to generate the C4 dataset | 3m50s, 1.17kd/s |
14 | c4_v2 | Faster implementation of the C4 taggers | 2m08s, 2.10kd/s |
15 | pii_presidio_v1 | Tags spans of documents that contain personally identifiable information (PII) using the Presidio Analyzer library | way too slow: about 7s per document. Note also that `analyzer_results` in `pii.py` defines the language as English (see line 110) |
16 | pii_regex_v1 | Tags spans of documents that contain personally identifiable information (PII) using a set of regular expressions | 2m55s, 1.53kd/s |
17 | pii_regex_v2 | Faster implementation of pii_regex_v1 | 2m51s, 1.57kd/s |
18 | pii_regex_with_counts_v2 | Tags spans of documents that contain personally identifiable information (PII) using a set of regular expressions. It also counts the number of matches for each regular expression | 2m43s, 1.65kd/s |
19 | pii_regex_with_counts_fast_v2 | Faster implementation of pii_regex_with_counts_v2 | 1m01s, 4.36kd/s |
20 | cld2_scandi_doc | Language Detection using cld2 | 1m11s, 3.79kd/s |
21 | cld2_scandi_paragraph | Language Detection on paragraph level using cld2 | 5m59s, 748d/s |
22 | ft_lang_id_scandi_doc | FastText Language Detection | 3m14s, 1.38kd/s |
23 | ft_lang_id_scandi_paragraph | FastText Language Detection on paragraph level | 14m06s, 318d/s |
24 | cld2_scandi_paragraph_with_doc_score | Language Detection on paragraph level with a total score using cld2 | 8m04s, 556d/s |
25 | ft_lang_id_scandi_paragraph_with_doc_score | FastText Language Detection on paragraph level with a total score | 14m37s, 306d/s |
26 | jigsaw_hatespeech_document_v2 | Tags documents as containing hate speech or not using a FastText classifier trained on the Jigsaw hate speech dataset. | 1m38s, 2.74kd/s |
27 | jigsaw_hatespeech_sentence_v2 | Tags spans of documents as containing hate speech or not using a FastText classifier trained on the Jigsaw hate speech dataset. | 9m45s, 460d/s |
28 | jigsaw_nsfw_document_v1 | Tags documents as containing NSFW content or not using a FastText classifier trained on the Jigsaw NSFW dataset. | 6m40s, 671d/s |
29 | jigsaw_nsfw_sentence_v2 | Tags spans of documents as containing NSFW content or not using a FastText classifier trained on the Jigsaw NSFW dataset. | 9m02s, 496d/s |
Thanks @TTTTao725! If you have time for the PR one of the following days, it would be great to get it merged so that you are not sitting with multiple tasks.
If I understand you correctly, you are running `dolma tag` with only one tagger at a time and recording the time it took to run that tagger?
I only get a single number for the time it took for every `dolma tag` run. I wrote a small bash loop to do this.
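Roughly like the sketch below (a reconstruction, since the exact script isn't in this thread; the tagger list and documents path are placeholders):

```bash
# Reconstructed timing loop: run `dolma tag` once per tagger so each
# progress line can be attributed to a single tagger.
for tagger in c4_v2 char_length_with_paragraphs_v1 random_number_v1; do
    echo "$tagger"
    dolma tag \
        --documents "hplt/documents/*.jsonl.gz" \
        --taggers "$tagger" \
        --experiment "$tagger" \
        --processes 1
done
```

Here are some numbers I got: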
```
c4_v2
documents: 1.00kd [00:01, 530d/s]
ccnet_perplexity_paragraph_w_doc_da
documents: 1.00kd [00:03, 271d/s]
ccnet_perplexity_paragraph_w_doc_en
documents: 1.00kd [00:03, 253d/s]
char_length_strip_ws_v1
documents: 1.00kd [00:00, 2.83kd/s]
char_length_with_paragraphs_v1
documents: 1.00kd [00:01, 729d/s]
cld2_en_paragraph_with_doc_score_v2
documents: 1.00kd [00:03, 268d/s]
cld2_scandi_paragraph_with_doc_score
documents: 1.00kd [00:10, 94.4d/s]
documents: 1.00kd [00:10, 218d/s]
code_copyright_comments_v1
documents: 1.00kd [00:00, 2.27kd/s]
code_redpajama_taggers_v1
documents: 1.00kd [00:01, 678d/s]
code_secrets_v1
documents: 1.00kd [00:40, 24.8d/s]
code_starcoder_taggers_v1
documents: 1.00kd [00:00, 2.55kd/s]
code_starcoder_taggers_v2
documents: 1.00kd [00:00, 2.19kd/s]
ft_lang_id_scandi_paragraph_with_doc_score
documents: 1.00kd [00:17, 57.0d/s]
jigsaw_hatespeech_document_v2
documents: 1.00kd [00:02, 405d/s]
jigsaw_hatespeech_sentence_v2
documents: 1.00kd [00:04, 241d/s]
jigsaw_nsfw_document_v1
documents: 1.00kd [00:02, 410d/s]
jigsaw_nsfw_sencence_v2
documents: 1.00kd [00:04, 226d/s]
not_alphanum_paragraph_v1
documents: 1.00kd [00:00, 1.58kd/s]
olmo_pretokenizer_with_paragraphs_v1
documents: 1.00kd [00:04, 230d/s]
paragraph_repetitions_v1
documents: 1.00kd [01:07, 14.8d/s]
paragraph_tokenizer_repetitions_v1
documents: 1.00kd [00:19, 51.3d/s]
pii_presidio_v1
documents: 1.00kd [01:45, 9.44d/s]
pii_regex_with_counts_fast_v2
documents: 1.00kd [00:00, 1.07kd/s]
random_number_v1
documents: 1.00kd [00:00, 2.92kd/s]
repetitions_v1
documents: 1.00kd [01:07, 14.7d/s]
tokenizer_repetitions_v1
documents: 1.00kd [00:05, 173d/s]
tokenizers_AI2_OLMo_v1
documents: 1.00kd [00:03, 271d/s]
tokenizers_EleutherAI_GPT_NeoX_20B
documents: 1.00kd [00:03, 260d/s]
uniseg_length_paragraphs_with_doc_length_v1
documents: 1.00kd [00:09, 105d/s]
whitespace_tokenizer_with_paragraphs_v1
documents: 1.00kd [00:02, 455d/s]
```
I think I should remove all the taggers that took more than 10 seconds in my example (1000 documents only), maybe except for fasttext lang id.
^yep exactly, we just did it to get an overview of which taggers were too slow to run in practice.
A small update. I split the hplt dataset into 24 files and ran the taggers I mentioned above. It has processed 9 files, but the tagger seems to have frozen. I guess it is stuck in some CPU loop, because many of the cores are still running at 100%, but it hasn't written anything to the attributes files for two days. I am not sure which taggers are causing this. Some of the attributes files end up being around 10 times larger than the actual data files.
Another update: I have run the majority of the taggers one-by-one instead of in a single run, and it seems that the tagger `not_alphanum_paragraph_v1` chokes on some of the text in the hplt dataset. It doesn't crash, but it just stops writing to the attributes files and keeps running at 100% CPU. The tagger itself looks quite simple, but it could be that some of the regex queries it implements are extremely slow on some corner-case text examples.
This seems very odd. The function is quite simple. Can you identify the examples by just running the python implementation?
I agree it's very odd. But it works when I exclude this tagger from the set. The command line tool is pure python when used for tagging. I can make a minimum working example to find the data examples it chokes on.
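Something along these lines; a sketch assuming dolma's internal `TaggerRegistry` and `Document` APIs (the module paths, constructor signature, and input path are assumptions that may differ between dolma versions):

```python
# Hypothetical MWE: time not_alphanum_paragraph_v1 on each document to
# find the inputs it chokes on. The dolma-internal imports are assumptions.
import gzip
import json
import time

from dolma.core.data_types import Document      # assumed module path
from dolma.core.registry import TaggerRegistry  # assumed module path

tagger = TaggerRegistry.get("not_alphanum_paragraph_v1")()

with gzip.open("hplt/documents/part_0.jsonl.gz", "rt") as f:  # placeholder path
    for line in f:
        row = json.loads(line)
        doc = Document(source=row["source"], version=None, id=row["id"], text=row["text"])
        start = time.perf_counter()
        tagger.predict(doc)
        elapsed = time.perf_counter() - start
        if elapsed > 1.0:  # flag suspiciously slow documents
            print(f"doc {row['id']} took {elapsed:.3f} seconds")
```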
The regex isn't really that simple, it looks like it could definitely involve a lot of backtracking and probably take forever on some edge case. I think that's likely what happens, can't you find the document it happens on?
It looks like long sequences of emojis stall the tagger forever:
```
InputSpec(id='7', text='😠 😡', source='hplt1.2', version=None) took 0.000039 seconds
InputSpec(id='4', text='😠 😡 😤 😋 😎 🌦 🌧 🌜 🌈 🏝 🎅\n\nAnti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None) took 0.000025 seconds
InputSpec(id='11', text='😠 😡 😤 😋 😎 😴 😈 😇 😕 😏 😑 👲 👮 💂 👶 ❤ 💔 💕 💘 💌 💋 🎁 💰 💍 👍 👎 👌 ✌️ 🤘 👏 🎵 ☕️ 🍵 Anti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None) took 0.000021 seconds
InputSpec(id='5', text='😠 😡 😤 😋 😎 😴 😈 😇 😕 😏 😑 👲 👮 💂 👶 ❤ 💔 💕 💘 💌 💋 🎁 💰 💍 👍 👎 👌 ✌️ 🤘 👏 🎵 ☕️ 🍵 🍺 🍷 🍼 ☀️ 🌤 🌦 🌧 🌜 🌈 🏝 🎅\n\nAnti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None) took 64.204857 seconds
InputSpec(id='3', text='\nGæstebogs indlæg: \n😄 😃 😊 😉 😍 😚 😗 😜 😛 😳 😁 😬 😌 😞 😢 😂 😭 😅 😓 😩 😮 😱 😠 😡 😤 😋 😎 😴 😈 😇 😕 😏 😑 👲 👮 💂 👶 ❤ 💔 💕 💘 💌 💋 🎁 💰 💍 👍 👎 👌 ✌️ 🤘 👏 🎵 ☕️ 🍵 🍺 🍷 🍼 ☀️ 🌤 🌦 🌧 🌜 🌈 🏝 🎅\n\nAnti-Spam: \nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None) ... takes 'forever'
```
Hmm, that is odd. A solution might be to wrap it in a timeout (https://stackoverflow.com/questions/492519/timeout-on-a-function-call) and then simply have it return NaN?
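A minimal sketch of that idea (not dolma code; `signal.alarm` is Unix-only and main-thread-only, and the wrapper names here are made up):

```python
import math
import signal

class _Timeout(Exception):
    pass

def _raise_timeout(signum, frame):
    raise _Timeout

def predict_with_timeout(func, doc, seconds=5):
    """Run func(doc), returning NaN if it exceeds the time budget."""
    signal.signal(signal.SIGALRM, _raise_timeout)
    signal.alarm(seconds)  # whole seconds only
    try:
        return func(doc)
    except _Timeout:
        return math.nan  # give up on pathological inputs
    finally:
        signal.alarm(0)  # cancel any pending alarm
```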
I think it would be better to either fix this tagger or add one that works. I don't understand why it is so slow: it basically checks whether the string has any alphanumeric character in it, and if not, it checks whether all the characters are punctuation. Seems like a bug in the regex implementation. I am not sure why they use a third-party Python package called `regex` instead of the standard library `re`. If I swap the `regex` package for the standard `re` one, it works with no problems. I will report the bug upstream.
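For illustration, the kind of swap meant here; the pattern below is illustrative, not dolma's actual one:

```python
# The tagger's checks are plain character-class patterns, so the stdlib
# engine can be dropped in for the third-party one.
import re             # stdlib engine: no stall on the emoji inputs above
# import regex as re  # third-party engine, the one that stalled

NOT_ALNUM = re.compile(r"^\W+$")  # illustrative pattern only
print(bool(NOT_ALNUM.match("😠 😡 😤 ✌️ 🤘")))  # True: no alphanumerics
```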
Agreed with @peterbjorgensen that it would be a great idea to create an overview of what taggers might be relevant for cleaning.
Outlining