KennethEnevoldsen opened this issue 11 months ago
@KennethEnevoldsen @peterbjorgensen
These are taggers that might be relevant for cleaning:
- Dataset: a subset of DAGW consisting only of documents from 'Wiki & Books'
- 1 process
# | Dolma Tagger | Description | Relation to Filtering Criteria | Process Time |
---|---|---|---|---|
1 | char_length_v1 | Computes document length in characters | Relates to character count | 30s |
2 | char_length_with_paragraphs_v1 | Computes document and paragraph length in characters | Relates to character count | 1m27s |
3 | cld2_en_doc_v2 | Detects document language using cld2 | Indirectly relates to language and possibly stopwords | 1m44s |
4 | olmo_pretokenizer_v1 | Counts number of tokens using OLMo v1 pre-tokenizer | Relates to token count | 8m08s |
5 | olmo_pretokenizer_with_paragraphs_v1 | Counts tokens in document and paragraphs using OLMo v1 pre-tokenizer | Relates to token count | 8m39s |
6 | whitespace_tokenizer_v1 | Counts whitespace-separated tokens in document | Relates to token count | 1m49s |
7 | whitespace_tokenizer_with_paragraphs_v1 | Counts whitespace-separated tokens in document and paragraphs | Relates to token count | 2m03s |
8 | random_number_v1 | Assigns a random number to each document | Facilitates dataset splitting | 24s |
9 | ft_lang_id_en_doc_v2 | Uses fastText to detect the language of the document | Indirectly relates to language and possibly stopwords | 3m20s |
10 | ft_lang_id_en_paragraph_v2 | Uses fastText to detect the language of each paragraph | Indirectly relates to language and possibly stopwords | 6m38s |
11 | ft_lang_id_en_paragraph_with_doc_score_v2 | Uses fastText to detect the language of each paragraph and assigns a score based on the fraction of English paragraphs | Indirectly relates to language and possibly stopwords | 5m32s |
Seems like we are missing the gopher filters, PII, and c4. Will also add this table as a PR with a brief introduction on how to use the dolma taggers (it can just be a reference to their documentation; a minimal invocation is sketched below).
I would also like to check which existing taggers were ignored (e.g. we discussed stopwords).
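For reference, running a single tagger is a one-liner with the dolma CLI. A minimal sketch; the documents glob and experiment name are placeholders, not our actual layout:

```bash
# Minimal sketch of running one dolma tagger over a document set.
# The glob path and experiment name are placeholders.
dolma tag \
    --documents "data/dagw_wiki_books/documents/*.jsonl.gz" \
    --taggers char_length_v1 \
    --experiment timing \
    --processes 1
```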
Will you also add the taggers implemented in our GitHub repo (see the codebase)?
Yes, we have these taggers implemented: https://github.com/centre-for-humanities-computing/danish-foundation-models/tree/main/src/dfm/common/data_cleaning/dolma_taggers
Maybe we should also include the remaining Germanic languages (nl, de)
@TTTTao725 How did you measure the processing times of the taggers in the table? Did you create a script to do this?
@peterbjorgensen No, they have a built-in timer, so you can check it right after one execution of a tagger. I'll make a PR these days; I have tested more taggers, including ones for the Scandinavian languages :)
@TTTTao725 Cool, I don't see that anywhere? If I do `dolma tag --profile.enable ...` I just get the regular cProfile-format stats. I am trying to run as many taggers as possible on the hplt dataset, but it's very slow, so I'm trying to pinpoint the slowest ones. It looks like the regex-based repetition taggers are quite slow. They are not in your list.
I believe if you just run the default you will get a time at the end. However, this seems like a valid reason to use `time`:

```bash
# note: `time` writes its report to stderr, so redirect stderr to capture it
{ time {command} ; } 2> {file to save results}.txt

# for example:
{ time sleep 2 ; } 2> time.txt
```
Yes, as Kenneth said, a default run prints the time at the end. And in case you need stats for more taggers:
# | Dolma Tagger | Description | Process Time (total, speed) |
---|---|---|---|
1 | char_length_v1 | Computes document length in characters | 16s, 16.2kd/s |
2 | char_length_with_paragraphs_v1 | Computes document and paragraph length in characters | 49s, 5.40kd/s |
3 | cld2_en_doc_v2 | Detects document language using cld2 | 56s, 4.76kd/s |
4 | olmo_pretokenizer_v1 | Counts number of tokens using OLMo v1 pre-tokenizer | 6m57s, 645d/s |
5 | olmo_pretokenizer_with_paragraphs_v1 | Counts tokens in document and paragraphs using OLMo v1 pre-tokenizer | 7m02s, 636d/s |
6 | whitespace_tokenizer_v1 | Counts whitespace-separated tokens in document | 1m00s, 4.47kd/s |
7 | whitespace_tokenizer_with_paragraphs_v1 | Counts whitespace-separated tokens in document and paragraphs | 1m39s, 2.70kd/s |
8 | random_number_v1 | Assigns a random number to each document | 17s, 15.6kd/s |
9 | ft_lang_id_en_doc_v2 | Uses fastText to detect the language of the document | 2m28s, 1.82kd/s |
10 | ft_lang_id_en_paragraph_v2 | Uses fastText to detect the language of each paragraph | 6m21s, 705d/s |
11 | ft_lang_id_en_paragraph_with_doc_score_v2 | Uses fastText to detect the language of each paragraph and assigns a score based on the fraction of English paragraphs | 6m16s, 715d/s |
12 | gopher_v1 | Tags spans of documents matching DeepMind's Gopher removal rules | 15m49s, 283d/s |
13 | c4_v1 | Implements taggers used to generate the C4 dataset | 3m50s, 1.17kd/s |
14 | c4_v2 | Faster implementation of the C4 taggers | 2m08s, 2.10kd/s |
15 | pii_presidio_v1 | Tags spans of documents that contain personally identifiable information (PII) using the Presidio Analyzer library | way too slow: about 7s per document. Note also that `analyzer_results` in `pii.py` defines the language as English (see line 110) |
16 | pii_regex_v1 | Tags spans of documents that contain personally identifiable information (PII) using a set of regular expressions | 2m55s, 1.53kd/s |
17 | pii_regex_v2 | Faster implementation of pii_regex_v1 | 2m51s, 1.57kd/s |
18 | pii_regex_with_counts_v2 | Tags spans of documents that contain personally identifiable information (PII) using a set of regular expressions. It also counts the number of matches for each regular expression | 2m43s, 1.65kd/s |
19 | pii_regex_with_counts_fast_v2 | Faster implementation of pii_regex_with_counts_v2 | 1m01s, 4.36kd/s |
20 | cld2_scandi_doc | Language Detection using cld2 | 1m11s, 3.79kd/s |
21 | cld2_scandi_paragraph | Language Detection on paragraph level using cld2 | 5m59s, 748d/s |
22 | ft_lang_id_scandi_doc | FastText Language Detection | 3m14s, 1.38kd/s |
23 | ft_lang_id_scandi_paragraph | FastText Language Detection on paragraph level | 14m06s, 318d/s |
24 | cld2_scandi_paragraph_with_doc_score | Language Detection on paragraph level with a total score using cld2 | 8m04s, 556d/s |
25 | ft_lang_id_scandi_paragraph_with_doc_score | FastText Language Detection on paragraph level with a total score | 14m37s, 306d/s |
26 | jigsaw_hatespeech_document_v2 | Tags documents as containing hate speech or not using a FastText classifier trained on the Jigsaw hate speech dataset. | 1m38s, 2.74kd/s |
27 | jigsaw_hatespeech_sentence_v2 | Tags spans of documents as containing hate speech or not using a FastText classifier trained on the Jigsaw hate speech dataset. | 9m45s, 460d/s |
28 | jigsaw_nsfw_document_v1 | Tags documents as containing NSFW content or not using a FastText classifier trained on the Jigsaw NSFW dataset. | 6m40s, 671d/s |
29 | jigsaw_nsfw_sentence_v2 | Tags spans of documents as containing NSFW content or not using a FastText classifier trained on the Jigsaw NSFW dataset. | 9m02s, 496d/s |
Thanks @TTTTao725! If you have time for the PR one of the following days, it would be great to get it merged so that you are not sitting with multiple tasks.
If I understand you correctly, you are running `dolma tag` with only one tagger at a time and recording the time it took to run that tagger?
I only get a single number for the time it took for every `dolma tag` run. I wrote a small bash loop to do this.
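Roughly like the sketch below (a reconstruction, since the exact script isn't in this thread; the tagger list and documents path are placeholders):

```bash
# Reconstructed timing loop: run `dolma tag` once per tagger so each
# progress line can be attributed to a single tagger.
for tagger in c4_v2 char_length_with_paragraphs_v1 random_number_v1; do
    echo "$tagger"
    dolma tag \
        --documents "hplt/documents/*.jsonl.gz" \
        --taggers "$tagger" \
        --experiment "$tagger" \
        --processes 1
done
```

Here are some numbers I got: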
```
c4_v2
documents: 1.00kd [00:01, 530d/s]
ccnet_perplexity_paragraph_w_doc_da
documents: 1.00kd [00:03, 271d/s]
ccnet_perplexity_paragraph_w_doc_en
documents: 1.00kd [00:03, 253d/s]
char_length_strip_ws_v1
documents: 1.00kd [00:00, 2.83kd/s]
char_length_with_paragraphs_v1
documents: 1.00kd [00:01, 729d/s]
cld2_en_paragraph_with_doc_score_v2
documents: 1.00kd [00:03, 268d/s]
cld2_scandi_paragraph_with_doc_score
documents: 1.00kd [00:10, 94.4d/s]
documents: 1.00kd [00:10, 218d/s]
code_copyright_comments_v1
documents: 1.00kd [00:00, 2.27kd/s]
code_redpajama_taggers_v1
documents: 1.00kd [00:01, 678d/s]
code_secrets_v1
documents: 1.00kd [00:40, 24.8d/s]
code_starcoder_taggers_v1
documents: 1.00kd [00:00, 2.55kd/s]
code_starcoder_taggers_v2
documents: 1.00kd [00:00, 2.19kd/s]
ft_lang_id_scandi_paragraph_with_doc_score
documents: 1.00kd [00:17, 57.0d/s]
jigsaw_hatespeech_document_v2
documents: 1.00kd [00:02, 405d/s]
jigsaw_hatespeech_sentence_v2
documents: 1.00kd [00:04, 241d/s]
jigsaw_nsfw_document_v1
documents: 1.00kd [00:02, 410d/s]
jigsaw_nsfw_sencence_v2
documents: 1.00kd [00:04, 226d/s]
not_alphanum_paragraph_v1
documents: 1.00kd [00:00, 1.58kd/s]
olmo_pretokenizer_with_paragraphs_v1
documents: 1.00kd [00:04, 230d/s]
paragraph_repetitions_v1
documents: 1.00kd [01:07, 14.8d/s]
paragraph_tokenizer_repetitions_v1
documents: 1.00kd [00:19, 51.3d/s]
pii_presidio_v1
documents: 1.00kd [01:45, 9.44d/s]
pii_regex_with_counts_fast_v2
documents: 1.00kd [00:00, 1.07kd/s]
random_number_v1
documents: 1.00kd [00:00, 2.92kd/s]
repetitions_v1
documents: 1.00kd [01:07, 14.7d/s]
tokenizer_repetitions_v1
documents: 1.00kd [00:05, 173d/s]
tokenizers_AI2_OLMo_v1
documents: 1.00kd [00:03, 271d/s]
tokenizers_EleutherAI_GPT_NeoX_20B
documents: 1.00kd [00:03, 260d/s]
uniseg_length_paragraphs_with_doc_length_v1
documents: 1.00kd [00:09, 105d/s]
whitespace_tokenizer_with_paragraphs_v1
documents: 1.00kd [00:02, 455d/s]
```
I think I should remove all the taggers that took more than 10 seconds in my example (1000 documents only), maybe except for fasttext lang id.
^yep exactly, we just did it to get an overview of which taggers were too slow to run in practice.
A small update. I split the hplt dataset into 24 files and ran the taggers I mentioned above. It has processed 9 files, but the tagger seems to have frozen. I guess it is stuck in some CPU loop, because many of the cores are still running at 100%, but it hasn't written anything to the attributes files for two days. I am not sure which taggers are causing this. Some of the attributes files end up being around 10 times larger than the actual data files.
Another update: I have run the majority of the taggers one-by-one instead of in a single run, and it seems that the tagger `not_alphanum_paragraph_v1` chokes on some of the text in the hplt dataset. It doesn't crash, but it just stops writing to the attributes files and keeps running at 100% CPU. The tagger itself looks quite simple, but it could be that some of the regex queries it implements are extremely slow on some corner-case text examples.
This seems very odd. The function is quite simple. Can you identify the examples by just running the python implementation?
I agree it's very odd. But it works when I exclude this tagger from the set. The command line tool is pure python when used for tagging. I can make a minimum working example to find the data examples it chokes on.
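Something along these lines; a sketch assuming dolma's internal `TaggerRegistry` and `Document` APIs (the module paths, constructor signature, and input path are assumptions that may differ between dolma versions):

```python
# Hypothetical MWE: time not_alphanum_paragraph_v1 on each document to
# find the inputs it chokes on. The dolma-internal imports are assumptions.
import gzip
import json
import time

from dolma.core.data_types import Document      # assumed module path
from dolma.core.registry import TaggerRegistry  # assumed module path

tagger = TaggerRegistry.get("not_alphanum_paragraph_v1")()

with gzip.open("hplt/documents/part_0.jsonl.gz", "rt") as f:  # placeholder path
    for line in f:
        row = json.loads(line)
        doc = Document(source=row["source"], version=None, id=row["id"], text=row["text"])
        start = time.perf_counter()
        tagger.predict(doc)
        elapsed = time.perf_counter() - start
        if elapsed > 1.0:  # flag suspiciously slow documents
            print(f"doc {row['id']} took {elapsed:.3f} seconds")
```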
The regex isn't really that simple, it looks like it could definitely involve a lot of backtracking and probably take forever on some edge case. I think that's likely what happens, can't you find the document it happens on?
It looks like long sequences of emojis stall the tagger forever:
```
InputSpec(id='7', text='😠 😡', source='hplt1.2', version=None) took 0.000039 seconds
InputSpec(id='4', text='😠 😡 😤 😋 😎 🌦 🌧 🌜 🌈 🏝 🎅\n\nAnti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None) took 0.000025 seconds
InputSpec(id='11', text='😠 😡 😤 😋 😎 😴 😈 😇 😕 😏 😑 👲 👮 💂 👶 ❤ 💔 💕 💘 💌 💋 🎁 💰 💍 👍 👎 👌 ✌️ 🤘 👏 🎵 ☕️ 🍵 Anti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None) took 0.000021 seconds
InputSpec(id='5', text='😠 😡 😤 😋 😎 😴 😈 😇 😕 😏 😑 👲 👮 💂 👶 ❤ 💔 💕 💘 💌 💋 🎁 💰 💍 👍 👎 👌 ✌️ 🤘 👏 🎵 ☕️ 🍵 🍺 🍷 🍼 ☀️ 🌤 🌦 🌧 🌜 🌈 🏝 🎅\n\nAnti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None) took 64.204857 seconds
InputSpec(id='3', text='\nGæstebogs indlæg: \n😄 😃 😊 😉 😍 😚 😗 😜 😛 😳 😁 😬 😌 😞 😢 😂 😭 😅 😓 😩 😮 😱 😠 😡 😤 😋 😎 😴 😈 😇 😕 😏 😑 👲 👮 💂 👶 ❤ 💔 💕 💘 💌 💋 🎁 💰 💍 👍 👎 👌 ✌️ 🤘 👏 🎵 ☕️ 🍵 🍺 🍷 🍼 ☀️ 🌤 🌦 🌧 🌜 🌈 🏝 🎅\n\nAnti-Spam: \nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None) ... takes 'forever'
```
Hmm, that is odd. A solution might be to wrap it in a timeout (https://stackoverflow.com/questions/492519/timeout-on-a-function-call) and then simply have it return NaN?
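A minimal sketch of that idea (not dolma code; `signal.alarm` is Unix-only and main-thread-only, and the wrapper names here are made up):

```python
import math
import signal

class _Timeout(Exception):
    pass

def _raise_timeout(signum, frame):
    raise _Timeout

def predict_with_timeout(func, doc, seconds=5):
    """Run func(doc), returning NaN if it exceeds the time budget."""
    signal.signal(signal.SIGALRM, _raise_timeout)
    signal.alarm(seconds)  # whole seconds only
    try:
        return func(doc)
    except _Timeout:
        return math.nan  # give up on pathological inputs
    finally:
        signal.alarm(0)  # cancel any pending alarm
```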
I think it would be better to either fix this tagger or add one that works. I don't understand why it is so slow: it basically checks whether the string has any alphanumeric character in it, and if not, it checks whether all the characters are punctuation. Seems like a bug in the regex implementation. I am not sure why they use a third-party Python package called `regex` instead of the standard library `re`. If I swap the `regex` package for the standard `re` one, it works with no problems. I will report the bug upstream.
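For illustration, the kind of swap meant here; the pattern below is illustrative, not dolma's actual one:

```python
# The tagger's checks are plain character-class patterns, so the stdlib
# engine can be dropped in for the third-party one.
import re             # stdlib engine: no stall on the emoji inputs above
# import regex as re  # third-party engine, the one that stalled

NOT_ALNUM = re.compile(r"^\W+$")  # illustrative pattern only
print(bool(NOT_ALNUM.match("😠 😡 😤 ✌️ 🤘")))  # True: no alphanumerics
```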
Agreed with @peterbjorgensen that it would be a great idea to create an overview of what taggers might be relevant for cleaning.
Outlining