centre-for-humanities-computing / danish-foundation-models

A project for training foundational Danish language models
https://foundationmodels.dk
MIT License

Create an overview of cleaning taggers #207

Open KennethEnevoldsen opened 11 months ago

KennethEnevoldsen commented 11 months ago

Agreed with @peterbjorgensen that it would be a great idea to create an overview of which taggers might be relevant for cleaning.

Outlining

TTTTao725 commented 10 months ago

@KennethEnevoldsen @peterbjorgensen

These are taggers that might be relevant for cleaning:

  • Dataset: a subset of DAGW consisting only of documents from 'Wiki & Books'
  • 1 process

| # | Dolma Tagger | Description | Relation to Filtering Criteria | Process Time |
|---|---|---|---|---|
| 1 | char_length_v1 | Computes document length in characters | Relates to character count | 30s |
| 2 | char_length_with_paragraphs_v1 | Computes document and paragraph length in characters | Relates to character count | 1m27s |
| 3 | cld2_en_doc_v2 | Detects document language using cld2 | Indirectly relates to language and possibly stopwords | 1m44s |
| 4 | olmo_pretokenizer_v1 | Counts number of tokens using OLMo v1 pre-tokenizer | Relates to token count | 8m08s |
| 5 | olmo_pretokenizer_with_paragraphs_v1 | Counts tokens in document and paragraphs using OLMo v1 pre-tokenizer | Relates to token count | 8m39s |
| 6 | whitespace_tokenizer_v1 | Counts whitespace-separated tokens in document | Relates to token count | 1m49s |
| 7 | whitespace_tokenizer_with_paragraphs_v1 | Counts whitespace-separated tokens in document and paragraphs | Relates to token count | 2m03s |
| 8 | random_number_v1 | Assigns a random number to each document | Facilitates dataset splitting | 24s |
| 9 | ft_lang_id_en_doc_v2 | Uses fastText to detect the language of the document | Indirectly relates to language and possibly stopwords | 3m20s |
| 10 | ft_lang_id_en_paragraph_v2 | Uses fastText to detect the language of each paragraph | Indirectly relates to language and possibly stopwords | 6m38s |
| 11 | ft_lang_id_en_paragraph_with_doc_score_v2 | Uses fastText to detect the language of each paragraph and assigns a score based on the fraction of English paragraphs | Indirectly relates to language and possibly stopwords | 5m32s |

KennethEnevoldsen commented 10 months ago

Seems like we are missing the Gopher filters, PII and C4. Will also add this table as a PR with a brief introduction on how to use the dolma tagger (it can just be a reference to their documentation).

I would also like to check which existing taggers were ignored (e.g. we discussed stopwords).

Will you also add the taggers implemented in our GitHub (see codebase)?

peterbjorgensen commented 10 months ago

Yes, we have these taggers implemented: https://github.com/centre-for-humanities-computing/danish-foundation-models/tree/main/src/dfm/common/data_cleaning/dolma_taggers

Maybe we should also include the remaining Germanic languages (nl, de)

peterbjorgensen commented 10 months ago

@TTTTao725 How did you measure the processing times of the taggers in the table? Did you create a script to do this?

TTTTao725 commented 10 months ago

@peterbjorgensen No, they have a built-in timer, so you can read the time right after one execution of a tagger. I'll make a PR in the next few days; I have tested more taggers, including ones for Scandinavian languages :)

peterbjorgensen commented 10 months ago

@TTTTao725 Cool, but I don't see that anywhere. If I do dolma tag --profile.enable ... I just get the regular cProfile-format stats. I am trying to run as many taggers as possible on the hplt dataset, but it's very slow, so I'm trying to pinpoint the slowest ones. It looks like the regex-based repetition taggers are quite slow. They are not in your list.

KennethEnevoldsen commented 10 months ago

I believe that if you just run it with the defaults you will get a time at the end. However, this seems like a valid reason to use time:

# In bash, time reports on stderr, so group the command and redirect stderr:
{ time {command}; } 2> {file to save results}.txt

{ time sleep 2; } 2> time.txt
TTTTao725 commented 10 months ago

Yes, as Kenneth said.

And in case you need stats of more taggers:

| # | Dolma Tagger | Description | Process Time (total, speed) |
|---|---|---|---|
| 1 | char_length_v1 | Computes document length in characters | 16s, 16.2kd/s |
| 2 | char_length_with_paragraphs_v1 | Computes document and paragraph length in characters | 49s, 5.40kd/s |
| 3 | cld2_en_doc_v2 | Detects document language using cld2 | 56s, 4.76kd/s |
| 4 | olmo_pretokenizer_v1 | Counts number of tokens using OLMo v1 pre-tokenizer | 6m57s, 645d/s |
| 5 | olmo_pretokenizer_with_paragraphs_v1 | Counts tokens in document and paragraphs using OLMo v1 pre-tokenizer | 7m02s, 636d/s |
| 6 | whitespace_tokenizer_v1 | Counts whitespace-separated tokens in document | 1m00s, 4.47kd/s |
| 7 | whitespace_tokenizer_with_paragraphs_v1 | Counts whitespace-separated tokens in document and paragraphs | 1m39s, 2.70kd/s |
| 8 | random_number_v1 | Assigns a random number to each document | 17s, 15.6kd/s |
| 9 | ft_lang_id_en_doc_v2 | Uses fastText to detect the language of the document | 2m28s, 1.82kd/s |
| 10 | ft_lang_id_en_paragraph_v2 | Uses fastText to detect the language of each paragraph | 6m21s, 705d/s |
| 11 | ft_lang_id_en_paragraph_with_doc_score_v2 | Uses fastText to detect the language of each paragraph and assigns a score based on the fraction of English paragraphs | 6m16s, 715d/s |
| 12 | gopher_v1 | Tags spans of documents matching Deepmind's Gopher removal rules | 15m49s, 283d/s |
| 13 | c4_v1 | Implements taggers used to generate the C4 dataset | 3m50s, 1.17kd/s |
| 14 | c4_v2 | Faster implementation of the C4 taggers | 2m08s, 2.10kd/s |
| 15 | pii_presidio_v1 | Tags spans of documents that contain personally identifiable information (PII) using the Presidio Analyzer library | Way too slow: about 7s per document. However, analyzer_results in pii.py defines the language as English if . See line 110 in here |
| 16 | pii_regex_v1 | Tags spans of documents that contain personally identifiable information (PII) using a set of regular expressions | 2m55s, 1.53kd/s |
| 17 | pii_regex_v2 | Faster implementation of pii_regex_v1 | 2m51s, 1.57kd/s |
| 18 | pii_regex_with_counts_v2 | Tags spans of documents that contain personally identifiable information (PII) using a set of regular expressions. It also counts the number of matches for each regular expression | 2m43s, 1.65kd/s |
| 19 | pii_regex_with_counts_fast_v2 | Faster implementation of pii_regex_with_counts_v2 | 1m01s, 4.36kd/s |
| 20 | cld2_scandi_doc | Language detection using cld2 | 1m11s, 3.79kd/s |
| 21 | cld2_scandi_paragraph | Language detection on paragraph level using cld2 | 5m59s, 748d/s |
| 22 | ft_lang_id_scandi_doc | FastText language detection | 3m14s, 1.38kd/s |
| 23 | ft_lang_id_scandi_paragraph | FastText language detection on paragraph level | 14m06s, 318d/s |
| 24 | cld2_scandi_paragraph_with_doc_score | Language detection on paragraph level with a total score using cld2 | 8m04s, 556d/s |
| 25 | ft_lang_id_scandi_paragraph_with_doc_score | FastText language detection on paragraph level with a total score | 14m37s, 306d/s |
| 26 | jigsaw_hatespeech_document_v2 | Tags documents as containing hate speech or not using a FastText classifier trained on the Jigsaw hate speech dataset | 1m38s, 2.74kd/s |
| 27 | jigsaw_hatespeech_sentence_v2 | Tags spans of documents as containing hate speech or not using a FastText classifier trained on the Jigsaw hate speech dataset | 9m45s, 460d/s |
| 28 | jigsaw_nsfw_document_v1 | Tags documents as containing NSFW content or not using a FastText classifier trained on the Jigsaw NSFW dataset | 6m40s, 671d/s |
| 29 | jigsaw_nsfw_sentence_v2 | Tags spans of documents as containing NSFW content or not using a FastText classifier trained on the Jigsaw NSFW dataset | 9m02s, 496d/s |

KennethEnevoldsen commented 10 months ago

Thanks @TTTTao725. If you have time for the PR in the next few days, it would be great to get it merged so that you are not sitting with multiple tasks.

peterbjorgensen commented 10 months ago

If I understand you correctly, you run dolma tag with only one tagger at a time and record how long that run takes? I only get a single time for each dolma tag run.

I wrote a small bash loop to do this and here are some numbers I got:

c4_v2
documents: 1.00kd [00:01, 530d/s]

ccnet_perplexity_paragraph_w_doc_da
documents: 1.00kd [00:03, 271d/s]

ccnet_perplexity_paragraph_w_doc_en
documents: 1.00kd [00:03, 253d/s]

char_length_strip_ws_v1
documents: 1.00kd [00:00, 2.83kd/s]

char_length_with_paragraphs_v1
documents: 1.00kd [00:01, 729d/s]

cld2_en_paragraph_with_doc_score_v2
documents: 1.00kd [00:03, 268d/s]

cld2_scandi_paragraph_with_doc_score
documents: 1.00kd [00:10, 94.4d/s]
documents: 1.00kd [00:10, 218d/s]

code_copyright_comments_v1
documents: 1.00kd [00:00, 2.27kd/s]

code_redpajama_taggers_v1
documents: 1.00kd [00:01, 678d/s]

code_secrets_v1
documents: 1.00kd [00:40, 24.8d/s]

code_starcoder_taggers_v1
documents: 1.00kd [00:00, 2.55kd/s]

code_starcoder_taggers_v2
documents: 1.00kd [00:00, 2.19kd/s]

ft_lang_id_scandi_paragraph_with_doc_score
documents: 1.00kd [00:17, 57.0d/s]

jigsaw_hatespeech_document_v2
documents: 1.00kd [00:02, 405d/s]

jigsaw_hatespeech_sentence_v2
documents: 1.00kd [00:04, 241d/s]

jigsaw_nsfw_document_v1
documents: 1.00kd [00:02, 410d/s]

jigsaw_nsfw_sencence_v2
documents: 1.00kd [00:04, 226d/s]

not_alphanum_paragraph_v1
documents: 1.00kd [00:00, 1.58kd/s]

olmo_pretokenizer_with_paragraphs_v1
documents: 1.00kd [00:04, 230d/s]

paragraph_repetitions_v1
documents: 1.00kd [01:07, 14.8d/s]

paragraph_tokenizer_repetitions_v1
documents: 1.00kd [00:19, 51.3d/s]

pii_presidio_v1
documents: 1.00kd [01:45, 9.44d/s]

pii_regex_with_counts_fast_v2
documents: 1.00kd [00:00, 1.07kd/s]

random_number_v1
documents: 1.00kd [00:00, 2.92kd/s]

repetitions_v1
documents: 1.00kd [01:07, 14.7d/s]

tokenizer_repetitions_v1
documents: 1.00kd [00:05, 173d/s]

tokenizers_AI2_OLMo_v1
documents: 1.00kd [00:03, 271d/s]

tokenizers_EleutherAI_GPT_NeoX_20B
documents: 1.00kd [00:03, 260d/s]

uniseg_length_paragraphs_with_doc_length_v1
documents: 1.00kd [00:09, 105d/s]

whitespace_tokenizer_with_paragraphs_v1
documents: 1.00kd [00:02, 455d/s]

I think I should remove all the taggers that took more than 10 seconds in my example (1000 documents only), maybe except for fasttext lang id.
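
The 10-second cut-off can be applied mechanically to progress lines like the ones above. The following is a minimal sketch, assuming the lines have the `documents: 1.00kd [MM:SS, rate]` shape shown in this thread; the helper names `elapsed_seconds` and `slow_taggers` are hypothetical, and the sample values are copied from the numbers listed above.

```python
import re

# Matches the elapsed part of a progress line such as
# "documents: 1.00kd [01:07, 14.7d/s]" (optionally with a leading hours field).
PROGRESS = re.compile(r"\[(?:(\d+):)?(\d+):(\d+),")


def elapsed_seconds(progress_line: str) -> int:
    """Return the elapsed time of one progress line in whole seconds."""
    match = PROGRESS.search(progress_line)
    if match is None:
        raise ValueError(f"not a progress line: {progress_line!r}")
    hours, minutes, seconds = match.groups()
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds)


def slow_taggers(results: dict[str, str], limit_seconds: int = 10) -> list[str]:
    """Return the names of taggers whose run took longer than the limit."""
    return [
        name
        for name, line in results.items()
        if elapsed_seconds(line) > limit_seconds
    ]


# A few of the measurements from this thread (1000 documents each).
results = {
    "c4_v2": "documents: 1.00kd [00:01, 530d/s]",
    "code_secrets_v1": "documents: 1.00kd [00:40, 24.8d/s]",
    "paragraph_repetitions_v1": "documents: 1.00kd [01:07, 14.8d/s]",
    "pii_presidio_v1": "documents: 1.00kd [01:45, 9.44d/s]",
    "random_number_v1": "documents: 1.00kd [00:00, 2.92kd/s]",
}

print(slow_taggers(results))
```

The comparison is strict, so a tagger that took exactly 10 seconds is kept.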

KennethEnevoldsen commented 10 months ago

^yep exactly, we just did it to get an overview of which taggers were too slow to run in practice.

peterbjorgensen commented 9 months ago

A small update. I split the hplt dataset into 24 files and ran the taggers I mentioned above. It has processed 9 files, but the tagger seems to have frozen. I guess it is stuck in some CPU loop, because it is still running at 100% on many of the cores but hasn't written anything to the attributes files for two days. I am not sure which taggers are causing this. Some of the attributes files end up being around 10 times larger than the actual data files.

peterbjorgensen commented 9 months ago

Another update: I have run the majority of the taggers one by one instead of in a single run, and it seems that the tagger not_alphanum_paragraph_v1 chokes on some of the text in the hplt dataset. It doesn't crash; it just stops writing to the attributes files and keeps running at 100% CPU. The tagger itself looks quite simple, but it could be that some of the regex queries it implements are extremely slow on some corner-case text examples.

KennethEnevoldsen commented 9 months ago

This seems very odd. The function is quite simple. Can you identify the examples by just running the python implementation?

peterbjorgensen commented 9 months ago

I agree it's very odd. But it works when I exclude this tagger from the set. The command line tool is pure Python when used for tagging. I can make a minimal working example to find the data examples it chokes on.
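
Such a minimal example could look roughly like the sketch below. It times one function call per document and prints the document id before the call, so that the last id printed identifies the document a run is stuck on. `toy_tagger` is a hypothetical stand-in, not the dolma tagger, and `time_per_document` is an assumed helper name.

```python
import time
from typing import Callable, Iterable


def time_per_document(
    func: Callable[[str], object],
    documents: Iterable[tuple[str, str]],
) -> list[tuple[str, float]]:
    """Run func on each (id, text) pair and return timings, slowest first."""
    timings = []
    for doc_id, text in documents:
        # Flush before the call: if func never returns, this id is the culprit.
        print(f"starting id={doc_id}", flush=True)
        start = time.perf_counter()
        func(text)
        timings.append((doc_id, time.perf_counter() - start))
    return sorted(timings, key=lambda item: item[1], reverse=True)


def toy_tagger(text: str) -> float:
    """Hypothetical stand-in: fraction of characters that are not alphanumeric."""
    return sum(not ch.isalnum() for ch in text) / max(len(text), 1)


docs = [
    ("7", "😠 😡"),
    ("4", "Spørgsmål: Hvad er summen af (total sum of) 9+3"),
]

for doc_id, seconds in time_per_document(toy_tagger, docs):
    print(f"id={doc_id} took {seconds:.6f} seconds")
```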

rlrs commented 9 months ago

The regex isn't really that simple, it looks like it could definitely involve a lot of backtracking and probably take forever on some edge case. I think that's likely what happens, can't you find the document it happens on?
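
To illustrate what heavy backtracking looks like in general, here is a small sketch using a textbook pattern with nested quantifiers. This is not the pattern used by the tagger; it only shows how matching time can grow very quickly with input length when a match fails late. The function name `match_seconds` is made up for the example.

```python
import re
import time

# Textbook example of nested quantifiers; NOT the tagger's pattern.
NESTED = re.compile(r"(a+)+$")


def match_seconds(n: int) -> float:
    """Time a failing match against n 'a' characters followed by one 'b'."""
    text = "a" * n + "b"  # the trailing "b" forces the overall match to fail
    start = time.perf_counter()
    result = NESTED.match(text)
    elapsed = time.perf_counter() - start
    assert result is None
    return elapsed


# Keep n small: every extra character roughly doubles the work.
for n in (8, 12, 16, 20):
    print(f"n={n:2d}: {match_seconds(n):.6f} seconds")
```

A document that happens to hit such a case would look exactly like a tagger that never finishes while keeping a core busy.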

peterbjorgensen commented 9 months ago

> The regex isn't really that simple, it looks like it could definitely involve a lot of backtracking and probably take forever on some edge case. I think that's likely what happens, can't you find the document it happens on?

It looks like long sequences of emojis stall the tagger forever:

InputSpec(id='7', text='😠 😡', source='hplt1.2', version=None) took 0.000039 seconds

InputSpec(id='4', text='😠 😡 😤 😋 😎 🌦 🌧 🌜 🌈 🏝 🎅\n\nAnti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None) took 0.000025 seconds

InputSpec(id='11', text='😠 😡 😤 😋 😎 😴 😈 😇 😕 😏 😑 👲 👮 💂 👶 ❤ 💔 💕 💘 💌 💋 🎁 💰 💍 👍 👎 👌 ✌️ 🤘 👏 🎵 ☕️ 🍵 Anti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None) took 0.000021 seconds

InputSpec(id='5', text='😠 😡 😤 😋 😎 😴 😈 😇 😕 😏 😑 👲 👮 💂 👶 ❤ 💔 💕 💘 💌 💋 🎁 💰 💍 👍 👎 👌 ✌️ 🤘 👏 🎵 ☕️ 🍵 🍺 🍷 🍼 ☀️ 🌤 🌦 🌧 🌜 🌈 🏝 🎅\n\nAnti-Spam: *\nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None) took 64.204857 seconds

InputSpec(id='3', text='\nGæstebogs indlæg: \n😄 😃 😊 😉 😍 😚 😗 😜 😛 😳 😁 😬 😌 😞 😢 😂 😭 😅 😓 😩 😮 😱 😠 😡 😤 😋 😎 😴 😈 😇 😕 😏 😑 👲 👮 💂 👶 ❤ 💔 💕 💘 💌 💋 🎁 💰 💍 👍 👎 👌 ✌️ 🤘 👏 🎵 ☕️ 🍵 🍺 🍷 🍼 ☀️ 🌤 🌦 🌧 🌜 🌈 🏝 🎅\n\nAnti-Spam: \nSpørgsmål: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None) ... takes 'forever'

KennethEnevoldsen commented 9 months ago

Hmm, that is odd. A solution might be to wrap it in a timeout:

https://stackoverflow.com/questions/492519/timeout-on-a-function-call

then simply have it return NaN?
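
A minimal sketch of that idea with only the standard library, assuming a Unix system and that the call runs in the main thread (both are requirements of `signal.setitimer`). The names `call_with_timeout`, `slow_score` and `fast_score` are hypothetical. Whether the alarm can interrupt a long-running call inside a C extension such as a regex engine depends on that code returning control to the interpreter periodically, which is not verified here.

```python
import math
import signal
import time
from typing import Callable


class _Timeout(Exception):
    """Raised in the main thread when the timer fires."""


def _raise_timeout(signum, frame):
    raise _Timeout()


def call_with_timeout(func: Callable[[str], float], text: str, seconds: float) -> float:
    """Return func(text), or NaN if the call runs longer than `seconds`."""
    previous = signal.signal(signal.SIGALRM, _raise_timeout)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        return func(text)
    except _Timeout:
        return math.nan
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending timer
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


def slow_score(text: str) -> float:
    """Hypothetical stand-in for a tagger that hangs on this input."""
    time.sleep(5)
    return 1.0


def fast_score(text: str) -> float:
    """Hypothetical stand-in for a tagger that returns immediately."""
    return float(len(text))


print(call_with_timeout(fast_score, "abc", seconds=1.0))
print(call_with_timeout(slow_score, "abc", seconds=0.2))
```

Returning NaN keeps the run going, but it also hides the problematic documents, so logging their ids alongside would probably be worthwhile.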

peterbjorgensen commented 9 months ago

I think it would be better to either fix this tagger or add one that works. I don't understand why it is so slow. It basically checks whether the string has any alphanumeric character in it and, if not, checks whether all the characters are punctuation. It seems like a bug in the regex implementation. I am not sure why they use a Python package called regex instead of re from the standard library. If I swap the regex package for the standard re module, it works with no problems. I will report the bug upstream.
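
For comparison, the check described above can also be written without any regular expression, which rules out backtracking entirely. This is a sketch of the described behaviour only, not the dolma implementation: the function names and the three labels are made up, and counting Unicode symbols and marks (which cover most emoji and their variation selectors) together with punctuation is an assumption.

```python
import unicodedata


def has_alphanumeric(text: str) -> bool:
    """True if at least one character is a letter or a digit."""
    return any(ch.isalnum() for ch in text)


def is_symbols_only(text: str) -> bool:
    """True if every non-space character is punctuation, a symbol, or a mark."""
    visible = [ch for ch in text if not ch.isspace()]
    # Major Unicode categories: P = punctuation, S = symbol (most emoji),
    # M = combining marks and variation selectors attached to emoji.
    return bool(visible) and all(
        unicodedata.category(ch)[0] in "PSM" for ch in visible
    )


def classify_paragraph(text: str) -> str:
    """Label a paragraph; each character is inspected at most twice."""
    if has_alphanumeric(text):
        return "has_alphanumeric"
    if is_symbols_only(text):
        return "symbols_only"
    return "other"


print(classify_paragraph("Hvad er summen af (total sum of) 9+3"))
print(classify_paragraph("😠 😡 😤 😋 😎"))
print(classify_paragraph("   "))
```

Both passes are linear in the length of the text, so a long run of emojis costs no more than a long run of letters.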