MichaelKohler closed this 4 years ago
Thanks @MichaelKohler for checking. BBC, BKV, BL are well-known abbreviations used as is, even spoken. Some of the others are also popular. There are a few which are not known and could be filtered. B- is not a common prefix; it should be filtered, as should all the sentences you highlighted at the end. There is something missing from those sentences. It looks like something has been removed after the first word, the "A", which is an article in Hungarian. Is there a way to figure out what caused the incorrect sentences in the second case?
> BBC, BKV, BL are well-known abbreviations used as is, even spoken. Some of the others are also popular. There are a few which are not known and could be filtered.
While most people will know them and know how to pronounce them, some might not. That's why we decided quite early on not to allow any abbreviations, so let's remove all abbreviations like that.
> It looks like something has been removed after the first word, the A which is an article in Hungarian. Is there a way to figure out what caused the incorrect sentences in the second case?
You could download the dump at https://dumps.wikimedia.org/huwiki/latest/huwiki-latest-pages-articles.xml.bz2 and search for the sentence in there. I'm fairly sure that it's some kind of Wikipedia formatting syntax that gets stripped out by WikiExtractor; I've encountered these cases before. That's also one of the reasons why we allow a certain error rate. I'd say let's see if there is a regex that filters out some of them, something like `[A-Z]+-|\s-`, maybe?
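To get a feel for what the suggested pattern would catch before adding it as a rule, here is a minimal Python sketch. It only applies the regex quoted above to plain strings; the helper names (`is_suspect`, `filter_sentences`) and the sample strings are made up for illustration and are not part of the extractor or taken from the export.

```python
import re

# Pattern quoted from the comment above: one or more ASCII uppercase letters
# directly followed by a hyphen, or a hyphen directly after whitespace.
SUSPECT = re.compile(r"[A-Z]+-|\s-")


def is_suspect(sentence: str) -> bool:
    """Return True if the pattern matches anywhere in the sentence."""
    return SUSPECT.search(sentence) is not None


def filter_sentences(sentences):
    """Keep only the sentences the pattern does not flag."""
    return [s for s in sentences if not is_suspect(s)]


if __name__ == "__main__":
    # Made-up sample strings, not real export data.
    samples = ["A B- sorozat", "szavak -ban", "Ez egy mondat."]
    for sample in samples:
        print(is_suspect(sample), repr(sample))
```

Two things to keep in mind when trying it out: `[A-Z]` only covers ASCII capitals, so accented capitals such as Á or É before a hyphen are not matched, and `\s-` also flags sentences that use a hyphen with spaces around it as a dash.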
Overall these cases are probably few compared to all sentences, so we might just ignore more complex things like "A A ..". I'm not sure how much sense it makes to add that specifically to the regex, and I'm fine with not doing that.
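For tracing a broken sentence back to its source in the dump, a rough Python sketch could look like the following. It assumes the dump linked above has already been downloaded locally; the function name `find_in_dump` and the example path and search string are hypothetical. It streams the compressed file line by line, so the dump does not need to be unpacked first.

```python
import bz2


def find_in_dump(dump_path, needle, limit=5):
    """Return up to `limit` (line_number, line) pairs that contain `needle`."""
    hits = []
    # Text mode decompresses on the fly and yields one line at a time.
    with bz2.open(dump_path, mode="rt", encoding="utf-8") as dump:
        for line_number, line in enumerate(dump, start=1):
            if needle in line:
                hits.append((line_number, line.rstrip("\n")))
                if len(hits) >= limit:
                    break
    return hits


if __name__ == "__main__":
    # Hypothetical local path and search fragment; adjust both as needed.
    for number, text in find_in_dump("huwiki-latest-pages-articles.xml.bz2", "some distinctive words"):
        print(number, text)
```

Searching for a short, distinctive fragment from the end of the sentence is likely to work better than searching for the whole sentence, since the raw wikitext still contains whatever markup was stripped out.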
Thank you!
@djlancelot do you need any help here?
@djlancelot I've noticed quite a few abbreviations in the final export for the Hungarian Wikipedia. Here are some examples from the diff I did:
Would you mind adding an additional rule for those? You probably can copy the rule from English, there should be one that disallows abbreviations like that.
I've also noticed some other sentences, can you confirm that those are correct sentences?
Is "B-" a common prefix for words? And are the dashes at the beginning of the words OK? (Sorry, I have absolutely no idea how Hungarian works. These might also just have some specific formatting that gets stripped out by WikiExtractor.)
Thanks!