OpenSimulationInterface / open-simulation-interface

A generic interface for the environmental perception of automated driving functions in virtual scenarios.

Add pyspelling html filter #795

Closed thomassedlmayer closed 3 months ago

thomassedlmayer commented 3 months ago

I noticed that adding new hyperlinks to the proto files requires adding every character sequence from the hyperlink that the spellchecker does not know as an exception (see #790). I found that it is possible to add an HTML filter that excludes such character sequences.

Here I added an example <a> tag with a hyperlink containing strings that should make the spellchecker fail. With the new HTML filter, the hyperlink no longer triggers the failure, but now it fails because it now detects a lot more strings that contain mutated vowels which were not detected before. I guess the HTML filter converts those special characters so that the spellchecker can also check those words.
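To illustrate the guess above, here is a minimal sketch using only Python's standard `html.parser` module. It is not how pyspelling implements its filter. It assumes that the umlauts are written as HTML entities such as `&auml;`, and the sample line and URL are made up. It shows the two effects described: attribute values like the `href` never reach the checker, while entities are decoded into whole words.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect only text content; tag names and attribute values are dropped."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &auml; into characters
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(markup):
    """Return the text a spellchecker would see after HTML filtering."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return "".join(parser.chunks)


# Hypothetical comment line; the URL is invented for illustration.
sample = 'See <a href="https://example.org/xyzzy/qwrtz">Parkfl&auml;chen</a> for details.'
print(visible_text(sample))
```

With a filter like this, the link target is never spellchecked, but "Parkflächen" now arrives as a complete German word that the English dictionary does not know.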

@ClemensLinnhoff Could you please check whether this type of filter breaks anything? Was there any reason why this was not added before? If it works, we should probably add the "new" words with mutated vowels as exceptions and remove any exceptions that were previously added only because they appear inside hyperlinks (or are HTML tag/attribute names like cellspacing, href, colspan etc.). Also, wouldn't it make sense to add German as secondary language so that we don't have to add every German word to the exception list?

ClemensLinnhoff commented 3 months ago

Was there any reason why this was not added before?

I was not aware of this filter functionality. So good catch!

Also, wouldn't it make sense to add German as secondary language so that we don't have to add every German word to the exception list?

I agree, that would make sense for the current OSI definitions. But if I recall correctly, I tried this and found that aspell does not support multiple languages. However, there seem to be some workarounds. On the other hand, I would generally question why there are links in a non-English language in this international standard. If we added French, Spanish and Chinese traffic signs to OSI as well, spell checking in all these languages might get messy.

ClemensLinnhoff commented 3 months ago

but now it fails because it now detects a lot more strings that contain mutated vowels which were not detected before.

Seems like it. So either we add all these words with umlauts to spelling_custom_words_en_US.txt, or we need to figure out how to check multiple languages. I would opt to add them to the custom words and avoid using non-English words in the future.

thomassedlmayer commented 3 months ago

I see. If there is no straightforward way of adding another language to the checker, let's maybe postpone this for now.

I added the necessary exceptions to the whitelist so that the spellchecker no longer fails. There may still be some words in the whitelist that are not required with the HTML filter in place, such as HTML tags. We may be able to get rid of those as well.
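One possible way to find such leftovers is sketched below. It assumes the filtered text that the spellchecker actually sees is available as a string. The function name and the sample data are hypothetical and not part of the OSI tooling. An entry is only flagged when it never occurs as a word in the checked text.

```python
import re


def unused_entries(whitelist, checked_text):
    """Return whitelist entries that never occur as a word in the checked text."""
    # [^\W\d_]+ matches runs of letters, including non-ASCII ones such as ä
    seen = {word.lower() for word in re.findall(r"[^\W\d_]+", checked_text)}
    return [entry for entry in whitelist if entry.lower() not in seen]


# Hypothetical data: the text after HTML filtering and a few whitelist entries.
checked_text = "Parkflächen are marked, see StVO."
whitelist = ["href", "colspan", "Parkflächen", "StVO"]
print(unused_entries(whitelist, checked_text))
```

Entries reported here are only candidates for removal; they would still need a check against every file the spellchecker covers.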

ClemensLinnhoff commented 3 months ago

Yes, there are also a lot of half words, because umlauts were not recognized before, e.g. Parkfl for Parkflächen, Parkst for Parkstände, Rei for Reißverschluss etc. We might be able to get rid of those as well.
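The half words can be modeled with a small sketch. It assumes, purely for illustration, that the old setup split words at every character that is not an ASCII letter; this is a simplified model, not aspell's actual tokenizer. The helper names and the sample whitelist are hypothetical.

```python
import re


def ascii_fragments(word):
    """Split a word at every character that is not an ASCII letter."""
    return re.findall(r"[A-Za-z]+", word)


def half_word_entries(whitelist, full_words):
    """Return whitelist entries that are mere fragments of the given full words."""
    fragments = set()
    for word in full_words:
        parts = ascii_fragments(word)
        if parts != [word]:  # the word contains a non-ASCII letter
            fragments.update(parts)
    return [entry for entry in whitelist if entry in fragments]


full_words = ["Parkflächen", "Parkstände", "Reißverschluss"]
whitelist = ["Parkfl", "Parkst", "Rei", "href", "Parkflächen"]
print(half_word_entries(whitelist, full_words))
```

Under this model, "Parkfl", "Parkst" and "Rei" are identified as fragments, while the full word "Parkflächen" stays in the list.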