Helsinki-NLP / OpusFilter

OpusFilter - Parallel corpus processing toolkit
MIT License

Option to keep blank lines #4

Closed jbrry closed 3 years ago

jbrry commented 3 years ago

Hi there,

I am using OpusFilter on the nlingual-rebase branch to train a monolingual BERT model. In some of my corpora, there are empty lines which denote a document boundary, e.g. an empty line between two Wikipedia articles.

In the BERT README they mention:

"The input is a plain text file, with one sentence per line. (It is important that these be actual sentences for the "next sentence prediction" task). Documents are delimited by empty lines."

So I want to keep these empty lines where possible, so that BERT knows where a document ends for its next-sentence-prediction (NSP) task, where a randomly sampled document is used as a negative example.

I'm wondering whether it would be possible to add a feature to OpusFilter that lets the user keep empty lines. With my current configuration below, empty lines are removed from the example.txt file.

example.txt

common:
  output_directory: tests/data

steps:
  - type: filter
    parameters:
      inputs: [example.txt]
      outputs: [example-filtered.txt]
      filters:
        - LengthFilter:
            unit: word
            min_length: 1
            max_length: 100

        - LongWordFilter:
            threshold: 40

        - HtmlTagFilter: {}

        - CharacterScoreFilter:
            scripts: [Latin]
            thresholds: [0.5]

        - LanguageIDFilter:
            name: langid
            id_method: langid
            languages: [ga]
            thresholds: [0.5]

        - LanguageIDFilter:
            name: cld2
            id_method: cld2
            languages: [ga]
            thresholds: [0.5]

svirpioj commented 3 years ago

Thanks for the suggestion! This does sound like a useful feature.

I considered what would be the best way to implement it. A global pass_empty setting for the filter command sounds best to me, but unfortunately that is difficult to do with the current implementation, which is based on iterators and generators.

The second option is to fix all individual filters either to pass the blank lines through or to add an option for that. I went through the simple filters implemented in the filters module, and noticed that LengthRatioFilter and LanguageIDFilter actually didn't return very sensible results on blank lines. I fixed those and also added a pass_empty option for LengthFilter. With these changes, I think your example should work.
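As a rough illustration of the pass_empty idea, here is a minimal standalone sketch. It is not OpusFilter's actual code: the function name `length_filter` and its signature are hypothetical, and only the option name pass_empty and the word-based length limits come from this thread.

```python
def length_filter(lines, min_length=1, max_length=100, pass_empty=False):
    """Yield lines whose word count lies in [min_length, max_length].

    Hypothetical stand-in for a word-based length filter. With
    pass_empty=True, blank lines are yielded unchanged, so document
    boundaries survive the filtering step.
    """
    for line in lines:
        text = line.rstrip("\n")
        if pass_empty and not text.strip():
            # Blank line: keep it as a document delimiter.
            yield line
            continue
        n_words = len(text.split())
        if min_length <= n_words <= max_length:
            yield line


corpus = [
    "First sentence.\n",
    "Second sentence.\n",
    "\n",  # document boundary
    "New document starts here.\n",
]

kept_default = list(length_filter(corpus))
kept_with_blank = list(length_filter(corpus, pass_empty=True))
```

With the default setting, the blank line has zero words and fails the min_length check, so it is dropped. With pass_empty=True it is kept and the document boundary is preserved.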

However, this is not completely solved yet. I haven't yet looked at, e.g., the behavior of the language model and alignment filters on empty data.

jbrry commented 3 years ago

Hi Sami,

Thank you for your prompt response and changes. You're right, the blank lines are now retained with my example file and config, which is very helpful, thanks!

No worries that the language model and alignment filters do not support this yet. The above changes should be ok for my needs so there's no rush with this from my end but I will leave it to you to decide if you want to keep the issue open until they are changed.

svirpioj commented 3 years ago

I added a score_for_empty option for CrossEntropyFilter, CrossEntropyDifferenceFilter, and WordAlignFilter. When it is set to a value lower/higher than the thresholds, empty lines are always passed/rejected, respectively. For WordAlignFilter, I noticed that eflomal actually fails on empty input, so I needed to set a default value.
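The effect of such a fixed score can be sketched as follows. This is a simplified standalone illustration, not the real filter code: `accept` and `toy_score` are hypothetical names, and the sketch assumes a "lower score is better" rule where a segment is accepted if its score is below the threshold.

```python
def toy_score(segment):
    """Hypothetical scorer that, like an external tool might, fails on empty input."""
    words = segment.split()
    if not words:
        raise ValueError("cannot score empty input")
    return 100.0 / len(words)


def accept(segment, threshold, score_for_empty=None, score_fn=toy_score):
    """Accept a segment if its score is below the threshold.

    If the segment is empty and score_for_empty is given, that fixed
    value is used instead of calling the scorer.
    """
    if not segment.strip() and score_for_empty is not None:
        score = score_for_empty
    else:
        score = score_fn(segment)
    return score < threshold


# A fixed score below the threshold always passes empty lines ...
empty_passed = accept("", threshold=50.0, score_for_empty=0.0)
# ... and a fixed score above the threshold always rejects them.
empty_rejected = accept("", threshold=50.0, score_for_empty=1000.0)
# Non-empty segments are scored as usual.
normal_passed = accept("a b c d", threshold=50.0)
```

Because the fixed score bypasses the scorer entirely, this also avoids the failure on empty input, which is the same reason a default value is needed when the underlying tool cannot handle empty lines.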

As far as I can see, keeping blank lines should now be possible for all the implemented filters.