boun-tabi-LMG / turkish-academic-text-harvest

MIT License

Perform correction before/after captions/tables/figures to drop kept lines #7

Closed by gokceuludogan 11 months ago

gokceuludogan commented 1 year ago

The correction step in the script should take the surrounding lines into account. This adjustment will eliminate lines that are kept even though they sit between lines intended to be dropped, resulting in more accurate filtering.

gokceuludogan commented 1 year ago

This can be solved by perplexity filtering.

gokceuludogan commented 11 months ago

The mark_items() function was introduced in f2c10d2 to detect captions and correct the preceding or following lines, taking into account the surrounding lines, their average token length, and their digit ratio. Note that this function significantly increases the time required to process a document, because it processes the lines with a sliding window to correct the anomalies.
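
For readers who have not looked at the commit, here is a rough sketch of what a sliding-window correction of this kind could look like. It is not the actual mark_items() implementation: the function names (`correct_labels`, `digit_ratio`, `avg_token_length`), the window size, and both thresholds are assumptions made up for illustration. A kept line is flipped to dropped only when most of its neighbours are already dropped and the line itself looks table-like (short tokens or many digits).

```python
from typing import List


def digit_ratio(line: str) -> float:
    """Share of non-whitespace characters that are digits."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isdigit() for c in chars) / len(chars)


def avg_token_length(line: str) -> float:
    """Mean length of whitespace-separated tokens."""
    tokens = line.split()
    if not tokens:
        return 0.0
    return sum(len(t) for t in tokens) / len(tokens)


def correct_labels(
    lines: List[str],
    keep: List[bool],
    window: int = 2,                 # hypothetical default
    max_avg_token_len: float = 4.0,  # hypothetical threshold
    min_digit_ratio: float = 0.3,    # hypothetical threshold
) -> List[bool]:
    """Return a copy of `keep` where isolated kept lines are dropped.

    Decisions are based on the original labels, so one correction
    does not cascade into the next window.
    """
    corrected = list(keep)
    for i, kept in enumerate(keep):
        if not kept:
            continue
        lo = max(0, i - window)
        hi = min(len(lines), i + window + 1)
        neighbours = [j for j in range(lo, hi) if j != i]
        if not neighbours:
            continue
        dropped = sum(not keep[j] for j in neighbours)
        # Only consider lines whose neighbours are mostly dropped.
        if dropped * 2 <= len(neighbours):
            continue
        looks_tabular = (
            avg_token_length(lines[i]) <= max_avg_token_len
            or digit_ratio(lines[i]) >= min_digit_ratio
        )
        if looks_tabular:
            corrected[i] = False
    return corrected
```

Each line is compared against up to 2 × `window` neighbours, so the cost grows with both document length and window size, which is consistent with the slowdown noted above.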