ICIJ / datashare

A self-hosted search engine for documents.
https://datashare.icij.org
GNU Affero General Public License v3.0

fix: remove artificial line break for `.eml` NER #1313

Open ClemDoum opened 6 months ago

ClemDoum commented 6 months ago

The current Tika pipeline keeps the line breaks that email servers add to fit the RFC line length limits (78 characters recommended, 998 maximum). Ideally, emails in DS should display without these artificial line breaks. At a minimum, the NER should strip them before processing the text.

github-actions[bot] commented 5 months ago

This issue is stale because it has been open for 40 days with no activity.

ClemDoum commented 2 months ago

Waiting for the spaCy NER to be implemented first, since it will allow faster prototyping and easier text processing: https://github.com/ICIJ/datashare/issues/1452
