biolab / text-semantics

A package of scripts for the semantic analyser project
MIT License

Comparison of preprocessing results #15

ajdapretnar closed 3 years ago

ajdapretnar commented 3 years ago

A script comparing results from the original and the preprocessed documents.

  1. I didn't know where to put the script, so I made it 01-03, but it could probably be part of 02-01.
  2. We could also think about removing certain words, such as "odstavek" (paragraph) and "zakon" (law). These are usually not part of meaningful sentences; they just signal legal language.
  3. Should we remove everything in parentheses?

PrimozGodec commented 3 years ago

  1. It should definitely be 02_something. Maybe it can be 02_02, and we rename the other notebooks.
  2. I agree.
  3. There is probably no important text in parentheses, but I am not sure about that. Maybe we can decide once we use it in practice (predicting, document map) and test what works better.

ajdapretnar commented 3 years ago

  1. Here are my suggestions for renaming the scripts: 03 is visualizations, 04 is extracting interesting words.
  2. I've added stopwords and scripts in #17.
  3. Will do it at a later point if necessary.
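
For anyone revisiting this thread, here is a minimal sketch of the two ideas discussed above: dropping domain-specific words and optionally removing parenthesised text, then comparing token counts before and after. It is not the script from this repository. The names `LEGAL_STOPWORDS`, `strip_parentheses`, `preprocess` and `compare` are hypothetical, the tokenizer is a plain regex, and the word list contains only the two examples mentioned in the discussion.

```python
import re
from collections import Counter

# Hypothetical domain stopwords: only the two examples named in the thread.
LEGAL_STOPWORDS = {"odstavek", "zakon"}

# Matches one innermost parenthesised span (no nested parentheses inside it).
PAREN_RE = re.compile(r"\([^()]*\)")


def strip_parentheses(text):
    """Remove parenthesised spans, innermost first, so nesting is handled."""
    previous = None
    while previous != text:
        previous = text
        text = PAREN_RE.sub(" ", text)
    return text


def tokenize(text):
    """Lowercase and split into word tokens (a stand-in for a real tokenizer)."""
    return re.findall(r"\w+", text.lower())


def preprocess(text, drop_parentheses=True, stopwords=LEGAL_STOPWORDS):
    """Optionally drop parenthesised text, then filter out the stopwords."""
    if drop_parentheses:
        text = strip_parentheses(text)
    return [token for token in tokenize(text) if token not in stopwords]


def compare(text, **kwargs):
    """Summarise what preprocessing removed from a single document."""
    original = Counter(tokenize(text))
    processed = Counter(preprocess(text, **kwargs))
    return {
        "original_tokens": sum(original.values()),
        "processed_tokens": sum(processed.values()),
        "removed": dict(original - processed),
    }
```

Calling `compare(document)` and `compare(document, drop_parentheses=False)` on the same text shows how much the parentheses rule alone removes, which is one way to test what works better in practice. The stopword match is exact, so inflected forms would only be caught if the text is lemmatized first.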