Is there somewhere that clearly explains the process used to create the vector for each sentence?
Is any preprocessing applied? E.g. are stop words removed?
Are the sentences treated as just a bag of words?
Is the structure of the sentence considered?
Is TF-IDF applied to weight words prior to vectorisation?
Are distances between "important" words considered?
Is any of this customisable?
Is the expectation that we pre-process and manipulate our corpuses first?
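To make the questions above concrete, here is a rough sketch of the kind of pre-processing I have in mind if we are expected to do it ourselves. It is plain Python and does not reflect how this library actually works; the stop-word list and the helper names `tokenize` and `tfidf_vectors` are made up for illustration. It removes stop words, treats each sentence as a bag of words (word order is discarded) and applies a smoothed TF-IDF weighting.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not taken from any library).
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def tokenize(sentence):
    """Lowercase, keep runs of letters, and drop stop words."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(sentences):
    """Return (vocabulary, one TF-IDF bag-of-words vector per sentence)."""
    docs = [tokenize(s) for s in sentences]
    vocab = sorted({t for doc in docs for t in doc})
    n = len(docs)
    # Document frequency: number of sentences that contain each term.
    df = Counter(t for doc in docs for t in set(doc))
    # Smoothed IDF, so a term found in every sentence still gets a non-zero weight.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Raw term count times IDF; position in the sentence plays no role.
        vectors.append([counts[t] * idf[t] for t in vocab])
    return vocab, vectors


vocab, vectors = tfidf_vectors(["The cat sat on the mat", "The dog sat"])
```

With this approach "The dog bit the man" and "The man bit the dog" get identical vectors, which is why I am asking whether sentence structure is considered anywhere.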
I would be particularly interested in mechanisms that could be applied to determine the nature of a text. For example, three documents may be about the same subject matter, but one is factual, one is fictional and another is opinion. I think this would require the structure of the sentence to be considered as well as the words themselves.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.