Closed nataliashitova closed 5 years ago
Acceptance 🚧
I have some trouble knowing what specifically needs to get tested here.
When I have a text of 20 lines with: hello world
, I get 3 relevant word entries: hello
, world
and hello world
. Just like before.
However, when I set the keyphrase (this is the same with any other paper attributes -- except title, see below) to hello world
I would've expected to count towards all of them. But it adds two more entries for only hello
and world
. You told me this will be added in the next PR. Still, I would expect it to count towards the whole keyphrase hello world
too.
Is this on purpose?
A mismatch between changelog and testing: The SEO title is not used to determine relevant words.
Adding <h2>hello world</h2>
to the content adds a count to all the non-special entries. And includes the 2 special entries for 5 occurences. Therefor, it is counted twice. Should this be?
With the above situation the density
seems influenced by the <h2></h2>
part, since it is lower than just adding hello world
as keyphrase for example. 0.119
instead of the expected 0.125
. Thinking of this, the actual expected is 0.5
for the keyphrase and others, right?
Adding hello world
to all the attributes does not alter the amount of occurrences. Is this on purpose?
@igorschoester Thanks for your great questions!
- When I have a text of 20 lines with:
hello world
, I get 3 relevant word entries:hello
,world
andhello world
. Just like before. However, when I set the keyphrase (this is the same with any other paper attributes -- except title, see below) tohello world
I would've expected to count towards all of them. But it adds two more entries for onlyhello
andworld
. You told me this will be added in the next PR. Still, I would expect it to count towards the whole keyphrasehello world
too. Is this on purpose?
Yes. It did not add any more-than-one-word-long word combinations from the paper attributes, because they will soon be deprecated anyway.
- A mismatch between changelog and testing: The SEO title is not used to determine relevant words.
My bad 🤦♀️
- Adding
<h2>hello world</h2>
to the content adds a count to all the non-special entries. And includes the 2 special entries for 5 occurences. Therefor, it is counted twice. Should this be?
Yes, for now. I will take care of this inconsistency in one of the later PRs that will merge same-word occurrences in the list of relevant words.
- With the above situation the
density
seems influenced by the<h2></h2>
part, since it is lower than just addinghello world
as keyphrase for example.0.119
instead of the expected0.125
. Thinking of this, the actual expected is0.5
for the keyphrase and others, right?
Just as with the one above, I will solve this in the next PR. In general, density is of no importance per se, but it serves a proxy of importance. Therefore, as long as we make sure the right words
are being saved in the database, the theoretical impeccability of our ways of calculating the density of relevant words is not important.
- Adding
hello world
to all the attributes does not alter the amount of occurrences. Is this on purpose?
Right now I am saving all words from attributes with the number of occurrences = 5, which was chosen pretty much at random. The next PR with fine-tun the mechanism counting occurrences in text and attributes, as well as selecting good words to be saved in the database.
@hansjovis I agree that we should not allow duplicates in the list of relevant words. It makes a lot of sense to take care of such potential duplicates together with collapsing relevant words based on the stemming mechanism (which was implemented in #2150). What I want to do is:
Obviously, duplicates will have the same stem and would be collapsed by this stemming merging mechanism for free.
To reconcile both parties I created a feature branch for internal linking: This will allow to keep development in small chunks and not decrease the quality of the current suggestions. What do you think?
- A mismatch between changelog and testing: The SEO title is not used to determine relevant words.
Acceptance on this 👍
Unfortunately, the devtool now compiles with the following warning:
./src/components/RelevantWords.js
Line 75: Function 'getRelevantWords' has a complexity of 7 complexity
Search for the keywords to learn more about each warning.
To ignore, add // eslint-disable-next-line to the line before.
After a small discussion, it seems best to disable this rule for the example.
Merging on the premise that the above issues will be taken into consideration in the feature branch.
Summary
This PR can be summarized in the following changelog entry:
Relevant technical choices:
Test instructions
This PR can be tested by following these steps:
example
to check if words from paper attributes are added to the list of relevant words.Fixes #2114