MusicConnectionMachine / UnstructuredData

In this project we will be scanning unstructured online resources such as the Common Crawl data set.

Expand blacklist and gather heuristic statistics #196

Closed · felixschorer closed 7 years ago

felixschorer commented 7 years ago

NOTE: Add an extra column to the .csv file containing the matched terms when sharing them here. (#198)

felixschorer commented 7 years ago
SELECT "c"."id", "c"."occurrences", "c"."entityId", "w"."url", "c"."websiteId", "w"."blob_url" 
FROM public.contains as "c", public.websites as "w" 
WHERE "w"."id" = "c"."websiteId" 
GROUP BY "w"."id", "c"."id" 
ORDER BY "w"."url", "c"."occurrences"

I just ran this query to give a more in-depth look at the test results from last night. But pgAdmin refuses to save the output as a .csv file...

nbasargin commented 7 years ago

Here are the query results:

Don't use pgAdmin 4. It sucks. pgAdmin 3 is much better.

felixschorer commented 7 years ago

Thanks, @nyxathid. So these should be the results for the first of the 66,500 WET files from CC-MAIN-2017-13, with a heuristic threshold of 4 and the blacklist from e7cfc46ede4b8277957b50fe14df8a607ae5dbee in place.

Just to recap the heuristic, a threshold of 4 means:

@goldbergtatyana

felixschorer commented 7 years ago

query.xlsx

That should be much more serviceable 😃

EDIT: I've just noticed that Excel is having problems displaying accented letters properly...

pfent commented 7 years ago

@felixschorer I'd recommend using LibreOffice when importing/exporting CSV data. Excel usually guesses the encoding/escaping/separators based on region and/or language (which can be quite a PITA), whereas LO just asks.
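
If you do have to stick with Excel, prepending a UTF-8 byte order mark usually makes it detect the encoding correctly. A minimal Node/TypeScript sketch (the helper name and row layout are made up, not from the repo):

import { writeFileSync } from "fs";

// Quote each cell, join rows with CRLF, and prepend "\uFEFF" (the BOM)
// so Excel reads the file as UTF-8 instead of the local ANSI codepage.
function writeCsvForExcel(path: string, rows: string[][]): void {
    const body = rows
        .map(row => row.map(cell => `"${cell.replace(/"/g, '""')}"`).join(","))
        .join("\r\n");
    writeFileSync(path, "\uFEFF" + body, { encoding: "utf8" });
}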

felixschorer commented 7 years ago

Thanks @pfent! query.xlsx

felixschorer commented 7 years ago

Ok, I'll do some testing now. I'll test heuristic threshold levels from 2 to 6 and will post the results here.

felixschorer commented 7 years ago

Ok, so here are my results from running our app 9 times over the first four files of CC-MAIN-2017-13. The file name tells you which heuristic threshold has been used.

CSV: run-r0-4.zip XLSX: run-r0-4-xlsx.zip

@goldbergtatyana

felixschorer commented 7 years ago

[Per-threshold breakdown for thresholds 2 to 6 did not survive archiving]

What I found is that with higher thresholds the quality of our matched sites increases. BUT the higher the threshold, the more the heuristic favours simple lists of names without much information about them.

We should probably find a happy medium between the two. Simple lists won't help the relationship group at all, and neither will webpages with almost no matched terms.

I'd say 4 or 5 would be perfect.

felixschorer commented 7 years ago

Here are some more files for thresholds of 4 and 6 joined with artists. run2-r0-4.zip

goldbergtatyana commented 7 years ago

@felixschorer and I came up with a quick and effective solution for filtering websites.

We base our statistics on the terms "mozart" (most popular term), "brahms" (we want to release MCM in time for his birthday) and "lullaby" (a music piece name @vviro gave us).

We estimate the number of websites that mention these terms at each threshold.

Then we count how many of these websites contain "music|sound|classic|opera|notes" in their URL.

The following results are based on four WET files:

At threshold 4: mozart 49/143 = 34% (share of matched URLs containing "music|sound|classic|opera|notes"), brahms 37/138 = 27%, lullaby 6/55 = 11%

At threshold 5: mozart 34/86 = 40%, brahms 28/70 = 40%, lullaby 5/25 = 20%

At threshold 6: mozart 32/72 = 44%, brahms 26/36 = 72%, lullaby 5/22 = 23%

At threshold 7: mozart 24/50 = 48%, brahms 20/25 = 80%, lullaby 3/16 = 19%
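
For reference, the ratio itself is trivial to compute; a TypeScript sketch (the URL array is a stand-in for the actual query results):

// Share of matched URLs that contain one of the music-related keywords.
const MUSIC_URL = /music|sound|classic|opera|notes/i;

function urlHitRatio(matchedUrls: string[]): number {
    if (matchedUrls.length === 0) return 0;
    return matchedUrls.filter(url => MUSIC_URL.test(url)).length / matchedUrls.length;
}

// e.g. for "mozart" at threshold 4 this came out to 49 / 143 ≈ 34%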

goldbergtatyana commented 7 years ago

At threshold 8: mozart 17/35 = 49%, brahms 14/17 = 82%, lullaby 3/12 = 25%

So there is hardly any difference between thresholds 7 and 8. We also noticed that the URLs that remain after filtering at high thresholds are just lists of terms and thus useless (e.g. http://www.di-arezzo.co.uk/scores-of-Christine+McVie.html).

Therefore, to remove those lists, we will set an upper bound on the threshold and keep the websites in the range 2 (or 3) < x < 7.
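
In code, the banding would look roughly like this (the bounds are the values under discussion, not final settings):

// Keep a page only if its match count x lies strictly inside lower < x < upper.
const keepPage = (matchCount: number, lower = 2, upper = 7): boolean =>
    matchCount > lower && matchCount < upper;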

@felixschorer will finalize the threshold settings tomorrow (on Friday)

and Group 2 will be ready to start running its scripts right after that!!!

felixschorer commented 7 years ago

The upper limit for the heuristic has been implemented in #201. I'll do some more test runs regarding the threshold and upper limit. I'll test [threshold]:[limit] (inclusive:exclusive) combinations on the second set of 4 WET files:

Results:

felixschorer commented 7 years ago

I'll also do a quick test on WARC files, i.e. exact copies of the HTML content instead of only the extracted text. I've found this node package, which seems to ignore lists entirely.

EDIT: The results it produces are excellent. BUT it just crashed on me (maximum call stack size exceeded, duh...) and is super slow.

felixschorer commented 7 years ago

What if we build a module which reduces the WET file content based on line length?

  1. We make a list of all lines in the file and sort it by line length.
  2. We then calculate the average line length and remove lines one by one, starting with the shortest, until the average line length reaches a set minimum.
  3. We reassemble the remaining lines in their original order and run our filters over the result.

That should get rid of lists and shouldn't be expensive to do at all. A rough sketch of what I have in mind is below.
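
A minimal sketch of that idea in TypeScript (the function and parameter names are placeholders, not the actual module):

// Drop the shortest lines until the average length of what remains
// reaches minAvgLength, then restore the original line order.
function reduceByLineLength(content: string, minAvgLength: number): string {
    const lines = content.split("\n");
    // Remember each line's original position for reassembly.
    const indexed = lines.map((text, index) => ({ text, index }));
    // Shortest lines first, so they are the first to be dropped.
    indexed.sort((a, b) => a.text.length - b.text.length);

    let total = indexed.reduce((sum, line) => sum + line.text.length, 0);
    let dropped = 0;

    // Remove the shortest remaining line while the average is below the minimum.
    while (indexed.length - dropped > 1 &&
           total / (indexed.length - dropped) < minAvgLength) {
        total -= indexed[dropped].text.length;
        dropped++;
    }

    // Reassemble the surviving lines in their original order.
    return indexed.slice(dropped)
        .sort((a, b) => a.index - b.index)
        .map(line => line.text)
        .join("\n");
}

The sort dominates at O(n log n), and the drop loop is a single pass over the sorted list, so the reduction stays cheap even for large WET files.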

felixschorer commented 7 years ago

Ok, so I've implemented the module from my earlier post, which reduces WET file content down to the relevant content based on line length. #202

The results so far look really good. We're getting a lot of text now in the output WET files instead of lists.

These are the results from my tests so far:

I'll do some more tests on the WET files from yesterday, so we can compare the results. @goldbergtatyana

felixschorer commented 7 years ago

Ok, so here are the results of the latest run with thresholds ranging from 4 to 8. The tests were done on the same WET files as yesterday's, and the file content was reduced to an average line length of 100 characters.

Average line length of 100, thresholds of 4 to 8:

run5-r0-4.zip

Average line length of 200, thresholds of 4 to 8:

run5-al200-r0-4.zip

@goldbergtatyana

I have only found a single list at threshold 8 so far! 😄 The average line length is over 600 characters in the generated WET file! Compare that to the 40-ish characters per line we had before this change! I think we can easily increase the required average line length by a fair bit.

felixschorer commented 7 years ago

For the final run I suggest using both a threshold upper limit and an average line length of 200. I am trying to figure out the upper limit right now; 20 should be a good starting point based on my tests.

An upper limit below 20 would throw too many valuable sources away, and a limit of 20 is still low enough to eliminate sites like http://www.di-arezzo.co.uk.

EDIT: Even 30 is enough to keep that site out of our results; 40 is too high though.

felixschorer commented 7 years ago

Here are the results for my last post: run6-r0-4.zip

@goldbergtatyana

felixschorer commented 7 years ago

Optimal settings have been determined; this issue can be closed.