Closed — felixschorer closed this issue 7 years ago
```sql
SELECT "c"."id", "c"."occurrences", "c"."entityId", "w"."url", "c"."websiteId", "w"."blob_url"
FROM public.contains AS "c", public.websites AS "w"
WHERE "w"."id" = "c"."websiteId"
GROUP BY "w"."id", "c"."id"
ORDER BY "w"."url", "c"."occurrences"
```
I just ran this query to give a more in-depth look at the test results from yesterday night. But pgAdmin refuses to save it as a .csv file...
Here are the query results:
Don't use pgAdmin 4. It sucks. pgAdmin 3 is much better.
Thanks, @nyxathid. So this should be the results for the first of the 66500 WET files from CC-2017-13 with a heuristic threshold of 4 and the blacklist from e7cfc46ede4b8277957b50fe14df8a607ae5dbee in place.
Just to recap the heuristic, a threshold of 4 means:
@goldbergtatyana
query.xlsx That should be much more serviceable 😃
EDIT: I've just noticed that Excel is having problems displaying accented letters properly...
@felixschorer I'd recommend using LibreOffice when importing/exporting CSV data. Excel usually assumes encoding/escaping/separators by region and/or language (which can be quite a PITA), whereas LO just asks.
Thanks @pfent! query.xlsx
Ok, I'll do some testing now. I'll test heuristic threshold levels from 2 to 6 and will post the results here.
Ok, so here are my results from running our app 9 times over the first four files of CC-MAIN-2017-13. The file name tells you which heuristic threshold has been used.
run-t2-r0:4.csv -> threshold of 2
run-t3-r0:4.csv -> threshold of 3

CSV: run-r0-4.zip
XLSX: run-r0-4-xlsx.zip
@goldbergtatyana
Threshold of
What I found is that with higher thresholds the quality of our matched sites increases. BUT the higher the threshold, the more the heuristic favours simple lists of names without much information on them.
We should probably find a good middle ground between these two. Simple lists won't help the relationship group at all, and neither will webpages with almost no matched terms.
I'd say 4 or 5 would be perfect.
Here are some more files for thresholds of 4 and 6 joined with artists. run2-r0-4.zip
@felixschorer and I came up with a quick and effective solution for filtering websites.
We base our statistics on terms "mozart" (most popular term), "brahms" (we want to release MCM on time for his birthday) and "lullaby" (music piece name @vviro gave us).
We estimate the number of websites that mention these terms at each threshold.
Then we count how many of these websites contain "music|music|sound|classic|opera|notes" in their URL.
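The counting step above can be sketched like this (the row shape and function name are illustrative, not the project's actual code; the assumption is that each row is one website where a term was matched):

```javascript
// Estimate what share of matched URLs look music-related.
// `rows` is assumed to be the query output: one entry per website
// where the given term was matched, with the term and the site's URL.
const urlKeywords = /music|sound|classic|opera|notes/;

function urlHitRate(rows, term) {
  const matched = rows.filter(r => r.term === term);
  const hits = matched.filter(r => urlKeywords.test(r.url)).length;
  return { matched: matched.length, hits, rate: hits / matched.length };
}
```

Run per term and per threshold, this would produce ratios like the 49/143 for "mozart" below.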
The following results are based on four WET files.
Percentages are the share of URLs, among those where the term was matched, that contain "music|music|sound|classic|opera|notes":

| Threshold | mozart | brahms | lullaby |
| --- | --- | --- | --- |
| 4 | 49/143 = 34% | 37/138 = 27% | 6/55 = 11% |
| 5 | 34/86 = 40% | 28/70 = 40% | 5/25 = 20% |
| 6 | 32/72 = 44% | 26/36 = 72% | 5/22 = 23% |
| 7 | 24/50 = 48% | 20/25 = 80% | 3/16 = 19% |
| 8 | 17/35 = 49% | 14/17 = 82% | 3/12 = 25% |
So there is practically no difference between thresholds 7 and 8. We also noticed that the URLs that remain after filtering at high thresholds are just lists of terms and thus useless (e.g. http://www.di-arezzo.co.uk/scores-of-Christine+McVie.html).
Therefore, to remove those lists, we will set an upper boundary for the threshold and will look for websites in the range 2 (or 3) < x < 7.
@felixschorer will finalize the threshold settings tomorrow (on Friday)
and Group 2 will be ready to start running its scripts right after that!!!
An upper limit for the heuristic has been implemented in #201. I'll do some more test runs regarding the threshold and upper limit. I'll test [threshold]:[limit] (inclusive:exclusive) on the second set of 4 WET files:
Results:
run3-t4-7.tsv, for example, would be the data for 4:7 (3 < x < 7).
run3-t4.tsv would be the data for 4:infinity (3 < x < ∞).

I'll also do a quick test on WARC files, i.e. exact copies of the HTML content instead of only the extracted text. I've found this node package which seems to ignore lists entirely.
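As a minimal sketch, the inclusive:exclusive range the file names encode could be checked like this (assuming the score is the page's heuristic value; the function name is made up):

```javascript
// Inclusive lower bound, exclusive upper bound: 4:7 means 3 < x < 7.
// An omitted limit behaves like [threshold]:infinity.
function inHeuristicRange(score, threshold, limit = Infinity) {
  return score >= threshold && score < limit;
}
```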
EDIT: The results it produces are excellent. BUT it just crashed on me (reached maximum call stack size, duh...) and is super slow.
What if we build a module which reduces the WET file content based on line length?
That should get rid of lists and shouldn't be expensive at all to do.
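A minimal sketch of that idea, assuming blocks are separated by blank lines and the cut-off is configurable (function and parameter names are illustrative, not the actual module):

```javascript
// Keep only blank-line-separated blocks whose average line length is at
// least `minAvg` characters; lists tend to have short lines, prose long ones.
function reduceByLineLength(text, minAvg = 40) {
  return text
    .split(/\n{2,}/)
    .filter(block => {
      const lines = block.split('\n').filter(l => l.trim().length > 0);
      if (lines.length === 0) return false;
      const avg = lines.reduce((sum, l) => sum + l.length, 0) / lines.length;
      return avg >= minAvg;
    })
    .join('\n\n');
}
```

Splitting on blank lines keeps prose paragraphs intact while whole list blocks fall below the cut-off and get dropped.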
Ok, so I've implemented the module to reduce WET file content down to relevant content based on line length from my earlier post. #202
The results so far look really good. We're getting a lot of text now in the outputted WET files instead of lists.
These are the results from my tests so far:
run4-t3-al-40.tsv, for example, would be a heuristic threshold of 3 and an avg line length of 40.

I'll do some more tests on the WET files from yesterday, so we can compare the results. @goldbergtatyana
Ok, so here are the results of the latest run with thresholds ranging from 4 to 8. The tests were done on the same WET files as yesterday's tests, and the file content was reduced to an average line length of 100 characters.
Average line length of 100 and threshold of
Average line length of 200 and threshold of
@goldbergtatyana
I have only found a single list at threshold 8 so far! 😄 The average line length is at over 600 in the generated WET file! Compare that to the 40ish characters per line we've had before this change! I think we can easily increase the required avg line length by a fair bit.
For the final run I suggest using both a threshold upper limit and an average line length of 200. I am trying to figure out the upper limit right now; 20 should be a good starting point based on my tests.
An upper limit below 20 would throw away too many valuable sources, and a limit of 20 is still low enough to eliminate sites like http://www.di-arezzo.co.uk
EDIT: Even 30 is enough to keep that site out of our results, 40 is too high though.
Here are the results for my last post: run6-r0-4.zip
@goldbergtatyana
Optimal settings have been determined, can be closed.
- Do some more test runs to find a few more candidates for `term-blacklist.txt`
- Do some tests to gather statistics with different heuristic thresholds for @goldbergtatyana

NOTE: Add an extra column to the `.csv` file to contain the matched terms when sharing them here. (#198)