MusicConnectionMachine / UnstructuredData

In this project we will be scanning unstructured online resources such as the Common Crawl data set.

Expand blacklist and gather heuristic statistics #196

Closed · felixschorer closed 7 years ago

felixschorer commented 7 years ago

NOTE: Add an extra column to the .csv file containing the matched terms when sharing them here. (#198)

felixschorer commented 7 years ago
SELECT "c"."id", "c"."occurrences", "c"."entityId", "w"."url", "c"."websiteId", "w"."blob_url" 
FROM public.contains as "c", public.websites as "w" 
WHERE "w"."id" = "c"."websiteId" 
GROUP BY "w"."id", "c"."id" 
ORDER BY "w"."url", "c"."occurrences"

I just ran this query to give a more in-depth look at the test results from last night. But pgAdmin refuses to save the output as a .csv file...

nbasargin commented 7 years ago

Here are the query results:

Don't use pgAdmin 4. It sucks. pgAdmin 3 is much better.

felixschorer commented 7 years ago

Thanks, @nyxathid. So these should be the results for the first of the 66,500 WET files from CC-MAIN-2017-13, with a heuristic threshold of 4 and the blacklist from e7cfc46ede4b8277957b50fe14df8a607ae5dbee in place.

Just to recap the heuristic, a threshold of 4 means:

@goldbergtatyana

felixschorer commented 7 years ago

query.xlsx

That should be much more serviceable 😃

EDIT: I've just noticed that Excel is having problems displaying accented letters properly...

pfent commented 7 years ago

@felixschorer I'd recommend using LibreOffice when importing/exporting CSV data. Excel usually guesses the encoding/escaping/separators based on region and/or language (which can be quite a PITA), whereas LO just asks.
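
If you do have to stick with Excel, prepending a UTF-8 byte order mark usually makes it detect the encoding correctly. A minimal Node/TypeScript sketch (the helper name and row layout are made up, not from the repo):

import { writeFileSync } from "fs";

// Quote each cell, join rows with CRLF, and prepend "\uFEFF" (the BOM)
// so Excel reads the file as UTF-8 instead of the local ANSI codepage.
function writeCsvForExcel(path: string, rows: string[][]): void {
    const body = rows
        .map(row => row.map(cell => `"${cell.replace(/"/g, '""')}"`).join(","))
        .join("\r\n");
    writeFileSync(path, "\uFEFF" + body, { encoding: "utf8" });
}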

felixschorer commented 7 years ago

Thanks @pfent! query.xlsx

felixschorer commented 7 years ago

Ok, I'll do some testing now. I'll test heuristic threshold levels from 2 to 6 and will post the results here.

felixschorer commented 7 years ago

Ok, so here are my results from running our app 9 times over the first four files of CC-MAIN-2017-13. The file name tells you which heuristic threshold has been used.

CSV: run-r0-4.zip XLSX: run-r0-4-xlsx.zip

@goldbergtatyana

felixschorer commented 7 years ago

[Per-threshold breakdown for thresholds 2 to 6 did not survive archiving]

What I found is that with higher thresholds the quality of our matched sites increases. BUT the higher the threshold, the more the heuristic favours simple lists of names without much information about them.

We should probably find a happy medium between the two. Simple lists won't help the relationship group at all, and neither will webpages with almost no matched terms.

I'd say 4 or 5 would be perfect.

felixschorer commented 7 years ago

Here are some more files for thresholds of 4 and 6 joined with artists. run2-r0-4.zip

goldbergtatyana commented 7 years ago

@felixschorer and I came up with a quick and effective solution for filtering websites.

We base our statistics on the terms "mozart" (most popular term), "brahms" (we want to release MCM in time for his birthday) and "lullaby" (a music piece name @vviro gave us).

We estimate the number of websites that mention these terms at each threshold.

Then we count how many of these websites contain "music|sound|classic|opera|notes" in their URL.

The following results are based on four WET files:

At threshold 4: mozart 49/143 = 34% (share of matched URLs containing "music|sound|classic|opera|notes"), brahms 37/138 = 27%, lullaby 6/55 = 11%

At threshold 5: mozart 34/86 = 40%, brahms 28/70 = 40%, lullaby 5/25 = 20%

At threshold 6: mozart 32/72 = 44%, brahms 26/36 = 72%, lullaby 5/22 = 23%

At threshold 7: mozart 24/50 = 48%, brahms 20/25 = 80%, lullaby 3/16 = 19%
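
For reference, the ratio itself is trivial to compute; a TypeScript sketch (the URL array is a stand-in for the actual query results):

// Share of matched URLs that contain one of the music-related keywords.
const MUSIC_URL = /music|sound|classic|opera|notes/i;

function urlHitRatio(matchedUrls: string[]): number {
    if (matchedUrls.length === 0) return 0;
    return matchedUrls.filter(url => MUSIC_URL.test(url)).length / matchedUrls.length;
}

// e.g. for "mozart" at threshold 4 this came out to 49 / 143 ≈ 34%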

goldbergtatyana commented 7 years ago

At threshold 8: mozart 17/35 = 49%, brahms 14/17 = 82%, lullaby 3/12 = 25%

So there is hardly any difference between thresholds 7 and 8. We also noticed that the URLs that remain after filtering at high thresholds are just lists of terms and thus useless (e.g. http://www.di-arezzo.co.uk/scores-of-Christine+McVie.html).

Therefore, to remove those lists, we will set an upper bound on the threshold and keep the websites in the range 2 (or 3) < x < 7.
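
In code, the banding would look roughly like this (the bounds are the values under discussion, not final settings):

// Keep a page only if its match count x lies strictly inside lower < x < upper.
const keepPage = (matchCount: number, lower = 2, upper = 7): boolean =>
    matchCount > lower && matchCount < upper;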

@felixschorer will finalize the threshold settings tomorrow (on Friday)

and Group 2 will be ready to start running its scripts right after that!!!

felixschorer commented 7 years ago

The upper limit for the heuristic has been implemented in #201. I'll do some more test runs regarding the threshold and upper limit. I'll test [threshold]:[limit] (inclusive:exclusive) combinations on the second set of 4 WET files:

Results:

felixschorer commented 7 years ago

I'll also do a quick test on WARC files, i.e. exact copies of the HTML content instead of only the extracted text. I've found this node package, which seems to ignore lists entirely.

EDIT: The results it produces are excellent. BUT it just crashed on me (maximum call stack size exceeded, duh...) and is super slow.

felixschorer commented 7 years ago

What if we build a module which reduces the WET file content based on line length?

  1. We make a list of all lines in the file and sort it by line length.
  2. We then calculate the average line length and remove lines one by one, starting with the shortest, until the average line length reaches a set minimum.
  3. We reassemble the remaining lines in their original order and run our filters over the result.

That should get rid of lists and shouldn't be expensive to do at all. A rough sketch of what I have in mind is below.
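
A minimal sketch of that idea in TypeScript (the function and parameter names are placeholders, not the actual module):

// Drop the shortest lines until the average length of what remains
// reaches minAvgLength, then restore the original line order.
function reduceByLineLength(content: string, minAvgLength: number): string {
    const lines = content.split("\n");
    // Remember each line's original position for reassembly.
    const indexed = lines.map((text, index) => ({ text, index }));
    // Shortest lines first, so they are the first to be dropped.
    indexed.sort((a, b) => a.text.length - b.text.length);

    let total = indexed.reduce((sum, line) => sum + line.text.length, 0);
    let dropped = 0;

    // Remove the shortest remaining line while the average is below the minimum.
    while (indexed.length - dropped > 1 &&
           total / (indexed.length - dropped) < minAvgLength) {
        total -= indexed[dropped].text.length;
        dropped++;
    }

    // Reassemble the surviving lines in their original order.
    return indexed.slice(dropped)
        .sort((a, b) => a.index - b.index)
        .map(line => line.text)
        .join("\n");
}

The sort dominates at O(n log n), and the drop loop is a single pass over the sorted list, so the reduction stays cheap even for large WET files.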

felixschorer commented 7 years ago

Ok, so I've implemented the module from my earlier post, which reduces WET file content down to the relevant content based on line length. #202

The results so far look really good. We're getting a lot of text now in the output WET files instead of lists.

These are the results from my tests so far:

I'll do some more tests on the WET files from yesterday, so we can compare the results. @goldbergtatyana

felixschorer commented 7 years ago

Ok, so here are the results of the latest run with thresholds ranging from 4 to 8. The tests were done on the same WET files as yesterday's, and the file content was reduced to an average line length of 100 characters.

Average line length of 100, thresholds of 4 to 8:

run5-r0-4.zip

Average line length of 200, thresholds of 4 to 8:

run5-al200-r0-4.zip

@goldbergtatyana

I have only found a single list at threshold 8 so far! 😄 The average line length is over 600 characters in the generated WET file! Compare that to the 40-ish characters per line we had before this change! I think we can easily increase the required average line length by a fair bit.

felixschorer commented 7 years ago

For the final run I suggest using both a threshold upper limit and an average line length of 200. I am trying to figure out the upper limit right now; 20 should be a good starting point based on my tests.

An upper limit below 20 would throw too many valuable sources away, and a limit of 20 is still low enough to eliminate sites like http://www.di-arezzo.co.uk.

EDIT: Even 30 is enough to keep that site out of our results; 40 is too high though.

felixschorer commented 7 years ago

Here are the results for my last post: run6-r0-4.zip

@goldbergtatyana

felixschorer commented 7 years ago

Optimal settings have been determined; this issue can be closed.