MusicConnectionMachine / UnstructuredData

In this project we will be scanning unstructured online resources such as the Common Crawl data set.
GNU General Public License v3.0

Added blacklist with terms that we should not match #192

Closed: lukasstreit closed this 7 years ago

lukasstreit commented 7 years ago

Term Blacklist

This is version 0.1 and should definitely be improved in the future. I went over the terms by hand to remove the ones that should not be in the blacklist. The candidate terms were mostly picked using the following query, which selects the 1000 entities that occur most often in the contains table:

SELECT *
FROM entities
   LEFT JOIN artists ON artists."entityId" = entities.id
   LEFT JOIN instruments ON instruments."entityId" = entities.id
   LEFT JOIN works ON works."entityId" = entities.id
   LEFT JOIN releases ON releases."entityId" = entities.id
WHERE entities.id IN (
   SELECT "entityId"
   FROM contains
   GROUP BY "entityId"
   ORDER BY COUNT(*) DESC
   LIMIT 1000)

PS: Yes, I know this query can be improved.
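
One possible direction for that improvement, as a rough sketch only: join against the aggregated subquery instead of using IN, so that the occurrence count is part of the result and the rows come back ordered by it. The example below runs against an in-memory SQLite database from Python, not the project's actual database. The two-table schema (entities with only an id column, contains with only an "entityId" column) and the helper names build_demo_db and top_entities are assumptions made for illustration; the LEFT JOINs on artists, instruments, works and releases from the original query could be appended to the outer SELECT in the same way.

```python
import sqlite3

# Top-N entities by number of rows in "contains", with the count included.
# Assumed, simplified schema: entities(id), contains("entityId").
TOP_ENTITIES_SQL = """
SELECT e.id, top.occurrences
FROM (
    SELECT "entityId", COUNT(*) AS occurrences
    FROM contains
    GROUP BY "entityId"
    ORDER BY occurrences DESC, "entityId"
    LIMIT ?
) AS top
JOIN entities AS e ON e.id = top."entityId"
ORDER BY top.occurrences DESC, e.id
"""


def build_demo_db():
    """Create a tiny in-memory database with made-up rows (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE entities (id INTEGER PRIMARY KEY);
        CREATE TABLE contains ("entityId" INTEGER REFERENCES entities(id));
        """
    )
    conn.executemany("INSERT INTO entities (id) VALUES (?)", [(1,), (2,), (3,)])
    # Entity 1 occurs once, entity 2 twice, entity 3 three times.
    conn.executemany(
        'INSERT INTO contains ("entityId") VALUES (?)',
        [(1,), (2,), (2,), (3,), (3,), (3,)],
    )
    return conn


def top_entities(conn, limit=1000):
    """Return (entity id, occurrence count) pairs, most frequent first."""
    return conn.execute(TOP_ENTITIES_SQL, (limit,)).fetchall()


if __name__ == "__main__":
    for entity_id, occurrences in top_entities(build_demo_db(), limit=2):
        print(entity_id, occurrences)
```

Having the count next to each entity should make the manual review easier, since the most frequent candidates appear first and ties are broken by id so repeated runs return the same rows.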