clamsproject / app-dbpedia-spotlight-wrapper

CLAMS wrapper for DBpedia Spotlight
Apache License 2.0

NE category normalization #6

Closed keighrim closed 1 year ago

keighrim commented 1 year ago

This could be harder than it looks, but we might want a "normalization" mapping from DBpedia categories to more commonly used NE categories. For now, these are the distinct categories we see in some of the output data.

$ cat preds@dbpedia-spotlight-wrapper@aapb-collaboration-21/*.mmif | jq | grep '"category"' | sort -u
            "category": "501(c)(3) organization",
            "category": "501(c)(3)",
            "category": "501(c)(4)",
            "category": "Abstraction100002137",
            "category": "Administrative divisions of New York",
            "category": "Administrative divisions of Ohio",
            "category": "Aktiengesellschaft",
            "category": "American Bureau of Shipping",
            "category": "American Viticultural Area",
            "category": "Anime",
            "category": "Army",
            "category": "ArtificialSatellite",
            "category": "Assault rifle",
            "category": "Atmosphere114520278",
            "category": "Automatic rifle",
            "category": "Band",
            "category": "Benefit corporation",
            "category": "Bicameralism",
            "category": "Biological database",
            "category": "Book106410904",
            "category": "Boroughs of New York City",
            "category": "Bus",
            "category": "Capital city",
            "category": "Career and Technical Student Organization",
            "category": "Cartel",
            "category": "Catholic Church",
            "category": "ChangeOfState100199130",
            "category": "Charitable organisation",
            "category": "Charitable organization",
            "category": "Christianity",
            "category": "Cities in Israel",
            "category": "Cities of South Korea",
            "category": "City (Minnesota)",
            "category": "City (New York)",
            "category": "City government in Washington (state)",
            "category": "City",
            "category": "Class107997703",
            "category": "Coast guard",
            "category": "Coast",
            "category": "Consolidated city-county",
            "category": "Controversy107183151",
            "category": "Corpus separatum (Jerusalem)",
            "category": "Daily newspaper",
            "category": "Digital terrestrial television",
            "category": "Direct-administered municipalities of China",
            "category": "District108552138",
            "category": "E-reader",
            "category": "Election",
            "category": "EndProduct103287178",
            "category": "Environmental protection",
            "category": "Federal capital",
            "category": "FictionalCharacter109587565",
            "category": "Food preservation",
            "category": "Food",
            "category": "Forward operating base",
            "category": "Free-to-air",
            "category": "Function113783816",
            "category": "Game100456199",
            "category": "Garment103419014",
            "category": "Glacial lake",
            "category": "Gossip107223170",
            "category": "Government-owned corporation",
            "category": "Grant113266892",
            "category": "GroupAction101080366",
            "category": "Home video game console",
            "category": "Independent agencies of the United States government",
            "category": "Independent city (United States)",
            "category": "Institution",
            "category": "Intellectual109621545",
            "category": "Intelligence",
            "category": "Intergovernmental organization",
            "category": "International non-governmental organization",
            "category": "Issue105814650",
            "category": "Joint-stock",
            "category": "LanguageUnit106284225",
            "category": "Learned society",
            "category": "Limited liability company",
            "category": "List of capitals in the United States",
            "category": "List of cities and towns in Croatia",
            "category": "List of cities and towns in Lebanon",
            "category": "List of cities in Egypt",
            "category": "List of cities in Illinois",
            "category": "List of cities in Iraq",
            "category": "List of cities in Oman",
            "category": "List of cities in Ontario",
            "category": "List of cities in Quebec",
            "category": "List of cities in Yemen",
            "category": "List of communities in Miami-Dade County, Florida",
            "category": "List of international sport federations",
            "category": "List of municipalities in California",
            "category": "List of municipalities in Colorado",
            "category": "List of municipalities in Illinois",
            "category": "List of regencies and cities of Indonesia",
            "category": "List of regions of California",
            "category": "List of regions of the United States",
            "category": "List of specialized agencies of the United Nations",
            "category": "List of towns in Alberta",
            "category": "Mail order",
            "category": "Manchester Metrolink",
            "category": "Manga",
            "category": "Mars rover",
            "category": "Merchandise103748886",
            "category": "Metropolitan area",
            "category": "Metropolitan borough",
            "category": "Military alliance",
            "category": "Mormon studies",
            "category": "Mortar (weapon)",
            "category": "Municipalities of Norway",
            "category": "Municipalities of Spain",
            "category": "Municipalities of Sweden",
            "category": "Municipality",
            "category": "Music107020895",
            "category": "Mutual company",
            "category": "NASA facilities",
            "category": "NGO",
            "category": "National research and education network",
            "category": "Navy",
            "category": "Neighborhood",
            "category": "Neighborhoods in San Francisco",
            "category": "Newspaper",
            "category": "Non-Profit",
            "category": "Non-departmental public body",
            "category": "Non-governmental organization",
            "category": "Non-metropolitan district",
            "category": "Non-profit corporation",
            "category": "Non-profit organization",
            "category": "Non-profit",
            "category": "Nonprofit organization",
            "category": "Number106425065",
            "category": "Order107168623",
            "category": "Organization",
            "category": "Orientation106208021",
            "category": "Paramilitary",
            "category": "PartialDifferentialEquation106670866",
            "category": "Person100007846",
            "category": "PersonWithOccupation",
            "category": "PhysicalEntity100001930",
            "category": "Private University",
            "category": "Private company limited by shares",
            "category": "Private company",
            "category": "Private foundation",
            "category": "Private university",
            "category": "Privately held company",
            "category": "Professional association",
            "category": "Provinces of Afghanistan",
            "category": "Provinces of the Dominican Republic",
            "category": "Provisional government",
            "category": "Public Sector Undertakings in India",
            "category": "Public Transport Victoria",
            "category": "Public broadcasting",
            "category": "Public charity",
            "category": "Public college",
            "category": "Public company",
            "category": "Public policy",
            "category": "Public university",
            "category": "Region",
            "category": "Regional organization",
            "category": "Revenue service",
            "category": "Revolution107424109",
            "category": "Rule105846054",
            "category": "S.A. (corporation)",
            "category": "Salamander101629276",
            "category": "Satellite campus",
            "category": "Sea",
            "category": "SemiconductorDevice104171831",
            "category": "Serial (radio and television)",
            "category": "Service101209576",
            "category": "Single (music)",
            "category": "SocialGroup107950920",
            "category": "Società per azioni",
            "category": "Sound107371293",
            "category": "SpatialThing",
            "category": "Special cities of North Korea",
            "category": "Sports governing body",
            "category": "State-owned company",
            "category": "States and union territories of India",
            "category": "Statutory corporation",
            "category": "Stealth aircraft",
            "category": "Sub-provincial division",
            "category": "Subsidiary",
            "category": "Supranational union",
            "category": "Supreme Court of the United States case",
            "category": "System104377057",
            "category": "Tanker (ship)",
            "category": "Terrestrial television",
            "category": "Thing",
            "category": "Think tank",
            "category": "Timeline106504965",
            "category": "Unincorporated community",
            "category": "United Nations System",
            "category": "United States Federal Executive Departments",
            "category": "United States federal executive departments",
            "category": "Vehicle104524313",
            "category": "Village",
            "category": "Virginia Railway Express",
            "category": "Virtual community",
            "category": "Weapon104565375",
            "category": "Web portal",
            "category": "Whole100003553",
            "category": "Wide-body aircraft",
            "category": "WikicatAmericanFootballTeamsInWashington(state)",
            "category": "WikicatAmericanInventions",
            "category": "WikicatAmericanPeopleOfIrishDescent",
            "category": "WikicatApplicationLayerProtocols",
            "category": "WikicatArchaeologicalSitesInIsrael",
            "category": "WikicatArtificialSatellitesInGeosynchronousOrbit",
            "category": "WikicatBiobankOrganizations",
            "category": "WikicatChildren",
            "category": "WikicatConceptionsOfGod",
            "category": "WikicatDefunctNewspapersOfTheUnitedStates",
            "category": "WikicatDysphemisms",
            "category": "WikicatElectricPowerBlackouts",
            "category": "WikicatExhibitions",
            "category": "WikicatForests",
            "category": "WikicatFormalLanguages",
            "category": "WikicatHumanRights",
            "category": "WikicatLaboratoryTechniques",
            "category": "WikicatLegislatures",
            "category": "WikicatLibrariesInTheNetherlands",
            "category": "WikicatLinkProtocols",
            "category": "WikicatMaritimeIncidentsIn1989",
            "category": "WikicatMedievalWeapons",
            "category": "WikicatMilitaryDoctrines",
            "category": "WikicatNetworkProtocols",
            "category": "WikicatOrdersOfKnighthood",
            "category": "WikicatPalestinianRefugees",
            "category": "WikicatPoaceaeSubfamilies",
            "category": "WikicatShipMeasurements",
            "category": "WikicatStatisticalRatios",
            "category": "WikicatSubdivisionsOfChina",
            "category": "WikicatWeaponsCountermeasures",
            "category": "WikicatWeatherHazards",
            "category": "World Heritage Site",
            "category": "Writing100614224",
            "category": "Writing106362953",
            "category": "Wrongdoing100732746",
            "category": "activity",
            "category": "administrative region",
            "category": "agent",
            "category": "aircraft",
            "category": "airport",
            "category": "album",
            "category": "american football player",
            "category": "amusement park attraction",
            "category": "anatomical structure",
            "category": "animal",
            "category": "architect",
            "category": "architectural structure",
            "category": "automobile",
            "category": "award",
            "category": "bank",
            "category": "basketball league",
            "category": "basketball player",
            "category": "beverage",
            "category": "book",
            "category": "brain",
            "category": "broadcaster",
            "category": "building",
            "category": "chemical compound",
            "category": "chemical substance",
            "category": "city",
            "category": "clerical administrative region",
            "category": "comic",
            "category": "company",
            "category": "convention",
            "category": "country",
            "category": "cricketer",
            "category": "criminal",
            "category": "currency",
            "category": "dam",
            "category": "device",
            "category": "disease",
            "category": "drug",
            "category": "educational institution",
            "category": "engine",
            "category": "ethnic group",
            "category": "eukaryote",
            "category": "fictional character",
            "category": "gene",
            "category": "glacier",
            "category": "government agency",
            "category": "historic place",
            "category": "holiday",
            "category": "information appliance",
            "category": "infrastructure",
            "category": "insect",
            "category": "island",
            "category": "lacrosse player",
            "category": "lake",
            "category": "language",
            "category": "legislature",
            "category": "library",
            "category": "lighthouse",
            "category": "mammal",
            "category": "mean of transportation",
            "category": "medical specialty",
            "category": "military conflict",
            "category": "military person",
            "category": "military structure",
            "category": "military unit",
            "category": "mineral",
            "category": "mollusca",
            "category": "motorsport racer",
            "category": "mountain range",
            "category": "mountain",
            "category": "movie",
            "category": "museum",
            "category": "music genre",
            "category": "musical artist",
            "category": "musical work",
            "category": "musical",
            "category": "mythological figure",
            "category": "national football league event",
            "category": "newspaper",
            "category": "noble",
            "category": "office holder",
            "category": "organisation",
            "category": "periodical literature",
            "category": "person function",
            "category": "person",
            "category": "place",
            "category": "planet",
            "category": "plant",
            "category": "play",
            "category": "poem",
            "category": "political party",
            "category": "populated place",
            "category": "power station",
            "category": "programming language",
            "category": "protein",
            "category": "radio program",
            "category": "radio station",
            "category": "railway line",
            "category": "record label",
            "category": "restaurant",
            "category": "river",
            "category": "road",
            "category": "roller coaster",
            "category": "rugby player",
            "category": "saint",
            "category": "school",
            "category": "settlement",
            "category": "ship",
            "category": "single",
            "category": "soccer club",
            "category": "soccer player",
            "category": "soccer tournoment",
            "category": "societal event",
            "category": "software",
            "category": "song",
            "category": "space shuttle",
            "category": "species",
            "category": "sport",
            "category": "sports event",
            "category": "station",
            "category": "television episode",
            "category": "television season",
            "category": "television show",
            "category": "television station",
            "category": "topical concept",
            "category": "train",
            "category": "unit of work",
            "category": "university",
            "category": "venue",
            "category": "video game",
            "category": "village",
            "category": "weapon",
            "category": "website",
            "category": "winter sport Player",
            "category": "work",
            "category": "written work",
            "category": "پھُپھُوندی",

wricketts commented 1 year ago

@keighrim please see this gist on the issue. Maybe we can discuss how to address the limitations I've found?
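
For reference, the "normalization" mapping proposed in the opening comment could be sketched roughly as below. This is only an illustration under stated assumptions: the coarse labels (`PERSON`, `ORGANIZATION`, `LOCATION`, `MISC`), the `EXACT` and `KEYWORDS` tables, and the function names are all hypothetical and not part of the wrapper. The sketch first cleans the raw category (dropping the 9-digit synset-style suffix seen in values like `Person100007846`, dropping the `Wikicat` prefix, splitting CamelCase), then tries an exact lookup, then a keyword fallback.

```python
import re

# Hypothetical coarse target labels; the issue does not specify a target set.
EXACT = {
    "person": "PERSON",
    "office holder": "PERSON",
    "organisation": "ORGANIZATION",
    "organization": "ORGANIZATION",
    "company": "ORGANIZATION",
    "ngo": "ORGANIZATION",
    "place": "LOCATION",
    "populated place": "LOCATION",
    "country": "LOCATION",
    "settlement": "LOCATION",
}

# Hypothetical keyword fallback, checked in order against the tokens of the key.
KEYWORDS = [
    ("person", "PERSON"),
    ("people", "PERSON"),
    ("player", "PERSON"),
    ("organization", "ORGANIZATION"),
    ("organisation", "ORGANIZATION"),
    ("company", "ORGANIZATION"),
    ("corporation", "ORGANIZATION"),
    ("city", "LOCATION"),
    ("cities", "LOCATION"),
    ("municipalities", "LOCATION"),
    ("provinces", "LOCATION"),
]

_SYNSET_SUFFIX = re.compile(r"\d{9}$")               # e.g. "Person100007846"
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])")  # e.g. "SpatialThing"


def canonical_key(category: str) -> str:
    """Reduce a raw category string to a lowercase, space-separated key."""
    key = _SYNSET_SUFFIX.sub("", category.strip())
    if key.startswith("Wikicat"):
        key = key[len("Wikicat"):]
    key = _CAMEL_BOUNDARY.sub(" ", key)
    return " ".join(key.lower().split())


def normalize(category: str, default: str = "MISC") -> str:
    """Map a raw category to a coarse label, falling back to `default`."""
    key = canonical_key(category)
    if key in EXACT:
        return EXACT[key]
    tokens = set(re.findall(r"\w+", key))
    for keyword, label in KEYWORDS:
        if keyword in tokens:
            return label
    return default


if __name__ == "__main__":
    for raw in ["Person100007846", "Non-profit organization",
                "City (New York)", "WikicatAmericanPeopleOfIrishDescent"]:
        print(raw, "->", normalize(raw))
```

A keyword fallback like this will misfire on some of the values listed above (for example, `person function` is not a person), so any real table would need manual review against the actual output.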