Open tommasocarraro opened 2 years ago
I have a follow-up question regarding the dataset.
I can see that, for example, the movie "The Jungle Book" appears three times in the entities.csv file. This makes sense because they are three different versions of the same movie. However, in the MovieLens-100k dataset with 9000 movies, that you have selected for creating MindReader, there is only one "The Jungle Book", the one corresponding to this URI in Wikidata: https://www.wikidata.org/wiki/Q16857406.
If you created the MindReader dataset from this MovieLens dataset, it is not clear where the other two "The Jungle Book" entries come from.
In my experiments, I would like to merge the ratings of MovieLens with the ratings of MindReader to perform some investigations. This is very difficult because I cannot build a reliable mapping between the items of the two datasets.
Hi @tommasocarraro,
We will try to fix both of these problems. I actually have a mapping from the ML-20m items to Wikidata URIs, which you can download via the following link: https://we.tl/t-jkzXR371fs. It is a JSON file containing a dictionary that maps each item's IMDb ID to its Wikidata URI. Note that the file is only available for download for the next 7 days, but I hope that is fine until we have the proper information available on our site.
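A minimal sketch of how such a mapping file could be read and queried in Python. The exact key format of the file is not stated here, so the code assumes the keys are IMDb IDs that may or may not carry the 'tt' prefix and normalises them; the function names and the sample values are hypothetical.

```python
import json


def parse_mapping(json_text):
    """Parse a JSON object mapping IMDb IDs to Wikidata URIs.

    Assumption: keys may or may not carry the 'tt' prefix, so the prefix
    is stripped to make lookups uniform.
    """
    raw = json.loads(json_text)
    return {str(key).removeprefix("tt"): uri for key, uri in raw.items()}


def load_mapping(path):
    """Read the mapping from a JSON file on disk (path is up to the user)."""
    with open(path, encoding="utf-8") as handle:
        return parse_mapping(handle.read())


def wikidata_uri(mapping, imdb_id):
    """Return the Wikidata URI for an IMDb ID, or None if it is unmapped."""
    return mapping.get(str(imdb_id).removeprefix("tt"))


# Illustrative, made-up content; the real file is expected to have the same shape.
sample = '{"tt0000001": "http://www.wikidata.org/entity/Q_EXAMPLE"}'
mapping = parse_mapping(sample)
```

With this, wikidata_uri(mapping, "0000001") and wikidata_uri(mapping, "tt0000001") return the same URI, and an unknown ID returns None.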
If you want to create a similar file for a different ML dataset you can use the following query:
SELECT ?label ?film WHERE {
  ?film wdt:P345 ?label.
  VALUES ?label {
    # Insert IMDb IDs as quoted strings separated by spaces
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
And here is an example (note that 'tt' is prepended to each IMDb ID):
SELECT ?label ?film WHERE {
  ?film wdt:P345 ?label.
  VALUES ?label {
    "tt0114709" "tt0113497" "tt0113228" "tt0114885"
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
And here is how to execute the example query with Python, assuming the variable query holds the example query as a string:
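from SPARQLWrapper import SPARQLWrapper, JSON, POST

endpoint_url = 'https://query.wikidata.org/sparql'
user_agent = 'some_agent_name'  # Remember to define a proper agent name

sparql = SPARQLWrapper(endpoint_url)
sparql.addCustomHttpHeader('User-Agent', user_agent)
sparql.addCustomHttpHeader('Retry-After', '2')
sparql.setQuery(query)  # query holds the SPARQL string
sparql.setReturnFormat(JSON)
sparql.setMethod(POST)

# Each binding holds the IMDb ID (?label) and the Wikidata entity URI (?film)
result = sparql.query().convert()['results']['bindings']

# Map each IMDb ID, without its 'tt' prefix, to the Wikidata URI
movie_uri = {}
for r in result:
    # removeprefix (Python 3.9+) drops only the leading 'tt';
    # lstrip('tt') would strip any run of leading 't' characters
    movie_uri[r['label']['value'].removeprefix('tt')] = r['film']['value']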
You can add around 5000 IMDb IDs per query, but you can experiment with the exact number. Furthermore, see this guide for information on the number of queries you can make to the endpoint: https://www.mediawiki.org/wiki/Wikidata_Query_Service/User_Manual.
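The batching described above could be sketched as follows. This is only an illustration: the helper names are hypothetical, the default batch size of 5000 is the rough figure mentioned above, and the zero-padding to seven digits assumes that bare numeric IDs (for example from a MovieLens links file) have lost their leading zeros.

```python
def normalize_imdb_id(imdb_id):
    """Return the ID in 'tt0114709' form.

    Assumption: IDs without the prefix are bare numbers that should be
    zero-padded to at least seven digits.
    """
    text = str(imdb_id)
    return text if text.startswith("tt") else "tt" + text.zfill(7)


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_query(imdb_ids):
    """Build one SPARQL query whose VALUES clause lists the given IDs."""
    values = " ".join(f'"{imdb_id}"' for imdb_id in imdb_ids)
    return (
        "SELECT ?label ?film WHERE {\n"
        "  ?film wdt:P345 ?label.\n"
        f"  VALUES ?label {{ {values} }}\n"
        "}"
    )


def build_queries(imdb_ids, batch_size=5000):
    """Split the IDs into batches and return one query string per batch."""
    normalized = [normalize_imdb_id(imdb_id) for imdb_id in imdb_ids]
    return [build_query(batch) for batch in chunked(normalized, batch_size)]


# Small batch size only to make the split visible.
queries = build_queries(["0114709", "tt0113497", 113228], batch_size=2)
```

Each resulting string can be passed to sparql.setQuery in turn, ideally with a short pause between requests to stay within the limits described in the user manual.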
And sorry for the late reply.
Hi @theisjendal, Thank you very much for all the precious information.
I will try to create a mapping between the movielens-100k and movielens-latest-small (i.e., the dataset you used in the paper) datasets by matching on title and year. I want to do that because movielens-latest-small has up-to-date IMDb links. Then, from the IMDb links, I will try to get the Wikidata URIs for the movies in movielens-100k by using your query approach.
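A minimal sketch of such a title-and-year match, assuming both datasets store titles in the form "Title (YYYY)". The function names and sample titles are illustrative; ambiguous keys (several movies sharing the same title and year) are skipped rather than guessed.

```python
import re

# Captures everything before a trailing "(YYYY)" as the title.
TITLE_YEAR = re.compile(r"^(?P<title>.*?)\s*\((?P<year>\d{4})\)\s*$")


def title_year_key(raw_title):
    """Return (lowercased title, year), or None if no trailing year is found."""
    match = TITLE_YEAR.match(raw_title)
    if match is None:
        return None
    return (match.group("title").strip().lower(), int(match.group("year")))


def match_by_title_year(source_titles, target_titles):
    """Map source movie IDs to target movie IDs on an exact (title, year) match.

    Both arguments are dicts of movie ID -> raw title string.
    Only unambiguous matches (exactly one candidate) are returned.
    """
    index = {}
    for movie_id, title in target_titles.items():
        key = title_year_key(title)
        if key is not None:
            index.setdefault(key, []).append(movie_id)

    matches = {}
    for movie_id, title in source_titles.items():
        key = title_year_key(title)
        candidates = index.get(key, []) if key is not None else []
        if len(candidates) == 1:
            matches[movie_id] = candidates[0]
    return matches


# Illustrative IDs and titles, not taken from the real files.
ml100k = {1: "Toy Story (1995)", 2: "Jungle Book, The (1994)", 3: "unknown"}
latest = {
    10: "Toy Story (1995)",
    20: "Jungle Book, The (1994)",
    21: "Jungle Book, The (1994)",
    30: "Jungle Book, The (1967)",
}
matched = match_by_title_year(ml100k, latest)
```

Exact matching is deliberately conservative: titles that differ in punctuation or article placement will not match and would need manual review or a fuzzier comparison.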
We could leave this issue open until the issue on the entities is checked and fixed. Thank you again!
Hello everyone, I am here again since I will use your interesting dataset in another project.
Today, I think I have found another problem. In the paper, you wrote that the MindReader-KG is constructed over a subset of 9,000 movies of the MovieLens-100k dataset. I think there is an error here.
MovieLens-100k is a well-known benchmark dataset with around 1000 users and just 1700 movies.
In your paper, the statistics say that MovieLens-100k has around 600 users and 9000 movies. This is wrong. The dataset you are referring to is not the well-known benchmark MovieLens-100k. Instead, it is a newer version of the dataset (not stable and subject to change) that can be found here: https://grouplens.org/datasets/movielens/latest/.
This is just a mistake in the name of the dataset. Researchers in the recommendation domain associate the name MovieLens-100k with the well-known benchmark, so I suggest changing the name in the next release of the article.
I am also writing this issue just to be sure that I have found the correct dataset.
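One quick way to check which MovieLens release a ratings file comes from is to count its distinct users and movies and compare them with the figures above (roughly 1000 users and 1700 movies for the benchmark, versus roughly 600 users and 9000 movies for the latest-small release). The sketch below assumes a comma-separated file with a header row and columns named userId and movieId; other releases may use a different layout, so the column names are parameters.

```python
import csv
import io


def dataset_stats(ratings_csv_text, user_col="userId", item_col="movieId"):
    """Count distinct users, distinct movies, and ratings in CSV text.

    Assumption: the text has a header row containing user_col and item_col.
    """
    reader = csv.DictReader(io.StringIO(ratings_csv_text))
    users, movies, ratings = set(), set(), 0
    for row in reader:
        users.add(row[user_col])
        movies.add(row[item_col])
        ratings += 1
    return {"users": len(users), "movies": len(movies), "ratings": ratings}


# Tiny made-up sample in the assumed layout.
sample = (
    "userId,movieId,rating,timestamp\n"
    "1,10,4.0,100\n"
    "1,20,3.5,101\n"
    "2,10,5.0,102\n"
)
stats = dataset_stats(sample)
```

For a real file, the text can be read with open(path, encoding="utf-8").read() before being passed to dataset_stats.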
Thank you in advance!