Wikidata / soweego

Link Wikidata items to large catalogs
https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego_2
GNU General Public License v3.0

Migration to NoSQL database #383

Open MaxFrax opened 4 years ago

MaxFrax commented 4 years ago

Currently, we import all the targets into a MariaDB database. From a technical viewpoint, switching to a document database like MongoDB would bring several advantages. Here they are, from most to least important (in my opinion):

  1. In workflow.py we perform some joins to gather all the information for an entity in a target. With a document database, these joins would not be necessary.
  2. In 'extract_features' in workflow.py, we already check whether the columns are there. The same check would apply with a document database.
  3. We tried to find a common schema among all the data sources and we failed. Introducing a document database would save a lot of the space currently spent on null fields and short words.
  4. A document database would let us store strings of variable length, fixing all the errors in the import phase caused by fields that are too small.
  5. There would be more flexibility in adding data that is available only in a single data source.
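
As a minimal sketch of points 1, 2, and 5, plain Python dictionaries can stand in for documents in a document store. The field names (`catalog_id`, `names`, `links`, `genres`) and the helper `available_features` are hypothetical; they are not taken from soweego's code or schema.

```python
# Hypothetical target entities stored as self-contained documents.
# Nested lists replace the separate tables that would otherwise need joins.
musician = {
    "catalog_id": "m-001",
    "names": ["Example Artist"],
    "links": ["https://example.org/artist"],
    "genres": ["jazz"],  # assumed to exist in this catalog only
}
writer = {
    "catalog_id": "w-042",
    "names": ["Example Author", "E. Author"],
    # no "links" or "genres" key: absent fields are simply not stored
}


def available_features(document, wanted):
    """Return the wanted fields that are present and non-empty in a document."""
    return [field for field in wanted if document.get(field)]


wanted = ["names", "links", "genres"]
musician_features = available_features(musician, wanted)
writer_features = available_features(writer, wanted)
print(musician_features)
print(writer_features)
```

The presence check plays the same role as the column check mentioned in point 2: it only moves from the table level to the document level.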

We would still need to stay consistent with the ontology mapping when importing the data, but that is nothing new under the sun.
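
One way to keep the import consistent could look like the sketch below: each source declares how its own field names map to shared keys, and records are renamed before being stored. The catalog names, field names, and the `to_document` helper are all assumptions made for illustration, not soweego's actual ontology mapping.

```python
# Hypothetical mapping from source-specific field names to shared keys.
FIELD_MAPPING = {
    "catalog_a": {"artist_name": "name", "born": "birth_date"},
    "catalog_b": {"full_name": "name", "birthday": "birth_date", "homepage": "url"},
}


def to_document(source, record):
    """Rename known fields to the shared keys and drop null values.

    Fields without a mapping keep their original name, so data that
    exists in a single source is preserved as it is.
    """
    mapping = FIELD_MAPPING[source]
    return {
        mapping.get(key, key): value
        for key, value in record.items()
        if value is not None
    }


doc_a = to_document(
    "catalog_a",
    {"artist_name": "Example Artist", "born": "1950-01-01"},
)
doc_b = to_document(
    "catalog_b",
    {"full_name": "Example Author", "birthday": None, "homepage": "https://example.org"},
)
print(doc_a)
print(doc_b)
```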

The only obstacle I see is on the infrastructure side. I wasn't able to find anything about document database hosting on Wikitech, so we should probably ask them. Maybe @marfox is aware of something.