edgeryders / discourse

The discourse.org forum software, with all the modifications as used on edgeryders.eu.
https://edgeryders.eu
GNU General Public License v2.0

Map content to the right Discourse categories and tags on import #5

Closed tanius closed 7 years ago

tanius commented 7 years ago

We need to sort all the edgeryders.eu content from the old Drupal platform into Discourse, which means assigning the right Discourse tags and categories. After going live with the Discourse platform, we will have to finalize this categorizing and tagging over time. But for starters, and because it's much more efficient, the import script should do the first rough mapping.

The mapping task of the import script

The basic requirement for the script is to allow defining Discourse target categories and target tags for content associated with a specific Drupal node.

These definitions could simply be written into a hashtable in Ruby code, or into a JSON file. Drupal nodes are identified by node ID. Discourse tags and categories would preferably be identified by name in this data structure, and would be created on the fly while importing content. While importing a piece of content, the script would have to determine whether it is associated with another Drupal node, and then look up which Discourse category (only one!) and tags (one or multiple!) to assign because of that association. To determine if a piece of content is associated with a Drupal node, the following two mechanisms have to be considered:

In addition, all "free tags" on content (using the Drupal taxonomy system) should be imported as tags on that content into Discourse. However, no additional categories would have to be created by the import script apart from those defined in the mapping definitions mentioned above.

And that should cover what we need. This mechanism will be sufficient for sorting in all the content initially. A common use case will be, for example, to assign a certain Discourse category to all content coming from a certain Drupal group, and in addition to assign a tag like project-original_group_name. This way, content originating from one group is still kept together by the tag (which can be followed and tracked in Discourse just like groups), even though there will be a lot of other content in the Discourse category to which this group content is imported.
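To make the idea concrete, here is a minimal Ruby sketch of such a mapping and its lookup. It is not the actual import script: the node IDs, category names, tag names, and the names `NODE_MAPPING` and `resolve_targets` are all made up for illustration, and it assumes the script already knows which Drupal node IDs a piece of content is associated with.

```ruby
# Hypothetical mapping: Drupal node ID => target Discourse category and tags.
# All IDs and names here are invented examples, not real edgeryders.eu data.
NODE_MAPPING = {
  101 => { category: "Campaigns", tags: ["project-example_group"] },
  202 => { category: "Completed Challenges", tags: ["project-other_group", "archive"] }
}.freeze

# Returns the single target category (or nil) and all target tags for one
# piece of content, given the Drupal node IDs it is associated with and the
# free taxonomy tags it carries.
def resolve_targets(associated_node_ids, free_tags = [], mapping = NODE_MAPPING)
  matches = associated_node_ids.map { |nid| mapping[nid] }.compact
  # Only one category is allowed, so the first matching definition wins.
  category = matches.first && matches.first[:category]
  mapped_tags = matches.flat_map { |definition| definition[:tags] }
  { category: category, tags: (mapped_tags + free_tags).uniq }
end
```

Content with no matching definition gets no category from the mapping but still keeps its free tags, which matches the rule that the script creates no categories beyond those defined in the mapping.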

albertocottica commented 7 years ago

Possibilities:

  1. Vector space model
  2. (This one's cool) entity extraction. A friend works at the company making a tool that maps a text onto the Wikidata entity graph to extract entities from a text. Try the demo: go here and paste the text of an Edgeryders post.

Others?

tanius commented 7 years ago

@albertocottica, you seem to be referring to some sort of automatic tagging system? That would go into a different issue, if we need it. This one is simply about using the structure that we have in the Drupal site (groups, challenges) and mapping it to structure in the Discourse site.

tanius commented 7 years ago

We decided to make a simpler implementation as follows:

  1. The import script will assign every Drupal node to one Discourse category, based on either (1) the challenge referenced by challenge response content, or else (2) the first (=main) Organic Groups group referenced by any other node. All categories created by the import script will be first-level categories in Discourse, and mostly correspond to our concept of "challenges".

  2. A Discourse admin (with the help of the Rails console if needed) will then do the following once, after going live with the website:

    • Rename the auto-created Discourse categories to have nice, short names.
    • Tag all content in an auto-created Discourse category with the same tag, indicating the project it belonged to (such as "Edgeryders CoE", "Open Village MENA" etc.). These tags will allow navigating past content by project, as project-based main categories will be dissolved when a project is completed.
    • Dissolve auto-created Discourse categories with very few (<30) pieces of content in them, sorting the content into other categories as appropriate.
    • Sort the auto-created Discourse categories in as sub-categories of either (1) main categories created manually for currently active projects, or (2) the "Completed Challenges" category.
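As a rough illustration of the two rules above, here is a sketch in plain Ruby. It is not the code that was actually implemented: `DrupalNode` and its field names, the `"challenge_response"` type string, and both helper names are assumptions, and falling back to the first group when a challenge response has no challenge reference is one possible reading of "or else".

```ruby
# Simplified stand-in for a Drupal node; the field names are assumptions.
DrupalNode = Struct.new(:nid, :type, :challenge_nid, :group_nids, keyword_init: true)

# Returns the ID of the Drupal node (challenge or group) whose auto-created
# Discourse category should receive this node, or nil if there is none.
def category_source_nid(node)
  # Rule (1): challenge responses go to the category of their challenge.
  return node.challenge_nid if node.type == "challenge_response" && node.challenge_nid
  # Rule (2): everything else goes to the first (= main) Organic Groups group.
  Array(node.group_nids).first
end

# Lists auto-created categories with fewer than `threshold` pieces of content,
# as candidates for an admin to dissolve by hand.
def small_categories(content_counts, threshold = 30)
  content_counts.select { |_name, count| count < threshold }.keys
end
```

The second helper only produces a list of candidates; moving the content into other categories stays a manual decision for the admin.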

tanius commented 7 years ago

Implemented now, in the simpler version as mentioned in the last comment above.