cyclestreets / cyclescape

Cyclescape - cycle campaign group toolkit
https://www.cyclescape.org/
MIT License
33 stars 15 forks source link

Issue elimination - database migration #1011

Open mvl22 opened 2 years ago

mvl22 commented 2 years ago

We discussed in depth the data migration for the elimination of issues to leave discussions only.

NB: Issue is a container but is not actually heavily used in the logic, so we can leave it lying about in parallel if necessary.

For each thread:

At end of message could inject a new paragraph: [This message was auto-created from an overall description of this issue that was created in an earlier version of this website.]

If you can detect identical content between issue and first message then don't add the first message.

Data cleansing:

mvl22 commented 2 years ago

Just to say we now have a volunteer for the tagging work, so if you can generate the data for this I can get them started.

They'll also be working on dealing with variant tagging, which would be good to clean up. Is it easy to generate a list of all tags and their frequency? That should hopefully enable us to determine similar tags.

nikolai-b commented 2 years ago

To get the issues with threads where neither the issue or the thread have tags I've used:

File.open("issues_without_tags.txt", "w") do |file|
  Issue.joins(:threads).where(<<~SQL
    not exists (select 1 from issue_tags where issue_tags.issue_id = issues.id) 
    and not exists (select 1 from message_thread_tags where message_thread_tags.thread_id = message_threads.id)
  SQL
  ).distinct.select("issues.id, issues.title").find_each { |iss| file << "https://www.cyclescape.org/issues/#{iss.to_param}\n" }
end

issues_without_tags.txt

I'd like to add a validation to ensure all future message threads have tags so we don't get new ones. Given the issue might be tagged I could do something like pre-fill the new thread's tags with the issues ones if you want but also starting with a blank doesn't seem unreasonable.

For issues with multiple threads:

headers = ["Issue URL", "Tags on the issue only"]
CSV.open("issues_with_multiple_threads.csv", "w", write_headers: true, headers: headers) do |csv|
  Issue.where(id: iss_ids).left_joins(:tags).group(:id, :title).select("issues.id, title, json_agg(tags.name) as tag_names").find_each { |iss| csv << ["https://www.cyclescape.org/issues/#{iss.to_param}", iss.tag_names] }
end;

issues_with_multiple_threads.csv

Finally the top tags:

headers = ["Tag name", "Tag frequency (including issues, threads and library items)"]
CSV.open("tag_with_freq.csv", "w", write_headers: true, headers: headers) do |csv|
  Tag.top_tags_fresh(100000).map { |tg| csv << [tg.name, tg.tag_count] }
end

tag_with_freq.csv

mvl22 commented 2 years ago

Could we get the counts for those added to the stats page somewhere so I can keep an eye on that as we work through it at this end? I donโ€™t mind running the queries on the DB manually from time to time to get the data output but a quick view for the three stats would be helpful.

I'd like to add a validation to ensure all future message threads have tags so we don't get new ones.

Yes, definitely. Soon as we have the current data cleaned up, we should apply that to editing also so that the constraint is in before we do the main site migration.

Finally the top tags

I see we have some tags with zero references to them. I suggest we zap those as they aren't useful and will just create autocomplete noise.

nikolai-b commented 2 years ago

I've added a new admin pages with the untagged issues and issues with multiple threads at /admin/home

I'll add the validation once all issues have tags otherwise it will get confusing.

I've removed the tags with no references, I'll make an issue to check this every so often.

nikolai-b commented 2 years ago

Issues have voting :+1: and :-1: whereas threads have favourites :star: . We have 912 instances where a user has :+1: or :-1: and issue. Do we just ignore these? It matters for "popular" part of the new discussions design.

mvl22 commented 2 years ago

Issues have voting ๐Ÿ‘ and ๐Ÿ‘Ž whereas threads have favourites โญ . We have 912 instances where a user has ๐Ÿ‘ or ๐Ÿ‘Ž and issue. Do we just ignore these? It matters for "popular" part of the new discussions design.

Popular discussions is simply the discussions with the most messages in.

Favourites are purely just a booking mechanism for the user. They are not used in determining popularity. Other users cannot see who has favourited what.

Issues have voting, but you can basically entirely ignore that now. That data was never very interesting and will just fade away.

mvl22 commented 2 years ago

To get the issues with threads where neither the issue or the thread have tags I've used

Action: Nikolai to make an admin page to help this be worked through more quickly.

mvl22 commented 2 years ago

To get the issues with threads where neither the issue or the thread have tags

/stats/issues_untagged

This report now has zero hits, so it would be safe to start enforcing at the database level that an issue/thread must have tags.