DemocracyClub / yournextrepresentative

👥 A website for crowd-sourcing structured election candidate data
https://candidates.democracyclub.org.uk
GNU Affero General Public License v3.0
21 stars 27 forks source link

Clean instagram input #2400

Open VirginiaDooley opened 3 months ago

VirginiaDooley commented 3 months ago

This work cleans instagram entries to the Person Identifier form so that the result is a username with the value of the instagram url. I've also updated the people/README.md to list and describe the input and output of all the person identifiers. Finally, I've included a management command to reformat existing usernames to urls.


symroe commented 3 months ago

I think the regexs might need some work here.

First off, we have x.com in there (I guess copied from the twitter validator).

I think there's also instagr.am that we need to match (I don't know how much this is used), and matching .* at the start might be a problem that applies to Twitter too.

Can you use url parse (as we've done elsewhere) to match the domain to one of [ "instagram.com", "www.instagram.com", "instagr.am", "www.instagr.am", ]?

This is the first pass.

Then for usernames, we need to take the path part of the parsed value. I found this regex for validating usernames:

https://blog.jstassen.com/2016/03/code-regex-for-instagram-username-and-hashtags/

VirginiaDooley commented 3 months ago

I think the regexs might need some work here.

First off, we have x.com in there (I guess copied from the twitter validator).

I think there's also instagr.am that we need to match (I don't know how much this is used), and matching .* at the start might be a problem that applies to Twitter too.

Can you use url parse (as we've done elsewhere) to match the domain to one of [ "instagram.com", "www.instagram.com", "instagr.am", "www.instagr.am", ]?

This is the first pass.

Then for usernames, we need to take the path part of the parsed value. I found this regex for validating usernames:

https://blog.jstassen.com/2016/03/code-regex-for-instagram-username-and-hashtags/

I've added a fixup commit with these suggested changes and made it clear to the user that a URL not a username is expected on the form.

symroe commented 3 months ago

The regex is looking good now.

If the idea is to store the username only, then we shouldn't make the change to call the field "Instagram URL". As things stand, someone will enter a URL and it will be replaced with a username. But if they enter a username (or edit a profile with just a username in the field), it will fail validation due to the field not containing one of the valid domains.

We would also need to update WCIVF and all previous values in this field to only expect the username.

It might be better to standardise the URL with a username, rather than changing it from being a URL

VirginiaDooley commented 2 months ago

Couple of inline comments.

I think we're going to need to think about validating all existing data in this field somehow. If we deploy this code as-is, some profiles might not be saveable because the value in this field isn't valid. Users can just remove it, but we should clean this up before that happens.

The quickest thing to do, I imagine, is to run the regex over a CSV export with the instagram URL added. We can then see if there are any profiles with bad values and manually clean them up.

We'll want to do the same for LinkedIn URLs (as per the other open PR), so I expect running one report for both makes sense.

We have 4,227 instances of an instagram PersonIdentifier. All but one username passed the regex validation. However, we have approx 60 values stored as "@username" rather than a url. I could alter the script I used to detect dead links to find and replace those values with valid urls?