Suggestions for refactoring data (happy to help!)

ilchef commented 5 months ago

Hi mate,

love your work.

I want to build a dashboard/interactive layer sitting on top of this.

I'd like to suggest the following changes to data for efficiency and capturing most information. I am happy to put these changes in a merge request if you're happy with them.

reformat dates Get all dates in YYYY-MM-DD format. Where we dont know a date definitely, (i.e. currently Jan-2021) we pick the earliest article about the send or instagram post, whichever came first, as a best-guess/placeholder. Can do more with data once dates are stored consistently.
refactor data from 1 entry = 1 route to 1 row = 1 send. Also proposing switch from json to tabular data. I think we could have richer data by retaining info about repeats (including send dates etc.), not just FAs. FA can be a derived field from this data (i.e. where min date where name == 'burden of dreams')

Im happy to do all the work here if you're comfortable. let me know!!

9cpluss commented 5 months ago

Hey! Thank you. :)

Love the idea but let me get back to you on the specific points. Busy weekend.

9cpluss commented 5 months ago

Hey! Had some time to think.

I get the wish to have proper dates for all ascents. But I also don’t wanna present dates to visitors that aren’t true. I think a compromise here would be to use your suggestion but make it clear with a footnote or something else that the real date is unknown and that the date shown is kind of a placeholder that is derived from when the ascent was first announced.
Yes, the JSON format has definitely reached its limits. I previously tried using a database as well but liked the open format of having just a text file that is human readable. The problem with tabular data is obviously repetition of information and having to copy a lot of information every time a new entry is made. Unless we mimic a relational database model with multiple tabular data files. Additional information (date of other ascents, personal grade) would of course be very useful. The JSON structure could also be adjusted to have a list of ascents with name, date, grade, video link for each instead of the flat structure. I think I would actually prefer that for now.

What do you think?

bourbonspecial commented 5 months ago

On moving to a tabular data format, I wonder whether you could have a series of CSVs as the master data (e.g. climbs.csv, climbers.csv and ascents.csv) and then have a build script that ingested this data and output it in to a variety of formats. For example, the script could produce an sqlite database as one of the outputs.

The advantage of this would be it's easy to collaborate via the repo and PRs but people can easily consume the data in a format that works for them.

On the downside, enforcing relational constraints (foreign keys etc.) could get messy with the data in CSVs. I guess some build script checks would help here but it could be tricky to keep on top of errors as the feedback loop is a little slow.

ilchef commented 5 months ago

@bourbonspecial I was thinking something very similar to you - 3 relational csvs. Not sure if we need to bother with formal keys or unique identifiers, as long as v17 climbers names are unique.

send_table: climber_name, climb_name, type (boulder or sport), date, confidence (in date - exact or approx)
climb_table: climb_name, country, crag, long, lat, grade
climber_table: climber_name, country, gender, year_of_birth

ilchef commented 5 months ago

@9cpluss

human readable data is great - I think our mainstream options are json and csv.

Agree with your points on inefficiency of tabular data - my workaround is to store in relational tables (see above). Worth noting that the volumes of data we are working with will always be pretty tiny - inefficiency isn't going to have any performance issues at the number of v17 sends.

Understand that json has its benefits though. If we want to maintain json, I think it could be reshaped to capture all send info. And including a confidence field for send dates we arent confident with. I think a lot of this data (send dates) will never be perfectly known. We can only really know the day it reached the internet, in a lot of cases.

9cpluss commented 5 months ago

Hi @bourbonspecial and @ilchef, thanks for your suggestions and thoughts on this.

Just wanna be very open about the requirements for the data format before I get into the suggestions, because it explains why it is the way it is right now.

Open (everybody can read and edit)
Human readable (ideally)
It needs to be easy for me to edit/add because I am the main contributor and often do it on my phone

I do like the idea of capturing more data so I think I am willing to make some compromise with regards to 3. Having 3 relational CSVs would not be to bad.

But I think primary keys for climbers and climbs are required because names of climbs and climbers can be complicated at times (with odd spellings, nicknames, with or without "the" in front, variants on indicating sit starts etc). It's easier to look up a number, then to spell a climb and climber name correctly.

As mentioned, we don't have much data to deal with, so just loading the CSVs into dataframes should be fine. I already do that in the current version of the code. Don't think a sqlite database is necessary.

Basically we can just start with all climbs and climbers and number them from 1 to N (number of climbs/climbers) and then increase the number by one every time a new climb or climber is added and just have a simple check for that (increasing and unique).

In terms of CSVs and fields, I think @ilchef you almost nailed it already:

sends:
climber_id, climb_id, date, confidence (in date - exact or approx), given_grade, fa (first ascent indicator), video_link

climbs:
climb_id, name, type (boulder or sport), country, crag, long, lat, grade (this one could be derived from given_grades once we have them)

climbers:
climber_id, first_name, last_name, country, gender, year_of_birth

There are some fields that we don't have data for yet. I think it makes sense to reformat the data we already have first and update the code to use the reformated data. And then add the additional fields and collect the data. This could be done in steps as well.

ilchef commented 5 months ago

awesome - Ive finished exams now, so I can get started

9cpluss commented 5 months ago

Great, let me know if you have any questions.

ilchef commented 5 months ago

@9cpluss Ive committed climbers table to my branch: https://github.com/ilchef/hardest-climbs/tree/refactor

There's a pretty high missing rate for year of birth - but I guess we knew that was unavoidable. Noone without a wikipedia page has an easily accessible date of birth.

Im planning releases in 4 stages:

1. climbers table
1. route table - this can be refactored from your json quite painlessly
1. ascent table
1. data quality checks. This can be a script we run prior to committing updates to data - could check for things like duplicate primary keys, nulls, climbers with similar names, reconciling previous commits, etc. thinking either yaml or just basic python script for this
Let me know how many pull requests you want. From a hygiene perspective it might be easier to do piecemeal merges rather than everything at once

9cpluss commented 5 months ago

Hi @ilchef Thanks! Looking good so far. Couple bits of feedback:

You can remove the _table prefix from the csv. The name plus ending makes it pretty clear.
I think climber_id can just be a number. It doesn’t need the prefix cl. So 1, 2, …97 is perfectly fine.
You have normalised the names which is useful for constructing URLs but by doing that you have also removed some “special” characters from some names. For example clement lechaptois is actually Clément. I would like to keep the correct spelling of all names.
It looks like 8a.nu has ages for some of the people I just checked. I assume underlying is the year of birth. For example: https://stage.8a.nu/user/nicolas-pelorson

For the releases: Are you thinking of also implementing the code that reads from the new CSV files instead of the JSON? I guess that really only makes sense once we have all 3 CSV files though. But I can implement that as well.

I think in terms of merge requests, I think we can do 3 individual merge requests for the CSVs. They can live side by side the JSON files for not until we refactor the code to use the CSVs. Then we can delete the JSON files with a “final” release.

I actually just added a new ascent today by a climber that has not been in the JSON yet. That is something we have to keep in mind. But once we have the first CSV I can also make sure to update it if necessary whenever I add new data until we get rid of the JSON files.

ilchef commented 5 months ago

sure re: 1,3 and 4.

I added cl as prefix for primary key so that is coerced into a string by every application. If we store it as a number, people using data may accidentally give it numeric properties if they open it in excel or whatever (decimal places, different notation, etc.). I figure it doesn't need to be a number because noone is doing maths with it. The prefixes are also good to avoid mixing up different ids imo.

btw also just committed the route table, kept in special characters- I will remove "_table" from csv.

Will do a commit with 1,3 and 4 probably tomorrow or day after.

9cpluss commented 5 months ago

Great, thank you :)

Regarding the id prefix: yes, that is a valid point. We can keep the prefix.

ilchef commented 4 months ago

just committed with 1 and 3. Im having trouble accessing that site so haven't added in the extra birth years.

im going to prioritise coding up the ascent table now - and once thats ready lining up the 4 PRs. will look at improving missing rate of birth year in a subsequent commit - as its hard to automate and not a crucial data element.

9cpluss commented 4 months ago

Sounds good! Yeah, I am not sure if one needs to log in to access all their data. But sounds good.

Just had a quick look at the data and noticed that I didn’t add country and area for the boulder “Adrenaline”. I have fixed that now. In your CSV you named the column “crag” but it’s not always the name of the crag. It’s somewhat inconsistent but I think “area” is still the better name.

9cpluss commented 2 months ago

Hi @ilchef just checking in if you're still working on it or if you need any help with it?

9cpluss commented 3 weeks ago

Will close this one for now. Let me know if you wanna work on it again.

9cpluss / hardest-climbs

Suggestions for refactoring data (happy to help!) #39