Closed ilchef closed 3 weeks ago
Hey! Thank you. :)
Love the idea but let me get back to you on the specific points. Busy weekend.
Hey! Had some time to think.
What do you think?
On moving to a tabular data format, I wonder whether you could have a series of CSVs as the master data (e.g. `climbs.csv`, `climbers.csv` and `ascents.csv`) and then have a build script that ingests this data and outputs it into a variety of formats. For example, the script could produce an sqlite database as one of the outputs.
The advantage of this would be that it's easy to collaborate via the repo and PRs, but people can still consume the data in a format that works for them.
On the downside, enforcing relational constraints (foreign keys etc.) could get messy with the data in CSVs. I guess some build script checks would help here but it could be tricky to keep on top of errors as the feedback loop is a little slow.
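A minimal sketch of what such a build script could look like (file names, the table-name-to-path mapping, and storing every column as TEXT are all assumptions, not anything decided in this thread):

```python
import csv
import sqlite3

def build_sqlite(csv_paths, db_path):
    """Ingest master CSVs and emit an sqlite database, one table per file.

    csv_paths: mapping of table name -> path to CSV. Every column is stored
    as TEXT, which sidesteps type-guessing and is fine at this data volume.
    """
    conn = sqlite3.connect(db_path)
    for table, path in csv_paths.items():
        with open(path, newline="") as f:
            reader = csv.DictReader(f)
            cols = reader.fieldnames
            conn.execute(f'DROP TABLE IF EXISTS "{table}"')
            col_defs = ", ".join(f'"{c}" TEXT' for c in cols)
            conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
            placeholders = ", ".join("?" for _ in cols)
            rows = [tuple(row[c] for c in cols) for row in reader]
            conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return conn
```

Other output formats (JSON, parquet, ...) would just be more loops over the same parsed CSVs.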
@bourbonspecial I was thinking something very similar to you - 3 relational CSVs. Not sure if we need to bother with formal keys or unique identifiers, as long as v17 climbers' names are unique.
send_table: climber_name, climb_name, type (boulder or sport), date, confidence (in date - exact or approx)
climb_table: climb_name, country, crag, long, lat, grade
climber_table: climber_name, country, gender, year_of_birth
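To illustrate how a name-keyed join across those tables would work, here's a toy example (the rows are invented for illustration and the exact column set is as proposed above):

```python
import csv, io

# Invented sample rows matching the proposed send and climb columns
sends = list(csv.DictReader(io.StringIO(
    "climber_name,climb_name,type,date,confidence\n"
    "Nalle Hukkataival,Burden of Dreams,boulder,2016-10-23,exact\n"
)))
climbs = {r["climb_name"]: r for r in csv.DictReader(io.StringIO(
    "climb_name,country,crag,long,lat,grade\n"
    "Burden of Dreams,Finland,Lappnor,26.0,60.3,V17\n"
))}

# Name-keyed "join": works only as long as climb names are spelled
# identically in both files
for s in sends:
    c = climbs[s["climb_name"]]
    print(f'{s["climber_name"]} sent {s["climb_name"]} ({c["grade"]}) on {s["date"]}')
```

The fragility of this lookup is exactly the spelling problem raised below, which is the argument for numeric ids.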
@9cpluss
Human-readable data is great - I think our mainstream options are JSON and CSV.
Agree with your points on the inefficiency of tabular data - my workaround is to store it in relational tables (see above). Worth noting that the volumes of data we are working with will always be pretty tiny - inefficiency isn't going to cause any performance issues at the number of v17 sends.
Understand that JSON has its benefits though. If we want to keep JSON, I think it could be reshaped to capture all send info, including a confidence field for send dates we aren't confident about. I think a lot of this data (send dates) will never be perfectly known. In a lot of cases we can only really know the day it reached the internet.
Hi @bourbonspecial and @ilchef, thanks for your suggestions and thoughts on this.
Just wanna be very open about the requirements for the data format before I get into the suggestions, because it explains why it is the way it is right now.
I do like the idea of capturing more data, so I think I am willing to make some compromises with regards to 3. Having 3 relational CSVs would not be too bad.
But I think primary keys for climbers and climbs are required, because names of climbs and climbers can be complicated at times (with odd spellings, nicknames, with or without "the" in front, variants on indicating sit starts etc.). It's easier to look up a number than to spell a climb and climber name correctly.
As mentioned, we don't have much data to deal with, so just loading the CSVs into dataframes should be fine. I already do that in the current version of the code. Don't think a sqlite database is necessary.
Basically we can just start with all climbs and climbers and number them from 1 to N (number of climbs/climbers) and then increase the number by one every time a new climb or climber is added and just have a simple check for that (increasing and unique).
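The "increasing and unique" check described above could be as simple as this sketch (function names are mine, not anything in the repo):

```python
def validate_ids(ids):
    """Check the simple numbering scheme: ids must be exactly 1..N,
    unique and in increasing order."""
    if list(ids) != list(range(1, len(ids) + 1)):
        raise ValueError("ids must run 1..N with no gaps, duplicates or reordering")

def next_id(ids):
    """Id to assign when a new climb or climber is added."""
    return len(ids) + 1
```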
In terms of CSVs and fields, I think @ilchef you almost nailed it already:
sends:
climber_id, climb_id, date, confidence (in date - exact or approx), given_grade, fa (first ascent indicator), video_link
climbs:
climb_id, name, type (boulder or sport), country, crag, long, lat, grade (this one could be derived from given_grades once we have them)
climbers:
climber_id, first_name, last_name, country, gender, year_of_birth
There are some fields that we don't have data for yet. I think it makes sense to reformat the data we already have first and update the code to use the reformatted data, and then add the additional fields and collect the data. This could be done in steps as well.
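With ids in place, the referential checks between the three CSVs could look like this sketch (rows are plain dicts as `csv.DictReader` would yield them; the function name and error format are assumptions):

```python
def check_references(sends, climbs, climbers):
    """Return a list of error strings for sends that point at unknown ids."""
    climb_ids = {row["climb_id"] for row in climbs}
    climber_ids = {row["climber_id"] for row in climbers}
    errors = []
    for i, send in enumerate(sends):
        if send["climb_id"] not in climb_ids:
            errors.append(f"sends row {i}: unknown climb_id {send['climb_id']!r}")
        if send["climber_id"] not in climber_ids:
            errors.append(f"sends row {i}: unknown climber_id {send['climber_id']!r}")
    return errors
```

Run on every PR, this would keep the feedback loop on foreign-key errors short despite the data living in CSVs.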
Awesome - I've finished exams now, so I can get started.
Great, let me know if you have any questions.
@9cpluss Ive committed climbers table to my branch: https://github.com/ilchef/hardest-climbs/tree/refactor
There's a pretty high missing rate for year of birth - but I guess we knew that was unavoidable. No one without a Wikipedia page has an easily accessible date of birth.
I'm planning releases in 4 stages:
Let me know how many pull requests you want. From a hygiene perspective it might be easier to do piecemeal merges rather than everything at once
Hi @ilchef Thanks! Looking good so far. Couple bits of feedback:

- I'd drop the `_table` suffix from the CSV filenames. The name plus the `.csv` ending makes it pretty clear.
- For the releases: Are you thinking of also implementing the code that reads from the new CSV files instead of the JSON? I guess that really only makes sense once we have all 3 CSV files though. But I can implement that as well.
In terms of merge requests, I think we can do 3 individual merge requests for the CSVs. They can live side by side with the JSON files for now, until we refactor the code to use the CSVs. Then we can delete the JSON files with a “final” release.
I actually just added a new ascent today by a climber that has not been in the JSON yet. That is something we have to keep in mind. But once we have the first CSV I can also make sure to update it if necessary whenever I add new data until we get rid of the JSON files.
Sure re: 1, 3 and 4.
I added `cl` as a prefix for the primary key so that it is coerced into a string by every application. If we store it as a number, people using the data may accidentally give it numeric properties if they open it in Excel or whatever (decimal places, different notation, etc.). I figure it doesn't need to be a number because no one is doing maths with it. The prefixes are also good for avoiding mix-ups between different ids imo.
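The prefixed-id scheme is tiny to implement; a sketch (helper names are mine, and whether ids are zero-padded etc. is not decided here):

```python
def make_id(prefix, n):
    """e.g. make_id('cl', 7) -> 'cl7' - a string id Excel won't reformat."""
    return f"{prefix}{n}"

def split_id(prefixed, prefix):
    """Recover the number when needed, e.g. to compute the next id."""
    digits = prefixed[len(prefix):]
    if not prefixed.startswith(prefix) or not digits.isdigit():
        raise ValueError(f"bad id: {prefixed!r}")
    return int(digits)
```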
btw also just committed the route table, keeping the special characters in - I will remove "_table" from the csv names.
Will do a commit with 1,3 and 4 probably tomorrow or day after.
Great, thank you :)
Regarding the id prefix: yes, that is a valid point. We can keep the prefix.
Just committed with 1 and 3. I'm having trouble accessing that site so haven't added in the extra birth years.
I'm going to prioritise coding up the ascent table now - and once that's ready, lining up the 4 PRs. Will look at improving the missing rate of birth year in a subsequent commit, as it's hard to automate and not a crucial data element.
Sounds good! Yeah, I am not sure if one needs to log in to access all their data.
Just had a quick look at the data and noticed that I didn’t add country and area for the boulder “Adrenaline”. I have fixed that now. In your CSV you named the column “crag”, but it’s not always the name of the crag. It’s somewhat inconsistent, but I think “area” is still the better name.
Hi @ilchef just checking in if you're still working on it or if you need any help with it?
Will close this one for now. Let me know if you wanna work on it again.
Hi mate,
love your work.
I want to build a dashboard/interactive layer sitting on top of this.
I'd like to suggest the following changes to data for efficiency and capturing most information. I am happy to put these changes in a merge request if you're happy with them.
reformat dates: Get all dates in YYYY-MM-DD format. Where we don't know a date definitely (i.e. currently Jan-2021), we pick the earliest article about the send or Instagram post, whichever came first, as a best-guess/placeholder. Can do more with the data once dates are stored consistently.
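A sketch of the normalisation step, assuming the raw values are either already ISO or in the Jan-2021 style (month-only dates get the first of the month as a stand-in until an article/Instagram date has been looked up):

```python
from datetime import datetime

def normalise_date(raw):
    """Return (iso_date, confidence) for a raw date string.

    Both format strings are assumptions about what the current data
    looks like; anything else raises so it can be fixed by hand.
    """
    for fmt, confidence in (("%Y-%m-%d", "exact"), ("%b-%Y", "approx")):
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d"), confidence
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {raw!r}")
```

Storing the confidence alongside the date keeps the "best guess" explicit in the data rather than hidden.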
refactor data: from 1 entry = 1 route to 1 row = 1 send. Also proposing a switch from JSON to tabular data. I think we could have richer data by retaining info about repeats (including send dates etc.), not just FAs. FA can be a derived field from this data (i.e. the send with the minimum date for a given climb, e.g. 'burden of dreams').
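Deriving the FA flag from per-send rows could look like this sketch (field names are assumptions; ISO dates compare correctly as strings):

```python
def flag_first_ascents(sends):
    """Mark the earliest send of each climb as the FA, in place.

    sends: dicts with 'climb_name' and ISO-format 'date' keys (assumed).
    """
    earliest = {}
    for send in sends:
        name = send["climb_name"]
        if name not in earliest or send["date"] < earliest[name]:
            earliest[name] = send["date"]
    for send in sends:
        send["fa"] = send["date"] == earliest[send["climb_name"]]
    return sends
```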
I'm happy to do all the work here if you're comfortable. Let me know!!