EXACTsports / roster-project

Experimenting with scraping college team rosters

Review the brief and ask any questions on this project #1

Open barrytarter opened 2 years ago

barrytarter commented 2 years ago

@hardcommitoneself if you have any technical questions, feel free to post them in this issue, as that should allow us to document the development process better.

hardcommitoneself commented 2 years ago

I'd like to know more about the spreadsheet and the roster table. Please send me a sample spreadsheet file.

Schema::create('rosters', function (Blueprint $table) {
    $table->id();
    $table->string('university');
    $table->string('url');
    $table->string('sport');
    $table->timestamps();
});

hardcommitoneself commented 2 years ago

Should we use the TALL stack in our app?

barrytarter commented 2 years ago

@hardcommitoneself

Here is what @edgrosvenor shared with me -- this will mainly be all back-end functionality, so feel free to use whatever you prefer (e.g. Browsershot, curl, even Python is OK). The early output might be CSVs of the profile data just to check it (e.g. name, position, year in school).

If you are planning to do a front-end piece, TALL would be useful.

Does that make sense?

hardcommitoneself commented 2 years ago

Thanks for letting me know, @barrytarter. It makes sense. So first I will scrape basic profile data (name, position, year, etc.) from the URLs provided in the Excel file. I am not sure if you saw my Slack message; I mentioned that I will use Roach PHP to scrape data from the other sites.

barrytarter commented 2 years ago

@hardcommitoneself great, yes, this is the best place to reach both me and Ed!

hardcommitoneself commented 2 years ago

@barrytarter

I just finished the Excel import feature and am now going to build the scraper. After an Excel file is imported, should the scraper run automatically, or do we need to trigger it manually (with something like a "start scraping" button)?

barrytarter commented 2 years ago

@hardcommitoneself For now, whatever is easiest to get a 'test' version live that successfully pulls and stores data. If @edgrosvenor has any tips, he'll share them here as well.

You'll need to create unique decision rules for pulling the roster data as some rosters are very similar and others are different, e.g. these two are sites that use "Sidearm Sports" templates: https://acusports.com/sports/womens-volleyball/roster https://asugrizzlies.com/sports/mens-soccer/roster

These ones also use Sidearm Sports, but a different template, I think: https://aupanthers.com/sports/mens-soccer/roster https://bamastatesports.com/sports/womens-volleyball/roster https://auwolves.com/sports/mens-soccer/roster

These are both from Presto Sports templates, but the templates look different: https://goamcats.com/sports/msoc/2017-18/roster https://www.sunyadktimberwolves.com/sports/msoc/2017-18/roster

hardcommitoneself commented 1 year ago

@barrytarter

How about counting the tr elements in every table on each page? From what I have seen so far, there seems to be only one table per page with many rows (I think that is the one we want).
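
To make the idea concrete, here is a rough sketch of the row-count heuristic in Python using only the standard library. The names `TableRowCounter` and `pick_roster_table` and the `min_rows` threshold are placeholders I made up for illustration, not part of the project, and the real scraper may do this differently.

```python
from html.parser import HTMLParser


class TableRowCounter(HTMLParser):
    """Counts <tr> elements per <table>, crediting rows to the innermost open table."""

    def __init__(self):
        super().__init__()
        self.row_counts = []  # one entry per table, in document order
        self._open = []       # stack of indexes into row_counts

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.row_counts.append(0)
            self._open.append(len(self.row_counts) - 1)
        elif tag == "tr" and self._open:
            self.row_counts[self._open[-1]] += 1

    def handle_endtag(self, tag):
        if tag == "table" and self._open:
            self._open.pop()


def pick_roster_table(html, min_rows=5):
    """Return the index of the table with the most rows, or None if none is big enough.

    min_rows is an assumed threshold, not a value from the project.
    """
    counter = TableRowCounter()
    counter.feed(html)
    if not counter.row_counts:
        return None
    best = max(range(len(counter.row_counts)), key=counter.row_counts.__getitem__)
    return best if counter.row_counts[best] >= min_rows else None
```

The returned index is the position of the table in document order, so the caller can then extract rows from just that table.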

barrytarter commented 1 year ago

@hardcommitoneself I like that approach. We might need a way to decipher the type of content listed.

e.g. grade level (aka "graduation year") values could be categorized by word: 'freshman', 'sophomore', 'junior', 'senior'. I look forward to seeing how you figure it out!
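
One possible way to decipher a column's content type is a simple majority vote over its cell values. This is only a sketch: the function name `guess_column_type`, the abbreviation set, and the height pattern are my own assumptions, and variants such as redshirt years are not handled.

```python
import re

# Words and abbreviations that suggest a class-year column (assumed list).
YEAR_WORDS = {"freshman", "sophomore", "junior", "senior", "fr", "so", "jr", "sr"}
# Matches heights written like 5'9" or 6-1 (assumed formats).
HEIGHT_RE = re.compile(r"^\d['-]\s?\d{1,2}\"?$")


def guess_column_type(values):
    """Guess what a roster column holds; returns 'year', 'height', or 'unknown'."""
    cleaned = [v.strip().lower().rstrip(".") for v in values if v and v.strip()]
    if not cleaned:
        return "unknown"
    year_hits = sum(1 for v in cleaned if v in YEAR_WORDS)
    height_hits = sum(1 for v in cleaned if HEIGHT_RE.match(v))
    # A column is classified only when more than half of its cells agree.
    if year_hits > len(cleaned) / 2:
        return "year"
    if height_hits > len(cleaned) / 2:
        return "height"
    return "unknown"
```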

hardcommitoneself commented 1 year ago

@barrytarter

I just noticed that some rosters have no tables (they use a list instead): https://www.artuathletics.com/sports/womens-volleyball/roster I think we need to build logic for the ul list.
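
A minimal sketch of what the ul fallback could look like, assuming the roster list is flat (no ul nested inside an li) and that the longest list on the page is the roster. `ListItemCollector` and `largest_list` are hypothetical names, and real list-based templates will likely need per-field selectors on top of this.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collects the text of each <li>, grouped by the most recently opened <ul>."""

    def __init__(self):
        super().__init__()
        self.lists = []      # one list of item strings per <ul>
        self._in_li = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.lists.append([])
        elif tag == "li" and self.lists:
            self._in_li = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_li:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._in_li:
            # Collapse runs of whitespace so each item becomes one clean line.
            self.lists[-1].append(" ".join(" ".join(self._buffer).split()))
            self._in_li = False


def largest_list(html):
    """Return the items of the <ul> with the most <li> children (empty if none)."""
    collector = ListItemCollector()
    collector.feed(html)
    return max(collector.lists, key=len, default=[])
```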

hardcommitoneself commented 1 year ago

image This is what I have so far. I think it will be the base of our scraper. Please check it out and let me know your feedback.

hardcommitoneself commented 1 year ago

image

Please take a look at this screenshot and notice the Year field. Its values are different from the others. How can I convert the numbers (1, 3, etc.) to real year values (Fr., Sr., etc.)?

barrytarter commented 1 year ago

@hardcommitoneself here is one possible guide on how to map the data: https://docs.google.com/spreadsheets/d/1QBCGpvXjoDAH50wQTTnYLj5cWzb3TlXWWUPn-g3kk78/edit?usp=sharing.

Specifically for the numbers, it could map as 1 = Freshman; 2 = Sophomore; 3 = Junior; 4 = Senior; 5 = Senior; 6 = Senior.
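
As a sketch, the numeric mapping above could be combined with the common abbreviations in a single lookup. The numeric table is taken straight from the mapping in this comment; the abbreviation table and the name `normalize_year` are assumptions added for illustration.

```python
# Numeric mapping as described in this thread.
NUMERIC_YEAR = {1: "Freshman", 2: "Sophomore", 3: "Junior", 4: "Senior", 5: "Senior", 6: "Senior"}

# Assumed word and abbreviation variants; extend as new rosters turn up.
WORD_YEAR = {
    "fr": "Freshman", "freshman": "Freshman",
    "so": "Sophomore", "sophomore": "Sophomore",
    "jr": "Junior", "junior": "Junior",
    "sr": "Senior", "senior": "Senior",
}


def normalize_year(raw):
    """Map a raw roster value such as '1', 'Fr.' or 'Senior' to a class year, or None."""
    value = str(raw).strip().lower().rstrip(".")
    if value.isdigit():
        return NUMERIC_YEAR.get(int(value))
    return WORD_YEAR.get(value)
```

Returning None for anything unrecognized keeps unknown values visible for manual review instead of guessing.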

hardcommitoneself commented 1 year ago

@barrytarter

Please check this video and give me feedback: https://www.loom.com/share/262f7d29525f45eba0caa4e8455a965d

hardcommitoneself commented 1 year ago

@barrytarter @edgrosvenor

Regarding the extra field of the athlete table, should we add the following fields to it? image

barrytarter commented 1 year ago

@hardcommitoneself,

Thanks for sharing. Can we store both as text for now? The first is a height field and the second is where they played in high school. These are pretty common, so good to collect.

barrytarter commented 1 year ago

@hardcommitoneself will you be able to begin developing the crawler that will find the missing Twitter and Instagram IDs?

Step 3 in https://docs.google.com/document/d/1YmfAFYu4Cyl99ninB4KAeML4y-nmRW0gzI6Xeydg_2g/edit?usp=drivesdk

Can you get a v1 of that part ready by Wednesday?

edgrosvenor commented 1 year ago

@hardcommitoneself Go ahead and add any data that you think might be valuable as key / value pairs in the extra column. While you're at it, enable this package for that column: https://github.com/spatie/laravel-schemaless-attributes That will allow you to do things like $athlete->extra->set('height', '5\'9"');. I think maybe I've included the package in composer (maybe not), but I haven't added the trait to the model.

hardcommitoneself commented 1 year ago

@barrytarter @edgrosvenor

Regarding the second crawler, I think we can use opendorse.com to scrape our athletes' contact info. The following is just my opinion.

  1. First, search for the university by name: https://opendorse.com/searchshowAthletesNotOptedInToDeals=true&showUnclaimedAccounts=true&term=Abilene+Christian+University
  2. Then go to the relevant university page: https://opendorse.com/abilenechristian-wildcats
  3. Then filter by sport: https://opendorse.com/abilenechristian-wildcats?sports=Soccer
  4. Finally, we should find our athletes on that page: https://opendorse.com/profile/ellen-joss?from=abilenechristian-wildcats

I am not sure this approach will work for all rosters, so I want to test it with real links.
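
The URL-building part of the steps above can be sketched as follows. This assumes the search endpoint is `/search` followed by a query string, and that the team slug (e.g. `abilenechristian-wildcats`) has to be discovered from the search results rather than derived from the university name. The function names are placeholders.

```python
from urllib.parse import urlencode

BASE = "https://opendorse.com"


def search_url(university):
    """Step 1: build the search URL for a university name (assumed /search endpoint)."""
    params = {
        "showAthletesNotOptedInToDeals": "true",
        "showUnclaimedAccounts": "true",
        "term": university,
    }
    return f"{BASE}/search?{urlencode(params)}"


def team_sport_url(team_slug, sport):
    """Step 3: build the team page URL filtered by sport."""
    return f"{BASE}/{team_slug}?{urlencode({'sports': sport})}"
```

Fetching the pages and matching athlete names against our roster would sit on top of this and is not shown here.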

hardcommitoneself commented 1 year ago

@barrytarter @edgrosvenor

Here is my suggestion. I think we had better use Google search with the athlete's name, sport, and college for our contact crawler. I checked manually with many athletes and the results looked good.

Example search queries: google.com/search?q=twitter+Nicole+Barham+ACU+soccer and https://www.google.com/search?q=instagram+Nicole+Barham+ACU+soccer
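
A small sketch of how these query URLs could be generated per athlete. The function name `social_search_url` is made up for illustration; it only builds the URL and does not fetch or parse results.

```python
from urllib.parse import quote_plus


def social_search_url(network, name, college, sport):
    """Build a Google search URL in the pattern: network + name + college + sport."""
    query = " ".join([network, name, college, sport])
    # quote_plus turns spaces into '+' and escapes anything unsafe in a query string.
    return "https://www.google.com/search?q=" + quote_plus(query)
```

Calling it with "twitter", "Nicole Barham", "ACU", and "soccer" reproduces the first example query.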

Please take a look and let me know what you think.

barrytarter commented 1 year ago

Sure, we can test that out and see how the data looks.

hardcommitoneself commented 1 year ago

@barrytarter @edgrosvenor

Hi, hope you are having a nice weekend!

Please take a look at this video: https://www.loom.com/share/96661444867a4df98f6fdef1756662e3 You can see that the scraper is working well. Please give me your feedback.

Sorry to bother you. :)

barrytarter commented 1 year ago

@hardcommitoneself here are some more good roster links. Can you check how many profiles you can pull from the rosters (100%?), how much data is filled in for position, height, weight, grad year, and "from", and how many Twitter, Instagram, and Opendorse links you get?

https://artuathletics.com/sports/mens-soccer/roster
https://asugrizzlies.com/sports/mens-soccer/roster
https://aupanthers.com/sports/mens-soccer/roster
https://adrianbulldogs.com/sports/msoc/roster
https://www.albertusfalcons.com/sports/msoc/2022-23/roster
https://gobrits.com/sports/mens-soccer/roster
https://albrightathletics.com/sports/mens-soccer/roster
https://alfredstate.prestosports.com/sports/msoc/2022-23/roster
https://gosaxons.com/sports/mens-soccer/roster
https://alicelloydeagles.com/sports/msoc/2022-23/roster
https://www.ahcbulldogs.com/sports/msoc/2022-23/roster
https://www.allegany.edu/athletics/mens-soccer.html
https://alleghenygators.com/sports/mens-soccer/roster
https://sccstorm.com/sports/msoc/2022-23/roster
https://almascots.com/sports/msoc/2022-23/roster
https://auwolves.com/sports/mens-soccer/roster
https://www.aicyellowjackets.com/sports/msoc/2022-23/roster
https://www.arcbeavers.com/sports/msoc/2022-23/roster
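
One way to answer the coverage questions is a per-field fill-rate report over the scraped profiles. This is a sketch that assumes each athlete is available as a dict; the field names in `FIELDS` and the function name `fill_rates` are placeholders and may not match the actual athlete table.

```python
# Assumed field names; adjust to the real athlete columns and extra keys.
FIELDS = ["position", "height", "weight", "year", "from", "twitter", "instagram", "opendorse"]


def fill_rates(athletes, fields=FIELDS):
    """Return {field: percentage of athletes with a non-empty value for that field}."""
    total = len(athletes)
    if total == 0:
        return {field: 0.0 for field in fields}
    return {
        field: round(100 * sum(1 for a in athletes if a.get(field)) / total, 1)
        for field in fields
    }
```

Running this once per roster URL would show which templates are parsed well and which still lose fields.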

hardcommitoneself commented 1 year ago

@barrytarter @edgrosvenor

Please take a look at the following: https://www.loom.com/share/0708dffa27714d9eb3f0ac3072bb77c7 I implemented fully automated scraping of Twitter IDs as a test. I think this scraper found almost all of the Twitter IDs, so please check the results manually and give me feedback. I already implemented the Opendorse logic last week, so I need to implement the Instagram logic now.

barrytarter commented 1 year ago

@hardcommitoneself could we add a method that would allow us to get this person's Instagram and Twitter? It's Caleb Kendra; you'll see his name at 0:57 in https://www.loom.com/share/0708dffa27714d9eb3f0ac3072bb77c7. e.g. https://www.instagram.com/c_kendra2/

hardcommitoneself commented 1 year ago

@barrytarter

So, do you want to get the full Twitter link of athletes, like https://www.instagram.com/c_kendra2/ ?

barrytarter commented 1 year ago

@hardcommitoneself yes, we want the twitter, instagram, opendorse links for all athletes in the crawler.
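
For example, if the crawler only has a handle (or a messy URL) for an athlete, a small helper could normalize it into a full link before storing. This is a sketch under my own assumptions: the helper names and the URL templates for each network are illustrative, not something already in the project.

```python
from urllib.parse import urlparse

# Assumed canonical profile URL formats per network.
PROFILE_TEMPLATES = {
    "twitter": "https://twitter.com/{handle}",
    "instagram": "https://www.instagram.com/{handle}/",
}


def handle_from_url(url):
    """Extract the handle (first path segment) from a profile URL, or None."""
    path = urlparse(url).path.strip("/")
    return path.split("/")[0] if path else None


def profile_url(network, handle):
    """Build the full profile link for a handle, tolerating a leading '@'."""
    return PROFILE_TEMPLATES[network].format(handle=handle.lstrip("@"))
```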

hardcommitoneself commented 1 year ago

@barrytarter

OK. As we discussed before, we cannot get social links for many athletes, since most of them don't have any. Anyway, please take a look at the following. image

barrytarter commented 1 year ago

@hardcommitoneself yes, if it doesn't exist, we definitely can't store one.

Caleb Kendra does have one but we didn't store it -- how do we fix that?

hardcommitoneself commented 1 year ago

@barrytarter

I think we can store it. What's the problem? image This is the structure of the athlete table.

barrytarter commented 1 year ago

Great! Why didn't it store previously?

hardcommitoneself commented 1 year ago

@barrytarter

Please take a look at this. I implemented the Opendorse scrape method, so we can get not only the Opendorse link but also the Twitter or Instagram link from there.
https://www.loom.com/share/2b18bd1d24a04f5bbdcd018221ab7a4a