catalyst-cooperative / pudl-scrapers

Scrapers used to acquire snapshots of raw data inputs for versioned archiving and replicable analysis.
MIT License

Add a scraper for the EPA-EIA Crosswalk #20

Closed aesharpe closed 2 years ago

aesharpe commented 2 years ago

This is a draft PR so folks can follow along on my progress towards adding an EPA-EIA Crosswalk scraper. Note that there are several line changes because of the black formatter--not trying to be nit-picky!

The scraper successfully downloads two files from the CAMD github page:

Questions:

Right now the scraper uses a direct link to download each of the files. I.e.:

1) Would it be more robust to start with the link to the main github page "https://github.com/USEPA/camd-eia-crosswalk" and work from there using xpaths to find the links to each of the files?

2) I decided against downloading the entire repo and then extracting the files I wanted, do you agree with this method?
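For concreteness, the xpath-style approach from question 1 could look roughly like the sketch below. The page snippet and link targets are made up stand-ins (a real scraper would fetch the live page, and would probably use lxml for full XPath support rather than the stdlib parser):

```python
# Sketch of question 1: discover file links on the repo page instead of
# hard-coding direct URLs. The HTML below is a made-up stand-in for the
# real https://github.com/USEPA/camd-eia-crosswalk page.
import xml.etree.ElementTree as ET

page = """
<html><body>
  <a href="/USEPA/camd-eia-crosswalk/blob/master/epa_eia_crosswalk.csv">crosswalk</a>
  <a href="/USEPA/camd-eia-crosswalk/blob/master/README.md">readme</a>
</body></html>
"""

tree = ET.fromstring(page)
# Collect every link that points at a CSV file in the repo
csv_links = [a.get("href") for a in tree.iter("a")
             if a.get("href", "").endswith(".csv")]
```

The upside is that renamed files would still be found as long as they stay CSVs; the downside is the scraper now depends on GitHub's page structure staying stable.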

zaneselvans commented 2 years ago

Hmm. In the event that the repo goes away, having the README and other documentation that's part of it would be very helpful for understanding how the CSVs were created and what they mean, so I think I'd lean toward grabbing the entire repo, which is available as a ZIP from: https://github.com/USEPA/camd-eia-crosswalk/archive/refs/heads/master.zip

zaneselvans commented 2 years ago

It might be worth doing a single bulk reformatting of everything with black in an isolated commit that we can add to the git-blame-ignore list. More generally we could bring over the pre-commit hooks from the cheshire repo. Let me know if you want help setting that stuff up.

aesharpe commented 2 years ago

> so I think I'd lean toward grabbing the entire repo

Ok that makes sense

aesharpe commented 2 years ago

@zaneselvans @cmgosnell Do you think I need to write a test for this? It's so straightforward I feel like it doesn't even need one.

zaneselvans commented 2 years ago

Regarding a test: You could also just run the scraper. The runner is on github and the data is on github and it's only 1MB so it will be super fast, and it'll catch when they (say) rename the master branch to main or change the repo in some other way.
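A run-the-scraper smoke test in that spirit might look like the sketch below. The `run_scraper` function here is a stand-in for the real entry point (which would actually download the ~1 MB repo ZIP); only the shape of the test is the point:

```python
# Sketch of a smoke test that just runs the scraper end to end and checks
# that it produced output. run_scraper is a placeholder for the real
# scraper entry point.
import tempfile
from pathlib import Path


def run_scraper(output_dir: Path) -> None:
    """Stand-in: the real scraper would fetch the crosswalk repo ZIP here."""
    (output_dir / "camd-eia-crosswalk.zip").write_bytes(b"placeholder")


def test_scraper_runs_end_to_end():
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp)
        run_scraper(out)
        assert any(out.iterdir()), "scraper produced no files"
```

Because the real download is tiny and hits GitHub from GitHub-hosted runners, such a test would stay fast while catching upstream changes like a renamed default branch.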

aesharpe commented 2 years ago

> Regarding a test: You could also just run the scraper. The runner is on github and the data is on github and it's only 1MB so it will be super fast, and it'll catch when they (say) rename the master branch to main or change the repo in some other way.

Do you mean that I should write a test that runs the scraper or just run the scraper as a test any time I want to see if it still works? There isn't a test for the census data either.

If I should write a test, should I make one of those html pages like the eia datasets?

aesharpe commented 2 years ago

Oh also all of the tests in this repo have the word test as a prefix rather than a suffix. EX: test_eia860.py vs. eia860_test.py. Should I change these? It seems like they are still able to run, at least in the terminal, with pytest...

zaneselvans commented 2 years ago

There's no test for the census data because there's only one file, census is good about stability, and the data only changes once every 10 years. But it's big and not within GitHub and we do almost nothing to process it.

I was suggesting that you actually run the scraper as a test, but now I'm recalling that all of the tests in here created by Pablo were unit tests with mocked webpages and yeah I don't think that's a thing you need to create, so don't worry about adding any tests in this case.

pytest will automatically discover test modules that match test_*.py or *_test.py. In our other repos we've adopted the convention of *_test.py and have a pre-commit hook that enforces this convention. We haven't yet brought the pre-commit hooks into the scrapers / zenodo repos, so I would say don't worry about it so long as it's working with pytest. When I bring more CI stuff in here I'll get it all standardized.
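For reference, the standard `name-tests-test` hook from pre-commit-hooks enforces the `*_test.py` suffix by default, so the hook config in question might look roughly like this (the `rev` pin is a placeholder):

```yaml
# Possible .pre-commit-config.yaml entry enforcing the *_test.py suffix
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0  # placeholder; pin to whatever release is current
    hooks:
      - id: name-tests-test
```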

zaneselvans commented 2 years ago

Though maybe I should get the CI in here soon since pre-commit-ci is trying to run and failing because it's not configured! 😆

aesharpe commented 2 years ago

> Though maybe I should get the CI in here soon since pre-commit-ci is trying to run and failing because it's not configured! 😆

Maybe it makes sense to make this another PR and close this one out.