PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License

test_delay.py, changes to delay.py and future testing #71

Closed: thebigG closed this 4 years ago

thebigG commented 4 years ago

Testing for delay.py

Hello everyone,

Hope you are all doing well!

Description

I have added testing for delay.py. Just as before, I tried my best to isolate each unit so that when a test fails, it's easy to spot the problem. I also made changes to delay.py itself. Specifically, I removed a sys.exit() call in delay_alg that followed an exception, which didn't make sense because an uncaught exception already aborts the flow of execution, so the call was unreachable; see b8a05265d3077bfff95ece3e2477dd8e1dec8f3d for details. I also removed a try/except block that wrapped most of delay_alg; please let me know if I should add it back. I saw no reason for it being there, but maybe I missed something. See 134cf1646f67242b33b6146097bb4d6939b39608 for details. I also fixed an issue in 96468e6eda447100eaa78b7a8164b13e76603b1c. Hopefully all of this is clear enough.
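For anyone reading along, here is a minimal, self-contained sketch of the sys.exit() point; the function name and body are hypothetical, not the actual delay.py code. Once an uncaught exception is raised, the statements after it never execute, so a sys.exit() placed there is dead code:

```python
import sys


def delay_alg_sketch(duration):
    """Hypothetical stand-in for a delay helper that validates its input."""
    if duration < 0:
        raise ValueError("delay duration must be non-negative")
        sys.exit(1)  # dead code: the raise above already left this function
    return duration


try:
    delay_alg_sketch(-1)
except ValueError as err:
    print(f"caught: {err}")  # control flow lands here; sys.exit(1) never ran
```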

A Few Thoughts About Future Testing

It's very likely that in the following weeks I'll try to tackle the rest of the testing, which of course means the heavier, more complex modules that do the actual scraping. The modules I have tested so far have been pretty straightforward because they are mostly functional: no classes, no complex data types, not even any concurrency. Because of this, I would really appreciate any input you might have. Specifically, I am concerned about a few things.

The first is how much testing is actually needed to cover the entirety of JobFunnel. This is something we could probably figure out as we go along. The reason I raise it is that, technically speaking, we already exercise JobFunnel in TravisCI, which is pretty great, but there are plenty of functions in the scraping modules (monster.py, glassdoor.py, etc.) that could be tested in isolation.

Another thing that comes to mind is possible big changes to the codebase in the future. I know people have posted issues about internationalization and about supporting more sites. Has anyone given either of these problems any thought, or even started prototyping? Specifically, will adding another site require architectural changes to the codebase? Do you see such a change as simply "another scraping module", like another_job_site.py, or would it look a little different, perhaps because of internationalization?

I hope my concerns are clear enough. I know these are complex problems with many possible approaches, but I think it will help me, or anyone else writing tests, to be aware of any thoughts/views you might have on this.

Cheers!


studentbrad commented 4 years ago

This is great! I agree with the changes in the delay config; perhaps someone else would like to comment on that further. Testing the rest of the code is definitely going to be difficult. The project could change a lot in the future, but testing should always reflect what is implemented now; future changes to the codebase will simply have to come with matching changes to the tests. If we do support more links and sites, each site will be added as its own file, unless we find a smarter approach like using some web crawling technique. Alternate links for existing sites will have to be implemented under the correct file, like glassdoor.py.
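As a purely hypothetical sketch of the "one file per site" idea (the class names and interface below are illustrative, not JobFunnel's actual scraper classes), a new source might look something like this in its own module, e.g. another_job_site.py:

```python
from abc import ABC, abstractmethod


class BaseScraper(ABC):
    """Shared interface that each site-specific scraper would implement."""

    @abstractmethod
    def scrape(self):
        """Return a list of job postings as dictionaries."""


class AnotherJobSiteScraper(BaseScraper):
    """Would live in its own module, e.g. another_job_site.py."""

    def scrape(self):
        # a real implementation would fetch and parse the site's listings
        return [{"title": "Example Job", "company": "Example Co"}]
```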

The reason I made _testcountries.py an xfail (expected-failure) test is that there's no telling which cities we will find on the sites, which means some lookups are likely to fail. I suggest a similar approach for testing scraping. The sites could change, and that is beyond our control; we should be notified when scraping no longer works for a specific site, but our build should not fail as a result.
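In pytest terms, one way to express that policy is to mark live-site tests as expected failures so they surface in the report without breaking CI. This is a hypothetical sketch, with a fake scrape standing in for a real network call:

```python
import random

import pytest


@pytest.mark.xfail(reason="depends on live site content, which is beyond our control",
                   strict=False)
def test_city_listings_found():
    # stand-in for a real scrape: pretend the site sometimes returns nothing
    postings = random.choice([[], ["posting 1", "posting 2"]])
    assert postings, "expected at least one posting"
```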

markkvdb commented 4 years ago

Excellent contribution @thebigG! The delay functionality is definitely non-trivial to test, so it's great that you could verify it actually works as expected.

Your method for testing the random delay functionality, with the fake uniform sampling, is quite original. Another way this is commonly done is with seed values: seeding your random number generator with a fixed value will always give you the same random samples. Right before you sample the random numbers you would call random.seed(<your_seed>), and the resulting draws are always the same.
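A minimal sketch of that seed-based approach (the test name is hypothetical and not part of this PR): re-seeding before each run makes random.uniform reproducible, so the result can be asserted exactly.

```python
import random


def test_uniform_delays_are_reproducible():
    random.seed(1234)
    first = [random.uniform(0, 10) for _ in range(5)]

    random.seed(1234)  # re-seeding replays exactly the same sequence
    second = [random.uniform(0, 10) for _ in range(5)]

    assert first == second
```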

I totally agree with you about testing the rest of the code base. I think it's good that you started thinking about this too, because I also believe the code base will face some restructuring in the future. It's great that the scrapers for the different sources already inherit from a standard interface, but I don't know if the interface is complete right now, especially with an eye on the most requested feature: internationalisation.

I have not had a lot of time lately because I'm finishing my master's thesis, but I remember thinking that the functions of the actual scrapers are often too long and could be abstracted further. However, when we do make these changes, I think it's good to have a clear overview of what is needed to support more countries. Otherwise, we might refactor our code without the changes necessary for that support. In short, it's a difficult problem, and I think we would have to think about the structure first before starting to implement these changes.

In light of all of this, maybe we should discuss some of these design issues on another channel like Slack? What do you guys think?

PaulMcInnis commented 4 years ago

@markkvdb Yes, I think it's a good idea for us to have a discussion; a Slack or Google Chat would probably be best. My time is also limited right now.

Slack looks good, but it does limit conversations to 10k messages on the free tier, so it'd have to be temporary.

I'm open to suggestions though. Perhaps we can use/abuse the GitHub wiki to hold technical design ideas? One page per idea, kind of thing?

thebigG commented 4 years ago

It would be really nice to have a medium, such as Slack, through which we can discuss technical/design details. Any thoughts on IRC? I believe one can create a free channel on a server like freenode. I do like the idea of a wiki too, since we could draft ideas there and work on them at a design/theoretical level. I guess I'm saying it'd be nice to have both for the project: the wiki as a platform for people to present ideas, and the chat as a way of having more interactive discussions about the project. Would having both be overkill?

thebigG commented 4 years ago

I hope my refactoring makes sense. I didn't test the entire module in one test function because I didn't want to cram everything into one place, since that might sacrifice some readability.
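As an illustration of that split (hypothetical names, not the actual test_delay.py code), each behaviour gets its own small test so a failure points at exactly one thing:

```python
def _constant_delays(count, value):
    """Toy stand-in for a delay-list generator."""
    return [value] * count


def test_delay_list_has_requested_length():
    assert len(_constant_delays(5, 2.0)) == 5


def test_delay_list_uses_requested_value():
    assert all(d == 2.0 for d in _constant_delays(5, 2.0))
```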

PaulMcInnis commented 4 years ago

As for chat, feel free to create an IRC channel on freenode; alternatively, we can use a Discord or similar.

Using the wiki for design is fair use; feel free to use it for that :)