Seperate tests according to server dependence

Christovis commented 3 years ago

This PR contains the following changes:

Changes to the /tests/ folder structure as shown here
the above suggests seperating Listserv tests into:
- a default test suit that is based on data located in /tests/data/
- a optional tests regarding direct webscraping and take longer as it relies on server connection
adds to_conversationkg_dict() method to ListservList and ListservArchive classes which creates the required data structure for constructing knowledge graphs with ConversationKG

npdoty commented 3 years ago

I get 3 errors on running these tests, all versions of this:

___________________________________ ERROR at setup of TestListservArchive.test__to_dict ___________________________________

self = <tests.test_listserv.TestListservArchive object at 0x7f90abadde10>

    @pytest.fixture(name="arch", scope="session")
    def test__from_listserv_directory(self):
        arch = ListservArchive.from_listserv_directory(
            name="3GPP",
            directorypath=project_directory + "/tests/data/3GPP/",
        )
        assert arch.name == "3GPP"
        assert len(arch) == 2
>       assert len(arch.lists[0]) == 325
E       assert 29 == 325
E        +  where 29 = len(<bigbang.listserv.ListservList object at 0x7f90ac7eeb90>)

tests/test_listserv.py:167: AssertionError

The tests are still on the slow side, and the number of files involved is probably larger than necessary for minimal tests, but it's still a huge improvement on speed. (If the test failure indicates that one version expects to crawl over 10 times as many lists, then I suggest going with the 29 list rather than the 325 version.) If we can fix these failures, then I think it's fine to merge.

Christovis commented 3 years ago

Hmmm... I get the following result when running the tests:

~/…/datactive/bigbang/tests dev_listserv_tests
❯ pytest test_listserv.py                                                                                   ❮
=============== test session starts ===============
platform linux -- Python 3.9.1, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
rootdir: /home/christovis/02_AGE/datactive/bigbang
plugins: cov-2.11.1
collected 13 items                                                                                                   

test_listserv.py s....s.......                                                                                 [100%]

=============== warnings summary ===============
../config/config.py:8
  /home/christovis/02_AGE/datactive/bigbang/config/config.py:8: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
    dictionary = yaml.load(stream)

-- Docs: https://docs.pytest.org/en/stable/warnings.html
=============== 11 passed, 2 skipped, 1 warning in 16.91s ===============

The assert that fails for you checks that all messages in the list were read. It thus seems like that a lot of them were missed. It is difficult for me to trace back where the bug could be. Could you maybe provide me with information about how many inidividual .LOG files in the 3GPP directory were read?

Regarding the point that there might be more files in /tests/data/3GPP/ I agree. I will reduce it further.

sbenthall commented 3 years ago

I'm getting an error similar to the one @npdoty got:

https://gist.github.com/sbenthall/c704a828cbebf2fd7394d00a9bed3068

It looks to me like there's only 25 listserv log files checked into the repository branch. That may explain why the test is unable to find 57 of them.

Christovis commented 3 years ago

Ufff, I found the reason... the ListservArchive instance contained the ListservList instances in a different order than when I run it on my machine :unamused: In other words the archive containes two mailing-lists [l1, l2], where l1 containes 25 messaged and l2 57. I neglected the importance of registering the order of the mailing lists such that when you ran the tests it was [l1, l2] while for me it was [l2, l1].

As @npdoty suggested in #442, it might be worth keeping tests for accurate scraping. So make those tests optional, it might be a good idea to follow this suggestion:

tests/
    unit/
        test_listserv.py
        test_bigbang.py  <- I think this naming convention is more common
    integration/
        test_optional.py <- optional tests for scraping

What do you think?

Christovis commented 3 years ago

The folder structure that I have implemented now is:

tests/
    data/
        2001-November.txt <- I thought it makes more sense in this folder, than to have it where test_bigbang.py is, as it used to be.
        urls-test-file.txt
        ...
    unit/
        test_listserv.py
        test_bigbang.py  <- I think this naming convention is more common
    webscraping/ <- all optional tests for scraping, thus later we could/should do the same for e.g. W3C
        test_listserv.py

Let me know and I can easily change it again.

Christovis commented 3 years ago

Pfff the error is caused by the same behavior as I described it above. Apparently the messages are stored in a different order within the mailing list since the files are read in a different order. Therefore, the first message (mlist.messages[0]) for me is not the same for you. I adapt the tests accordingly.

Regarding the handling of tests in /tests/unit/ and /tests/webscraping/ what do you think about: 1) We can make the necessary adjustment in the workflow I suggested in #411 such that anything that is merged into master only runs tests in /tests/unit/. 2) Add a pytest.ini that ensures that only /tests/unit/ is the default when tests are done locally.

I think flagging the individual webscraping tests is inconvenient and too low down in the hierarchy than necessary. Simply using something like pytest tests/unit as you suggested is better I think.

npdoty commented 3 years ago

Thanks for fixing those fixture and path issues! (We may still need to handle paths through Python methods rather than string concatenation in the future, but that can wait.)

Tests still fail on my machine:

_____________________________________________ TestListservList.test__to_mbox ______________________________________________

self = <tests.unit.test_listserv.TestListservList object at 0x7fa955e184d0>
mlist = <bigbang.listserv.ListservList object at 0x7fa9566beb50>

    def test__to_mbox(self, mlist):
        mlist.to_mbox(dir_temp, filename=mlist.name)
        file_temp_mbox = f"{dir_temp}/{mlist.name}.mbox"
        f = open(file_temp_mbox, "r")
        lines = f.readlines()
        assert len(lines) == 41623
>       assert lines[21] == "d activities:\n"
E       AssertionError: assert 'I leave out ... its scope.\n' == 'd activities:\n'
E         - d activities:
E         + I leave out everything that is not explicitly 5G or NFV etc. in its scope.

tests/unit/test_listserv.py:71: AssertionError

Looks like a similar problem with assuming a particular order, but I haven't investigated closely.

Christovis commented 3 years ago

Yes, I think you are right. My latest push should have fixed that too now... finger crossed.

datactive / bigbang

Seperate tests according to server dependence #443