Closed Christovis closed 3 years ago
I get 3 errors on running these tests, all versions of this:
___________________________________ ERROR at setup of TestListservArchive.test__to_dict ___________________________________
self = <tests.test_listserv.TestListservArchive object at 0x7f90abadde10>
@pytest.fixture(name="arch", scope="session")
def test__from_listserv_directory(self):
arch = ListservArchive.from_listserv_directory(
name="3GPP",
directorypath=project_directory + "/tests/data/3GPP/",
)
assert arch.name == "3GPP"
assert len(arch) == 2
> assert len(arch.lists[0]) == 325
E assert 29 == 325
E + where 29 = len(<bigbang.listserv.ListservList object at 0x7f90ac7eeb90>)
tests/test_listserv.py:167: AssertionError
The tests are still on the slow side, and the number of files involved is probably larger than necessary for minimal tests, but it's still a huge improvement on speed. (If the test failure indicates that one version expects to crawl over 10 times as many lists, then I suggest going with the 29 list rather than the 325 version.) If we can fix these failures, then I think it's fine to merge.
Hmmm... I get the following result when running the tests:
~/…/datactive/bigbang/tests dev_listserv_tests
❯ pytest test_listserv.py ❮
=============== test session starts ===============
platform linux -- Python 3.9.1, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
rootdir: /home/christovis/02_AGE/datactive/bigbang
plugins: cov-2.11.1
collected 13 items
test_listserv.py s....s....... [100%]
=============== warnings summary ===============
../config/config.py:8
/home/christovis/02_AGE/datactive/bigbang/config/config.py:8: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
dictionary = yaml.load(stream)
-- Docs: https://docs.pytest.org/en/stable/warnings.html
=============== 11 passed, 2 skipped, 1 warning in 16.91s ===============
The assert that fails for you checks that all messages in the list were read. It thus seems like that a lot of them were missed. It is difficult for me to trace back where the bug could be. Could you maybe provide me with information about how many inidividual .LOG files in the 3GPP directory were read?
Regarding the point that there might be more files in /tests/data/3GPP/ I agree. I will reduce it further.
I'm getting an error similar to the one @npdoty got:
https://gist.github.com/sbenthall/c704a828cbebf2fd7394d00a9bed3068
It looks to me like there's only 25 listserv log files checked into the repository branch. That may explain why the test is unable to find 57 of them.
Ufff, I found the reason... the ListservArchive instance contained the ListservList instances in a different order than when I run it on my machine :unamused: In other words the archive containes two mailing-lists [l1, l2], where l1 containes 25 messaged and l2 57. I neglected the importance of registering the order of the mailing lists such that when you ran the tests it was [l1, l2] while for me it was [l2, l1].
As @npdoty suggested in #442, it might be worth keeping tests for accurate scraping. So make those tests optional, it might be a good idea to follow this suggestion:
tests/
unit/
test_listserv.py
test_bigbang.py <- I think this naming convention is more common
integration/
test_optional.py <- optional tests for scraping
What do you think?
The folder structure that I have implemented now is:
tests/
data/
2001-November.txt <- I thought it makes more sense in this folder, than to have it where test_bigbang.py is, as it used to be.
urls-test-file.txt
...
unit/
test_listserv.py
test_bigbang.py <- I think this naming convention is more common
webscraping/ <- all optional tests for scraping, thus later we could/should do the same for e.g. W3C
test_listserv.py
Let me know and I can easily change it again.
Pfff the error is caused by the same behavior as I described it above. Apparently the messages are stored in a different order within the mailing list since the files are read in a different order. Therefore, the first message (mlist.messages[0]
) for me is not the same for you.
I adapt the tests accordingly.
Regarding the handling of tests in /tests/unit/
and /tests/webscraping/
what do you think about:
1) We can make the necessary adjustment in the workflow I suggested in #411 such that anything that is merged into master only runs tests in /tests/unit/
.
2) Add a pytest.ini that ensures that only /tests/unit/
is the default when tests are done locally.
I think flagging the individual webscraping tests is inconvenient and too low down in the hierarchy than necessary. Simply using something like pytest tests/unit
as you suggested is better I think.
Thanks for fixing those fixture and path issues! (We may still need to handle paths through Python methods rather than string concatenation in the future, but that can wait.)
Tests still fail on my machine:
_____________________________________________ TestListservList.test__to_mbox ______________________________________________
self = <tests.unit.test_listserv.TestListservList object at 0x7fa955e184d0>
mlist = <bigbang.listserv.ListservList object at 0x7fa9566beb50>
def test__to_mbox(self, mlist):
mlist.to_mbox(dir_temp, filename=mlist.name)
file_temp_mbox = f"{dir_temp}/{mlist.name}.mbox"
f = open(file_temp_mbox, "r")
lines = f.readlines()
assert len(lines) == 41623
> assert lines[21] == "d activities:\n"
E AssertionError: assert 'I leave out ... its scope.\n' == 'd activities:\n'
E - d activities:
E + I leave out everything that is not explicitly 5G or NFV etc. in its scope.
tests/unit/test_listserv.py:71: AssertionError
Looks like a similar problem with assuming a particular order, but I haven't investigated closely.
Yes, I think you are right. My latest push should have fixed that too now... finger crossed.
This PR contains the following changes:
/tests/
folder structure as shown hereto_conversationkg_dict()
method toListservList
andListservArchive
classes which creates the required data structure for constructing knowledge graphs with ConversationKG