datactive / bigbang

Scientific analysis of collaborative communities
http://datactive.github.io/bigbang/
MIT License

E AssertionError: False is not true : Thread message count in restored archive is off #593

Closed sbenthall closed 1 year ago

sbenthall commented 1 year ago

A failing test that seems, rather bizarrely, to have been introduced by https://github.com/datactive/bigbang/pull/585, even though tests passed on that PR.

The failure is in tests/unit/test_bigbang.py.

Namely:

    def test_mailman_chain(self):
        name = "bigbang-dev-test.txt"

        # archive loaded from mbox
        arx = archive.Archive(name, archive_dir=CONFIG.test_data_path, mbox=True)

        arx.save("test.csv")

        # archive loaded from stored csv
        arx2 = archive.load("test.csv")

        ....

        self.assertTrue(
            [t.get_num_messages() for t in arx.get_threads()] == [3, 1, 2],
            msg="Thread message count in mbox archive is off",
        )

        self.assertTrue(  # <---- ERROR HERE
            [t.get_num_messages() for t in arx2.get_threads()] == [3, 1, 2],
            msg="Thread message count in restored archive is off",
        )

The restored email archive reports a different number of threads with different message counts: [4] instead of the expected [3, 1, 2].

This is super weird.

$ python -m pytest tests/unit/test_bigbang.py 
========================================================================= test session starts ==========================================================================
platform linux -- Python 3.8.13, pytest-7.3.1, pluggy-1.0.0
rootdir: /home/sb/projects/bigbang/tests
configfile: pytest.ini
plugins: anyio-3.6.2
collected 8 items                                                                                                                                                      

tests/unit/test_bigbang.py ...F....                                                                                                                              [100%]

=============================================================================== FAILURES ===============================================================================
____________________________________________________________________ TestArchive.test_mailman_chain ____________________________________________________________________

self = <tests.unit.test_bigbang.TestArchive testMethod=test_mailman_chain>

    def test_mailman_chain(self):
        name = "bigbang-dev-test.txt"

        # archive loaded from mbox
        arx = archive.Archive(name, archive_dir=CONFIG.test_data_path, mbox=True)

        arx.save("test.csv")

        # archive loaded from stored csv
        arx2 = archive.load("test.csv")

        print(arx.data.dtypes)
        print(arx.data.shape)

        self.assertTrue(
            arx.data.shape == arx2.data.shape,
            msg="Original and restored archives are different shapes",
        )

        self.assertTrue(
            (arx2.data.index == arx.data.index).all(),
            msg="Original and restored archives have nonidentical indices",
        )

        self.assertTrue(
            [t.get_num_messages() for t in arx.get_threads()] == [3, 1, 2],
            msg="Thread message count in mbox archive is off",
        )
>       self.assertTrue(
            [t.get_num_messages() for t in arx2.get_threads()] == [3, 1, 2],
            msg="Thread message count in restored archive is off",
        )
E       AssertionError: False is not true : Thread message count in restored archive is off

sbenthall commented 1 year ago

This seems to be due to differences in how 'None' fields are represented in the original and restored archive.

(Pdb) arx.data['In-Reply-To'].iloc[0] is None
False
(Pdb) arx2.data['In-Reply-To'].iloc[0] is None
True

These fields are actually missing from the original test data, so the two archives end up representing the same missing value in different ways.
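
For anyone hitting something similar: a minimal sketch of why an `is None` check can change across a CSV round trip. This is not BigBang's code. It uses only the standard-library `csv` module, the rows and the `is_missing` helper are hypothetical, and the direction of the mismatch here (None before saving, not None after) is the reverse of the Pdb output above. The general point is the same: CSV has no native way to store None, so the missing-value marker can differ before and after saving.

```python
import csv
import io

# Hypothetical message rows: the first message starts a thread, so it has no parent.
rows = [
    {"Message-ID": "<a@example.org>", "In-Reply-To": None},
    {"Message-ID": "<b@example.org>", "In-Reply-To": "<a@example.org>"},
]

# Write the rows to an in-memory CSV, then read them back.
buf = io.StringIO(newline="")
writer = csv.DictWriter(buf, fieldnames=["Message-ID", "In-Reply-To"])
writer.writeheader()
writer.writerows(rows)
buf.seek(0)
restored = list(csv.DictReader(buf))


def is_missing(value):
    """Hypothetical helper: treat None, empty strings, and the text 'None' as missing."""
    return value is None or str(value).strip() in ("", "None")


# The csv module writes None as an empty field, which is read back as "".
original_is_none = rows[0]["In-Reply-To"] is None
restored_is_none = restored[0]["In-Reply-To"] is None

print(original_is_none, restored_is_none)
print(is_missing(rows[0]["In-Reply-To"]), is_missing(restored[0]["In-Reply-To"]))
```

Any logic that decides "is this message a reply?" with an identity check will group messages differently depending on which representation it sees. Normalizing missing values in one place (here, `is_missing`) before building threads avoids that. pandas adds its own layer, since `read_csv` turns empty fields into NaN by default.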

sbenthall commented 1 year ago

Fixed with https://github.com/datactive/bigbang/commit/7ee437e1dc78af7b9a939ee0195de0a18d5c88aa