FreeDiscovery / jwzthreading

Implementation of the JWZ threading algorithm for e-mail or newsgroup messages.

Any description how to use the library? #9

Open benelot opened 6 years ago

benelot commented 6 years ago

Hello!

I am looking for advice on how to use the library. I have my mails in the Unix mbox format and I would like to get them threaded. How do I get this to work?

I tried to use the 2010-January.txt file from your tests and wanted to use it as an mbox directly with the jwzthreading.py:

  python3 jwzthreading.py "./2010-January.txt"

However, the library breaks with:

Traceback (most recent call last):
  File "jwzthreading.py", line 592, in <module>
    main()
  File "jwzthreading.py", line 585, in main
    threads = thread(msglist)
  File "jwzthreading.py", line 508, in thread
    (existing.message is not None and
AttributeError: 'JwzContainer' object has no attribute 'message'

Thanks for any hints.

Also, is it then possible to get a list of emails belonging to a certain thread? I saw that the dictionary at the end of the threading method was dropped, so you no longer have a subject=emails dict.

benelot commented 6 years ago

I think I found something. Without group_by_subject, it works fine. However, I am looking into the Enron data set, which does not have any In-Reply-To or References fields in the mail headers, so I need to group by subject.

There the message attribute in the container is accessed in the wrong way. The message attribute of the variable 'existing' seems to be accessible with get('message'). I will keep posting if things work.

Edit: I think I will open a pull request. In the bottom part of the code (group_by_subject), some things are broken or out of date. I checked the code this was forked from to understand what it was meant to do, and found the changes needed to make it work.

rth commented 6 years ago

Hello @benelot ,

thanks for opening this issue!

I have my mails in the Unix mbox format and I would like to get them threaded. How do I get this to work?

I just added an example; the main() function under jwzthreading.py was indeed not up to date anymore. Hope this helps.

Also, is it then possible to get a list of emails belonging to a certain thread? I saw that the dictionary at the end of the threading method was dropped, so you no longer have a subject=emails dict.

Yes, a thread is a hierarchical structure composed of JwzContainer objects, and you can get the list of all containers (or messages) belonging to a thread with JwzContainer.flatten().
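As a rough illustration of what flattening a thread means, here is a minimal sketch using a hypothetical ThreadNode class. It is a stand-in written for this example, not the library's JwzContainer, and the library's own flatten() may order or return things differently. The sketch walks the tree depth-first and returns the node together with all of its descendants.

```python
class ThreadNode(dict):
    """Hypothetical dict-like tree node; NOT the library's JwzContainer."""

    def __init__(self, message=None):
        super().__init__(message=message)
        self.children = []

    def add_child(self, child):
        """Attach a reply below this node and return it."""
        self.children.append(child)
        return child

    def flatten(self):
        """Return this node followed by all descendants, depth-first."""
        nodes = [self]
        for child in self.children:
            nodes.extend(child.flatten())
        return nodes


# Build a small thread: one root, two direct replies, one nested reply.
root = ThreadNode("Question")
reply = root.add_child(ThreadNode("Re: Question"))
reply.add_child(ThreadNode("Re: Re: Question"))
root.add_child(ThreadNode("Another reply"))

subjects = [node["message"] for node in root.flatten()]
print(subjects)
```

Calling flatten() on the root of each thread would then give the list of messages belonging to that thread.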

However, I am looking into the Enron data set, which does not have any In-Reply-To or References fields in the mail headers, so I need to group by subject.

We also wanted to run email threading on the Enron dataset at some point. Unfortunately, the JWZ algorithm fundamentally relies on the Message-ID and/or In-Reply-To header fields. Grouping by subject is just a small part of it (cf. step 5), applied once the actual threading is computed. I am also not convinced this step is useful: every time I enabled it, it produced false positives (unrelated threads with the same subject merged together).

When I last looked into this, I didn't find any good alternative to JWZ that wouldn't need those header fields. There are certainly a few research papers. For instance, from Yeh & Harnly (2006):

Recently, some work on threads has been done by heuristics. For example, Wu&Oard(2005) and Zhu&Song (2005) identified threads by linking messages with identical nontrivial subject lines (after removal of any sequence of “re:”, “fw:”, and “fwd:” prefixes). Klimt&Yang(2004) groups messages into a thread if they contain the same words in their subjects and are among the same users (addresses). Lewis&Knowles(1997), instead, regarded email threading as a retrieval problem. They showed that a significant threading effectiveness can be achieved by applying text matching methods to the textual portions of messages. In their work, they studied five retrieval strategies to indicate whether one message is a response to another. Their results exhibited that the most effective strategy is to use the quotation of a message as a query and to match it against the unquoted part of a target message.

But I haven't seen any implementation of those, and I'm not sure the results would be reliable enough to use in practice. If you find a solution for threading the Enron dataset, I would be interested to know.
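For reference, the simplest of the heuristics in the quoted passage (linking messages with identical non-trivial subject lines after removing "re:", "fw:" and "fwd:" prefixes) could be sketched roughly as follows. This is an illustration only, not code from any of the cited papers or from this library. The function names and the (message_id, subject) input format are assumptions made for the example.

```python
import re
from collections import defaultdict

# Matches one or more leading "re:", "fw:" or "fwd:" prefixes, case-insensitively.
_PREFIX_RE = re.compile(r"^(?:\s*(?:re|fwd?)\s*:\s*)+", re.IGNORECASE)


def normalize_subject(subject):
    """Strip reply/forward prefixes, collapse whitespace and lowercase."""
    stripped = _PREFIX_RE.sub("", subject)
    return " ".join(stripped.split()).lower()


def group_by_subject(messages):
    """Group (message_id, subject) pairs by normalized, non-empty subject."""
    groups = defaultdict(list)
    for message_id, subject in messages:
        key = normalize_subject(subject)
        if key:  # skip trivial subjects that are empty after stripping
            groups[key].append(message_id)
    return dict(groups)


messages = [
    ("1", "Budget meeting"),
    ("2", "RE: Budget meeting"),
    ("3", "Fwd: Lunch"),
    ("4", "re:"),
]
groups = group_by_subject(messages)
print(groups)
```

A grouping like this has the same weakness mentioned above: unrelated conversations that happen to share a subject end up in one group.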

There the message attribute in the container is accessed in the wrong way. The message attribute of the variable 'existing' seems to be accessible with get('message'). I will keep posting if things work.

I'm not sure I have understood the issue, but in any case I am sure there are issues; a PR to fix that would be welcome :)

benelot commented 6 years ago

Hi @rth,

The problem I have when running the code is that in some places, the message attribute of the container is accessed as container.message. This does not work here; I have to use get("message") instead. You seem to mix both notations in several locations. Without that change, I run into the exception mentioned above.

Furthermore, I had to revert to the former implementation of is_dummy. You seem to have changed that and I am not sure why. Your current implementation always finds 'message' as its key and thus none of the containers are considered a dummy.
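The symptom described here would be consistent with a key-presence check on a dict that always contains the key. The sketch below shows that failure mode with plain dicts. It is a guess at the cause, not a copy of the library's is_dummy, and both function names are made up for the example.

```python
def is_dummy_by_key(container):
    """Only checks that the key exists, not what it holds."""
    return "message" not in container


def is_dummy_by_value(container):
    """Treats a missing key and a None value the same way."""
    return container.get("message") is None


# A placeholder container created with an explicit None message,
# and a container that holds a real message.
placeholder = {"message": None, "children": []}
real = {"message": "some message object", "children": []}

print(is_dummy_by_key(placeholder))    # the key exists, so not seen as a dummy
print(is_dummy_by_value(placeholder))  # the value is None, so seen as a dummy
```

If every container is created with a "message" key, the first check can never report a dummy, whereas the second one still distinguishes placeholders from real messages.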

I will send you a pull request some time next week so you can decide if the changes work for you.

Maybe it has something to do with the fact that I am running the code on python3.6? Who knows, I don't.

Regarding Enron: The dataset contains a lot of duplicates, thus it might seem like it clusters messages together which have the same subject. However, without the duplicates I get mostly threads of 2 and a smaller number of threads of 3 mails. The rest seems to stay unthreaded.

rth commented 6 years ago

The problem I have when running the code is that in some places, the message attribute of the container is accessed by container.message.

Note that there are two types of objects: a generic container class, Container / JwzContainer (which behaves like a dict, so you can store anything in it, including a "message" key), and the Message class, which has a message attribute. So you end up with something like,

container = JwzContainer()
container['message'] = Message()
container['message'].message = "something"

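I'm not saying that this is a good situation, but it is the result of the evolution from the original implementation to the current code base, where I needed a more generic container class (also used to represent hierarchical clustering in this example: http://freediscovery.io/doc/stable/python/examples/birch_cluster_hierarchy.html#sphx-glr-python-examples-birch-cluster-hierarchy-py).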
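To make the distinction concrete, here is a small self-contained sketch of why attribute access on a dict-like container raises the AttributeError from the traceback while key access works. DictContainer and Msg are hypothetical stand-ins written for this example; they are not the library's JwzContainer and Message classes.

```python
class DictContainer(dict):
    """Hypothetical stand-in for a dict-like container."""


class Msg:
    """Hypothetical stand-in for a message wrapper with a .message attribute."""

    def __init__(self, message=None):
        self.message = message


container = DictContainer()
container["message"] = Msg("something")

# Attribute access looks for an attribute on the container itself,
# not for a dict key, so it fails.
error_text = None
try:
    container.message
except AttributeError as exc:
    error_text = str(exc)
print(error_text)

# Key access returns the wrapper; its .message attribute holds the payload.
wrapped = container.get("message")
payload = wrapped.message if wrapped is not None else None
print(payload)
```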

This does not work here, instead, I have to use get("message") instead.

So if you run the included example/parse_mailbox.py (or adapt it to use mailbox.mbox instead of parse_mailbox) you should not run into any issues.
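As a starting point for the mailbox.mbox route, the sketch below uses only the standard library to read an mbox file and pull out the header fields that threading depends on. It does not call jwzthreading at all. The sample data, the read_headers helper and the output dict layout are assumptions for the example, and the records would still need to be converted into whatever input the included example expects.

```python
import mailbox
import os
import tempfile

# Two tiny messages in mbox format; the second one replies to the first.
SAMPLE_MBOX = (
    "From alice@example.com Mon Jan  4 10:00:00 2010\n"
    "Message-ID: <1@example.com>\n"
    "Subject: Hello\n"
    "\n"
    "First message.\n"
    "\n"
    "From bob@example.com Mon Jan  4 11:00:00 2010\n"
    "Message-ID: <2@example.com>\n"
    "In-Reply-To: <1@example.com>\n"
    "References: <1@example.com>\n"
    "Subject: Re: Hello\n"
    "\n"
    "A reply.\n"
)


def read_headers(path):
    """Return the threading-related headers of every message in an mbox file."""
    box = mailbox.mbox(path)
    try:
        return [
            {
                "message_id": msg["Message-ID"],
                "in_reply_to": msg["In-Reply-To"],  # None when the header is absent
                "references": (msg["References"] or "").split(),
                "subject": msg["Subject"],
            }
            for msg in box
        ]
    finally:
        box.close()


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.mbox")
    with open(path, "w", newline="\n") as handle:
        handle.write(SAMPLE_MBOX)
    headers = read_headers(path)

for record in headers:
    print(record)
```

Messages for which in_reply_to is None and references is empty, as in the Enron data, give a header-based algorithm nothing to link on.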

Furthermore, I had to revert to the former implementation of is_dummy. You seem to have changed that and I am not sure why.

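I don't remember, it's been a while. Mostly I validated the obtained threading in test_threading_fedora_June2010 (https://github.com/FreeDiscovery/jwzthreading/blob/d462a36fb823603ea3ce056f1006c06e166ae6b1/jwzthreading/tests/test_newsgroups.py#L54). If you have another way of validating the results, a PR would definitely be welcome.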

benelot commented 6 years ago

Do you depend on JwzContainer anywhere outside of this repository (I saw that this is part of a larger project)? If not, I could refactor the internals to make things clean again.

rth commented 6 years ago

Sure, feel free to do so. The larger project currently bundles jwzthreading, so it won't be affected by a refactoring here (and I could always update the bundled version later). Thanks.