CornellNLP / ConvoKit

ConvoKit is a toolkit for extracting conversational features and analyzing social phenomena in conversations. It includes several large conversational datasets along with scripts exemplifying the use of the toolkit on these datasets.
https://convokit.cornell.edu/documentation/
MIT License

Db Storage #140

Closed jschluger closed 2 years ago

jschluger commented 2 years ago

Implementing abstract storage for ConvoKit: all Corpus and CorpusComponent objects now have a self.storage instance variable referring to a convokit.StorageManager object, which manages the storage for that object. Within the StorageManager class, two distinct storage options are abstracted away from the end user: in-memory storage (what ConvoKit has always provided) and database storage. I implement the DBCollectionMapping, DBDocumentMapping, MemCollectionMapping, and MemDocumentMapping classes, which extend the MutableMapping interface to store collections of items, and the data for a single item, in the two storage modes.
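To give a sense of the pattern, here is a minimal sketch of the in-memory variant (the class name matches the PR's MemCollectionMapping, but the body is purely illustrative and not the PR's actual code); a DB-backed variant would implement the same five methods against a MongoDB collection instead of a dict:

from collections.abc import MutableMapping

class MemCollectionMapping(MutableMapping):
    # Illustrative in-memory mapping: a thin wrapper around a dict.
    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)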

calebchiam commented 2 years ago

@jschluger Thanks for the PR. I see that checks are currently failing -- is that intended?

jschluger commented 2 years ago

@calebchiam The failing checks were not intended, but rather an oversight on my part. I am now working on fixing all the bugs so the code passes these tests, rather than just running the tests on my machine. This includes some simple fixes, e.g. checking that directories exist before reading from them, that never came up in local testing.

A more consequential fix is to set up a database for these automated tests to use, so that DB storage mode can be tested. I'm not sure what the best approach for this is, but I'm looking into it now. If anyone has experience with anything like this, I'd appreciate any suggestions.
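One possible approach (just a sketch of the idea, not what this PR ultimately does; the test class and connection details below are hypothetical) is to point the DB-mode tests at a throwaway local MongoDB and skip them cleanly when no server is reachable, so the same suite runs both locally and in CI:

import unittest
from pymongo import MongoClient
from pymongo.errors import ServerSelectionTimeoutError

class DBStorageTestCase(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Fail fast if no MongoDB server is running on the default port.
        cls.client = MongoClient("mongodb://localhost:27017", serverSelectionTimeoutMS=2000)
        try:
            cls.client.admin.command("ping")
        except ServerSelectionTimeoutError:
            raise unittest.SkipTest("no local MongoDB available for DB-mode tests")

    @classmethod
    def tearDownClass(cls):
        cls.client.close()  # make sure no open connection outlives the tests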

jschluger commented 2 years ago

I've resolved all the CI Checks issues!

calebchiam commented 2 years ago

I've got a long-haul flight tomorrow, so I should be able to take a look at the code. ~In the meantime, I see the tests are failing due to timeout issues -- they're somehow exceeding the 6-hour mark. @jschluger, any idea what the issue might be?~ Scratch that: I re-ran them and the issue resolved itself.

oscarso2000 commented 2 years ago

Code review fixes are done. Moving on to fixing the Corpus.merge issue, where merge generates a new ID/name for the merged corpora. Most likely I will add a new parameter to the merge function that defines the merged corpus name.

oscarso2000 commented 2 years ago

Hopped on a call with Jonathan to discuss the best way to fix the bug. I created a merge_helper function, which acts as a private function for toggling modify. modify=True is used in add_utterances, which returns self so the corpus and its ID are preserved, whereas merge is now a public one-liner that calls merge_helper with modify always set to False. The DB now updates correctly if the corpus is not initially in the DB. However, it still fails to update if the corpus is already in the DB, because of the version issue: if the corpus is not in the DB, v0 is used and updated as normal, but if the corpus is already in the DB, it keeps incrementing the version count (0.1, 0.2, etc.). The respective version is correct, but the base raw version 0 is not updated. I am not sure what we want to do about this, and I don't know if it's correct. For now I could always set the version to be the same, but honestly, if that's the case, I don't know what the purpose of having versions even is. I have committed my current changes to GitHub to keep everyone updated. Please see the Google folder here for the two scenarios to get a better understanding; for the HTML files, please download and open them in your browser to view.
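To make the split concrete, here is a toy sketch of the shape of the change (names and bodies are illustrative only, not ConvoKit's actual implementation): the private helper owns the modify flag, while the public merge() is a one-liner that never exposes it.

class Corpus:
    def __init__(self, corpus_id, utterances=None):
        self.id = corpus_id
        self.utterances = dict(utterances or {})

    def _merge_helper(self, other, modify):
        # modify=True: mutate this corpus in place (the add_utterances path),
        # keeping its existing ID; modify=False: build a brand-new corpus.
        target = self if modify else Corpus(f"{self.id}+{other.id}", self.utterances)
        target.utterances.update(other.utterances)
        return target

    def add_utterances(self, other):
        return self._merge_helper(other, modify=True)

    def merge(self, other):
        return self._merge_helper(other, modify=False)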

oscarso2000 commented 2 years ago

To do:

oscarso2000 commented 2 years ago
[Screenshot: Screen Shot 2022-03-28 at 6.46.39 PM]

Failing at this stage; not sure what it means. Compared to before, it seems to be failing at downloading the spaCy en_core_web_sm model.

jpwchang commented 2 years ago

@oscarso2000 I looked up the error and unfortunately it appears to be a new issue on the SpaCy side. They are tracking the problem here: https://github.com/explosion/spaCy/issues/10564

oscarso2000 commented 2 years ago

The SpaCy compile bug seems to be fixed now that someone has updated typer to 0.4.1.

oscarso2000 commented 2 years ago

I will do final tests and move the Jupyter notebooks into the right folders, after removing all secrets, within the week.

oscarso2000 commented 2 years ago

@calebchiam Why do you think the 3.8 runs didn't succeed? It seemed like they were just stuck rather than failing, and you ultimately had to cancel them?

calebchiam commented 2 years ago

@oscarso2000 Not sure, but based on the re-run, it seems to be a persistent issue.

oscarso2000 commented 2 years ago

@calebchiam What I do notice is that all tests pass. Comparing 3.7 and 3.8, it just seems like the 3.8 run doesn't move from the tests step to post-setup. I'm trying to Google whether this has happened to anyone else, but so far to no avail, since I don't really know what to search for.

oscarso2000 commented 2 years ago

This is very weird: the mem check for 3.8 is now successful on CI, but the db check for 3.8 is not. Makes no sense.

oscarso2000 commented 2 years ago

Essentially what's happening is that the tests run successfully, but something is preventing the job from finishing the tests and moving on. Could it be something to do with the call stack, or maybe a DB connection that is still open? The odd part is that the other five checks work (and the mem check for 3.8 just miraculously worked).

oscarso2000 commented 2 years ago

Commit 73542ba952c26246091b095e8b63f2417249716b has the same code as commit 3f9d6d6016067957779866fd9b4b1c57039946d9. The first one passes and somehow the second one doesn't, which is highly odd in my opinion. @jschluger @calebchiam @jpwchang. I am still trying to create a venv for 3.8 on my laptop and will update when I have results. But it could also be something to do with the typer/spacy update; lots of open possibilities.

oscarso2000 commented 2 years ago

With commit 3f9d6d6016067957779866fd9b4b1c57039946d9, I re-ran the failed tests and they succeeded. It honestly might be something to do with memory, so I will keep checking for whatever is breaking it, if anything.

oscarso2000 commented 2 years ago

@jpwchang @calebchiam No code has changed, but I set in_place=True again, and all six checks are successful after two runs. I am very confused.

calebchiam commented 2 years ago

Nice work, @oscarso2000! I'll try to review this by the end of next week (have some launch deadlines at my company), but @jpwchang, feel free to review this instead if you can find the time.

oscarso2000 commented 2 years ago

Tested the examples directory; everything works locally.

oscarso2000 commented 2 years ago

Updated the DB documentation for users (the .rst files in the docs folder): one is called db_setup.rst and the other is storage_options.rst.

jpwchang commented 2 years ago

@calebchiam given that I've been working closely with Oscar on this, it would be preferable if you could do the review (to give a more unbiased take). But let us know if that's not feasible for you anymore.

oscarso2000 commented 2 years ago

@calebchiam Thank you in advance. It is greatly appreciated.

jpwchang commented 2 years ago

@oscarso2000 One more minor change to request: I noticed there are some commented-out lines of code in the which_storage_mode.ipynb example notebook. Presumably these are left over from testing. We should make sure to get rid of these; since these notebooks are meant as our public-facing example/tutorial code we need to ensure that they are clear, and leftover commented-out code can be confusing.

calebchiam commented 2 years ago

Two more things before merge:

  1. @oscarso2000, could you run all .ipynb notebooks using ConvoKit v3.0? I believe this will be useful for surfacing errors in untested parts of the code (it would have surfaced the Corpus self.corpus_dirpath issue at least)
  2. For whoever is doing the merge, please make sure to squash and merge. This is a huge PR, so we'll absolutely want this set of changes to be consolidated for future reference and debugging.

Thanks!

jpwchang commented 2 years ago

@oscarso2000 @calebchiam I have identified a possible issue with dependencies. According to the documentation, we are presenting DB mode as an optional feature; i.e., someone who does not want to use the DB functionality can in theory install convokit and skip the MongoDB-specific steps. As such, pymongo is not listed as a requirement in setup.py.

I just tried this myself, building a local copy of the convokit package from the db-storage branch and installing it into a clean conda environment without mongodb. Installation works, but the result is that trying to import convokit causes a crash because convoKitMeta.py attempts to import DBDocumentMapping, which in turn has an import for bson (a pymongo component), which is of course not found since pymongo is not installed.

Do either of you know the best way to address this? I imagine we can do something similar to how we handled pytorch, which I believe is also an optional dependency?
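One common pattern (a sketch only, with hypothetical helper names; I'm not claiming this is exactly how the torch case is handled) is to guard the import and raise a helpful error only when DB mode is actually requested:

try:
    import pymongo  # noqa: F401
    _PYMONGO_AVAILABLE = True
except ImportError:
    _PYMONGO_AVAILABLE = False

def require_pymongo():
    # Called only on the DB-mode code path, so mem-mode users never need pymongo.
    if not _PYMONGO_AVAILABLE:
        raise ImportError(
            "Database storage mode requires the optional dependency pymongo; "
            "install it with `pip install pymongo`."
        )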

jpwchang commented 2 years ago

@oscarso2000 @calebchiam Just found a second (slightly related) issue with installation. Turns out, there is an error you would run into even earlier when trying to import convokit, regardless of the mongodb status:

>>> import convokit
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpchang/.conda/envs/piptest/lib/python3.8/site-packages/convokit/__init__.py", line 1, in <module>
    from .model import *
  File "/home/jpchang/.conda/envs/piptest/lib/python3.8/site-packages/convokit/model/__init__.py", line 1, in <module>
    from .conversation import Conversation
  File "/home/jpchang/.conda/envs/piptest/lib/python3.8/site-packages/convokit/model/conversation.py", line 2, in <module>
    from .utterance import Utterance
  File "/home/jpchang/.conda/envs/piptest/lib/python3.8/site-packages/convokit/model/utterance.py", line 3, in <module>
    from .corpusComponent import CorpusComponent
  File "/home/jpchang/.conda/envs/piptest/lib/python3.8/site-packages/convokit/model/corpusComponent.py", line 3, in <module>
    from .convoKitMeta import ConvoKitMeta
  File "/home/jpchang/.conda/envs/piptest/lib/python3.8/site-packages/convokit/model/convoKitMeta.py", line 6, in <module>
    from convokit.storage import DBDocumentMapping, StorageManager
ModuleNotFoundError: No module named 'convokit.storage'
>>> 

It seems like the only reason I didn't get this error the first time I tried was that back then I was testing from inside my local repo directory, and I think it was grabbing the convokit.storage from my local directory.

jpwchang commented 2 years ago

Ok, managed to fix the convokit.storage issue by adding it to setup.py. We still need to discuss the MongoDB dependency issue however.
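For context, this is exactly the kind of omission that setuptools' find_packages() sidesteps; a minimal illustrative setup.py (not ConvoKit's actual one) would pick up the new subpackage automatically:

from setuptools import setup, find_packages

setup(
    name="convokit",
    version="0.0.0",  # placeholder version for illustration
    packages=find_packages(exclude=("tests",)),  # includes convokit.storage automatically
)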

calebchiam commented 2 years ago

I noticed too that it was missing from setup.py and commented above, but I thought it had been left out by accident. I don't see any reason to make it optional though: (1) I think we can expect it to be used fairly frequently (since it should make working with any of our larger corpora significantly easier), and (2) the total size of the dependencies is ~1.5 MB. We made torch optional because that package is 50-150 MB (depending on OS). Given that, I think it's fine to have these packages installed by default.

oscarso2000 commented 2 years ago

Hi All,

I see that you guys are currently working on it. I don't want to cause merge conflicts, so I'll wait until tonight. As for explanations of the code base: most of the code was written by Charlotte, and I was tasked with debugging the issue where add_utterances created a new instance instead of updating the original corpus, so I can make a few guesses here and there but that's about it. For speaker_id in Utterance, I guess it is just storing the "username" or something. Especially for CRAFT, where we have Reddit posts/comments, we save the id that is registered to the username (the id can be found in the DOM) just as a reference.

Oscar

calebchiam commented 2 years ago

@oscarso2000 you can go ahead and make changes -- I won't be editing the code further until you've had a chance to resolve the comments.

@jpwchang Are you able to address the questions?

oscarso2000 commented 2 years ago

@calebchiam @jpwchang Are load_vector_reprs and dump_vector_reprs supposed to be commented out?

calebchiam commented 2 years ago

@oscarso2000 Yep, they're obsolete; they could actually be deleted.

calebchiam commented 2 years ago

@jpwchang config.yml needs a version number so that we can tell when to update it if/when we change its design in the future. We could ignore that for now and incur the cost of dealing with it later. Just putting it out there for posterity.

calebchiam commented 2 years ago

I'm going to leave a task here for checking the .ipynb notebooks in /examples, some of which seem to be failing as a result of this PR.

Please tick these off for whichever ones you've checked and highlight any issues in this thread.

@oscarso2000 @jpwchang @cristiandnm

calebchiam commented 2 years ago

Vectors

vector_demo.ipynb is not running correctly. If you run this .ipynb and check Cell 16, you'll see that corpus.vectors is an empty set, despite vector matrices being associated with the Corpus. The same is true in Cells 33 and 35. When fixing this, please compare all cells against the expected output and verify that they match.

calebchiam commented 2 years ago

@oscarso2000 Note that changing self.corpus_dirpath to self.filename does not resolve the issue I mentioned before because self.filename is never initialized in corpus.py.

Here's a sample piece of code that fails as a result:

# load / initialize any corpus object as `corpus` first
import numpy as np
from convokit import ConvoKitMatrix

vect = np.random.random((10, 5))
matrix = ConvoKitMatrix(name='matrix', matrix=vect)
corpus.append_vector_matrix(matrix)
corpus.dump_vectors('local_dump')

This will raise an error: AttributeError: 'Corpus' object has no attribute 'filename'

calebchiam commented 2 years ago

Conversation.iter_speakers() raises an exception.

For any loaded corpus, you can run this to replicate:

convo = corpus.random_conversation()
list(convo.iter_speakers())

This raises AttributeError: 'NoneType' object has no attribute 'get_speaker'

This also causes sigdial-demo.ipynb to fail.

oscarso2000 commented 2 years ago

@calebchiam I'll look more into the filename issue

oscarso2000 commented 2 years ago

That's odd. @calebchiam My notebook seemed to work last week, but now it doesn't. Maybe it was importing the old convokit before instead of the local one.

jpwchang commented 2 years ago

@calebchiam @oscarso2000 I did some digging and it seems like the vector problem and the iter_speakers problem have the same root cause. If you look at corpusComponent.py (the base class for Utterance, Speaker, and Conversation), lines 28-29 in the constructor have the following code:

if from_db:
    return

This means that in DB mode, nothing past line 29 of the constructor gets run. The problem is that both vectors and owner (the latter of which is used by iter_speakers) are set in code below line 29.

But beyond this I am completely at a loss, because I have no idea why the code I pasted above exists. @jschluger on the off chance you see this, would you be able to explain the intent here? Because I am hesitant to get rid of code whose purpose I don't understand, as I am afraid of breaking something downstream...
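To illustrate why that early return produces the symptoms above (a toy example, not the actual ConvoKit classes): any attribute that only gets its real value after the return is left in its default state, so the failure surfaces later, at use time, as a 'NoneType' attribute error.

class Component:
    def __init__(self, from_db=False):
        self.vectors = set()
        self.owner = None
        if from_db:
            return  # in DB mode, everything below is skipped
        self.owner = "the owning corpus"  # stand-in for the real owner object

    def iter_speakers(self):
        return self.owner.upper()  # blows up while owner is still None

Component(from_db=True).iter_speakers()
# AttributeError: 'NoneType' object has no attribute 'upper'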

oscarso2000 commented 2 years ago

@jpwchang I have been testing with storage_mode='mem' on the corpus download and it seems to not work

jpwchang commented 2 years ago

@oscarso2000 downloading with mem mode seems to work for me, what specific error are you seeing?

oscarso2000 commented 2 years ago

@jpwchang Downloading works, but the call in Cell 16 to corpus.vectors still gives me set() instead of ["bow"].

oscarso2000 commented 2 years ago

More precisely, setting the environment variable CONVOKIT_STORAGE_MODE (via export) to either 'db' or 'mem' doesn't help.

jpwchang commented 2 years ago

@oscarso2000 @calebchiam Ok, this one was a bit of a doozy, but I think I have it figured out?

The reason Oscar is seeing corpus.vectors not work even in mem mode is a quirk in how corpus.vectors is defined. corpus.vectors is a property which, to external-facing clients, appears to be of type set; as such, the public function add_vector works by calling the set's add method. In reality, however, the property is backed by the "vectors" entry in the index, which is not a set but a list, and there is a type conversion in the getter and setter to switch between the two. That type conversion doesn't play well with the implicit mutability offered by collection-type properties, so changes never actually get recorded.

I'm not sure why this hidden type conversion exists, but we can solve the issue by just deciding whether we want vectors to be a set or a list, and making it consistent throughout. Through a toy example I've confirmed that implicit mutability works with properties that are simply a list or simply a set.
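Here is roughly the toy example I mean (illustrative only, not the actual Corpus/index code): mutating the value returned by a type-converting getter only touches a temporary copy, whereas assigning through the setter persists.

class Index:
    def __init__(self):
        self._vectors = []  # backing storage is a list

    @property
    def vectors(self):
        return set(self._vectors)  # the conversion creates a brand-new set

    @vectors.setter
    def vectors(self, new_vectors):
        self._vectors = list(new_vectors)

idx = Index()
idx.vectors.add("bow")               # mutates a temporary set; the change is lost
print(idx.vectors)                   # set() -- still empty
idx.vectors = idx.vectors | {"bow"}  # reassigning through the setter does persist
print(idx.vectors)                   # {'bow'}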

calebchiam commented 2 years ago

@jpwchang We can make vectors a list. IIRC, I made it a set to optimize the check for whether a vector is present, but it's not a big deal to lose that optimization. The conversion to a list in the index is presumably because sets aren't a concept in JSON, so we have to convert to a list before dumping the index to JSON format.
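(For anyone following along, the JSON constraint is easy to see in isolation:)

import json

json.dumps({"vectors": ["bow"]})  # fine: lists map to JSON arrays
json.dumps({"vectors": {"bow"}})  # TypeError: Object of type set is not JSON serializable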

jpwchang commented 2 years ago

@calebchiam The vectors field is now a list, and I can confirm the notebook now works as expected in mem mode.

calebchiam commented 2 years ago

Fixed a small bug with vectors (duplicate entry in list problem), but corpus.dump at the end of the notebook is still failing.