Closed. jschluger closed this PR 2 years ago.
@jschluger Thanks for the PR. I see that checks are currently failing -- is that intended?
@calebchiam The failing checks were not intended; they were an oversight on my part. I am now working on fixing all the bugs so the code passes these tests, rather than just running the tests on my machine. This includes some simple fixes, such as checking whether directories exist before reading from them, that never came up in local testing.
A more consequential fix is to set up a database for these automatic tests to use, so we can test DB storage mode. I'm not sure what the best approach is, but I'm looking into it now. If anyone has experience with anything like this, I'd appreciate suggestions.
I've resolved all the CI Checks issues!
I've got a long-haul flight tomorrow, so should be able to take a look at the code. ~In the meantime, I see the tests are failing due to timeout issues -- they're somehow exceeding the 6-hour mark. @jschluger, any idea what the issue might be?~ Scratch that, re-ran them and the issue resolved itself.
Code review comments fixed. Moving on to the Corpus.merge issue, where merge generates a new ID/name for the merged corpora. Most likely I will add a new parameter to the merge function that defines the merged corpus name.
Hopped on a call with Jonathan to discuss the best way to fix the bug. I created a merge_helper function that acts as a private function with a modify toggle: modify=True is used in add_utterances, which returns self to keep track of the corpus and its ID, whereas merge is now a public one-liner that calls merge_helper with modify always False (no option to toggle it).

The DB now updates if the corpus is not initially in the DB. However, it still fails to update if the corpus is already in the DB, because of the version issue. That is, if the corpus is not in the DB, v0 is used and updated as normal; if the corpus is already in the DB, it will continuously increment the version count (0.1, 0.2, etc.). The respective version will be correct, but not the base raw version 0. I am not sure what we want to do with this issue, and I don't know if it's correct. For now, I can always just set the version to be the same, but honestly, if that's the case, I don't know what the purpose of having versions even is.

I have committed my current changes to GitHub just to keep everyone updated. Please see the Google folder here for the two scenarios to get a better understanding. For the HTML files, please download them and open them in your browser to view.
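A minimal sketch of the public-wrapper/private-helper split described above. The class shape, method signatures, and merged-ID scheme are guesses based on this comment, not ConvoKit's actual implementation:

```python
class Corpus:
    def __init__(self, corpus_id):
        self.id = corpus_id
        self.utterances = {}

    def _merge_helper(self, other, modify):
        """If modify is True, merge into self (keeping self's id);
        otherwise build and return a brand-new merged corpus."""
        target = self if modify else Corpus(corpus_id=f"{self.id}+{other.id}")
        if not modify:
            target.utterances.update(self.utterances)
        target.utterances.update(other.utterances)
        return target

    def merge(self, other):
        # Public API: never modifies self, always returns a new corpus.
        return self._merge_helper(other, modify=False)

    def add_utterances(self, other):
        # Merges in place and returns self, preserving the corpus and its id.
        return self._merge_helper(other, modify=True)
```

Here merge never mutates self, while add_utterances merges in place and returns self so the corpus keeps its original ID, matching the behavior described.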
To do:
Failing at this stage; not sure what it means. Compared to before, it seems to be failing when downloading the spaCy en_core_web_sm model.
@oscarso2000 I looked up the error and unfortunately it appears to be a new issue on the SpaCy side. They are tracking the problem here: https://github.com/explosion/spaCy/issues/10564
The spaCy compile bug seems to be fixed, as someone updated typer to 0.4.1.
I will do final tests and move jupyter notebooks into the right folders after removing all secrets within the week.
@calebchiam Why do you think the 3.8 runs didn't succeed? It seemed like they were just stuck rather than failing, and you ultimately needed to cancel?
@oscarso2000 Not sure, but based on the re-run, it seems to be a persistent issue.
@calebchiam What I do notice is that all tests pass. Comparing 3.7 and 3.8, it seems like the 3.8 run just doesn't move from the tests step to post-setup. I'm trying to google whether this has happened to anyone else, but so far to no avail, as I don't really know what I should be searching for.
This is very weird: the mem check for 3.8 now passes CI, but the DB check for 3.8 doesn't. Makes no sense.
Essentially what's happening is that the tests run successfully, but something is preventing the job from ending the tests and moving on. Is it something to do with the call stack, or maybe a DB connection that is still open? The only oddity is that the other 5 checks work (and the mem check for 3.8 just miraculously worked).
Commit 73542ba952c26246091b095e8b63f2417249716b contains the same code as commit 3f9d6d6016067957779866fd9b4b1c57039946d9. The first one passes and somehow the second one doesn't. Highly odd, in my opinion. @jschluger @calebchiam @jpwchang I am still trying to create a venv for 3.8 on my laptop; I will update when I have results for that. But maybe it could also be something to do with the typer/spaCy update. Lots of open possibilities.
With commit 3f9d6d6016067957779866fd9b4b1c57039946d9, I re-ran the failed tests and they succeeded. Honestly, it might be something to do with memory, so I will check what, if anything, is breaking it.
@jpwchang @calebchiam No code has changed, but I set in_place=True again, and all 6 checks are successful after 2 runs. I am very confused.
Nice work, @oscarso2000! I'll try to review this by the end of next week (have some launch deadlines at my company), but @jpwchang, feel free to review this instead if you can find the time.
Tested on examples directory and works locally.
Updated the DB documentation for users (the .rst files in the docs folder): one is called db_setup.rst and the other is storage_options.rst.
@calebchiam given that I've been working closely with Oscar on this, it would be preferable if you could do the review (to give a more unbiased take). But let us know if that's not feasible for you anymore.
@calebchiam Thank you in advance. It is greatly appreciated.
@oscarso2000 One more minor change to request: I noticed there are some commented-out lines of code in the which_storage_mode.ipynb example notebook. Presumably these are left over from testing. We should make sure to get rid of them; since these notebooks are meant as our public-facing example/tutorial code, we need to ensure they are clear, and leftover commented-out code can be confusing.
Two more things before merge:
(the self.corpus_dirpath issue at least) Thanks!
@oscarso2000 @calebchiam I have identified a possible issue with dependencies. According to the documentation, we are presenting DB mode as an optional feature; i.e., someone who does not want to use the DB functionality can in theory install convokit and skip the MongoDB-specific steps. As such, pymongo is not listed as a requirement in setup.py.
I just tried this myself, building a local copy of the convokit package from the db-storage branch and installing it into a clean conda environment without MongoDB. Installation works, but trying to import convokit then crashes because convoKitMeta.py attempts to import DBDocumentMapping, which in turn imports bson (a pymongo component), which is of course not found since pymongo is not installed.
Do either of you know the best way to address this? I imagine we can do something similar to how we handled pytorch, which I believe is also an optional dependency?
@oscarso2000 @calebchiam Just found a second (slightly related) issue with installation. Turns out, there is an error you would run into even earlier when trying to import convokit, regardless of the mongodb status:
```
>>> import convokit
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpchang/.conda/envs/piptest/lib/python3.8/site-packages/convokit/__init__.py", line 1, in <module>
    from .model import *
  File "/home/jpchang/.conda/envs/piptest/lib/python3.8/site-packages/convokit/model/__init__.py", line 1, in <module>
    from .conversation import Conversation
  File "/home/jpchang/.conda/envs/piptest/lib/python3.8/site-packages/convokit/model/conversation.py", line 2, in <module>
    from .utterance import Utterance
  File "/home/jpchang/.conda/envs/piptest/lib/python3.8/site-packages/convokit/model/utterance.py", line 3, in <module>
    from .corpusComponent import CorpusComponent
  File "/home/jpchang/.conda/envs/piptest/lib/python3.8/site-packages/convokit/model/corpusComponent.py", line 3, in <module>
    from .convoKitMeta import ConvoKitMeta
  File "/home/jpchang/.conda/envs/piptest/lib/python3.8/site-packages/convokit/model/convoKitMeta.py", line 6, in <module>
    from convokit.storage import DBDocumentMapping, StorageManager
ModuleNotFoundError: No module named 'convokit.storage'
```
It seems like the only reason I didn't get this error the first time I tried was that back then I was testing from inside my local repo directory, and I think it was grabbing the convokit.storage from my local directory.
Ok, managed to fix the convokit.storage issue by adding it to setup.py. We still need to discuss the MongoDB dependency issue, however.
I noticed too that it was missing from setup.py and commented above, but thought it was left out by accident. I don't see any reason to make it optional though: (1) I think we can expect it to be used fairly frequently (since it should make working with any of our larger corpora significantly easier), and (2) the total size of the dependencies is ~1.5MB. We made torch optional because its package size is 50-150MB (depending on OS). Given that, I think it's fine to have these packages installed by default.
Hi All,
I see that you guys are currently working on it. I don't want to cause merge conflicts, so I'll wait until tonight. As for explanations of the code base, most of the code was written by Charlotte. I was tasked with debugging the issue where add_utterances creates a new instance instead of updating the original corpus, so I can make a few guesses here and there, but that's about it. For speaker_id in Utterance, I guess it is just storing the "username" or something. Especially for CRAFT, when we have Reddit posts/comments, we save the id that is registered to the username (the id can be found in the DOM) just as a reference.
Oscar
@oscarso2000 you can go ahead and make changes -- I won't be editing the code further until you've had a chance to resolve the comments.
@jpwchang Are you able to address the questions?
@calebchiam @jpwchang are load_vector_reprs and dump_vector_reprs supposed to be commented out?
@oscarso2000 Yep, it's obsolete, could be deleted actually.
@jpwchang config.yml needs to have a version number so that we can configure when to update it if/when we change its design in the future. We could ignore that for now and incur the cost of dealing with it later. Just putting it out there for posterity.
I'm gonna leave a task here for checking the .ipynb notebooks in /examples, some of which seem to be failing as a result of this PR:

- (Corpus.filename is not set, but can be worked around by explicitly setting the filename parameter)
- (Corpus.filter_utterances_by fails to preserve conversation metadata)
- (Corpus.filename is not set)
- (base_path, we will have to change this once we settle on what to rename that parameter)
- (corpus.dump is failing)
- (Conversation.iter_speakers() is not working)

Please tick these off for whichever ones you've checked and highlight any issues in this thread.
@oscarso2000 @jpwchang @cristiandnm
Vectors: vector_demo.ipynb is not running correctly. If you run this .ipynb and check Cell 16, you'll see that corpus.vectors is an empty set, despite vector matrices being associated with the Corpus. The same is true of Cells 33 and 35. When fixing this, please compare all cells and verify sameness.
@oscarso2000 Note that changing self.corpus_dirpath to self.filename does not resolve the issue I mentioned before, because self.filename is never initialized in corpus.py.
Here's a sample piece of code that fails as a result:
```python
# [load / initialize any corpus object]
import numpy as np

vect = np.random.random((10, 5))
matrix = ConvoKitMatrix(name='matrix', matrix=vect)
corpus.append_vector_matrix(matrix)
corpus.dump_vectors('local_dump')
```
This will raise an error: AttributeError: 'Corpus' object has no attribute 'filename'
Conversation.iter_speakers() raises an exception.
For any loaded corpus, you can run this to replicate:
```python
convo = corpus.random_conversation()
list(convo.iter_speakers())
```
This raises AttributeError: 'NoneType' object has no attribute 'get_speaker'
This also causes sigdial-demo.ipynb to fail.
@calebchiam I'll look more into the filename issue
That's odd. @calebchiam My notebook seemed to work last week, but now it doesn't. Maybe it was importing the old convokit instead of the local one.
@calebchiam @oscarso2000 I did some digging, and it seems the vector problem and the iter_speakers problem have the same root cause. If you look at corpusComponent.py (the base class for Utterance, Speaker, and Conversation), lines 28-29 in the constructor have the following code:

```python
if from_db:
    return
```

This means that in DB mode, nothing past line 29 of the constructor gets run. The problem is that both vectors and owner (the latter of which is referred to by iter_speakers) are set in code below line 29.
But beyond this I am completely at a loss, because I have no idea why the code I pasted above exists. @jschluger on the off chance you see this, would you be able to explain the intent here? Because I am hesitant to get rid of code whose purpose I don't understand, as I am afraid of breaking something downstream...
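To make the failure mode concrete, here is a simplified, hypothetical constructor (not the actual corpusComponent.py) showing why an early return skips later attribute initialization:

```python
class Component:
    """Toy stand-in for a corpus component with a DB-mode early return."""
    def __init__(self, from_db=False):
        self.id = "c1"
        if from_db:
            return  # everything below is skipped in DB mode
        self.vectors = []
        self.owner = "corpus"

mem = Component(from_db=False)
db = Component(from_db=True)
print(hasattr(db, "vectors"))  # False -- vectors/owner were never set
```

Any code that later touches db.vectors or db.owner then fails with an AttributeError or sees a None owner, which matches the symptoms described above.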
@jpwchang I have been testing with storage_mode='mem' on the corpus download, and it seems to not work.
@oscarso2000 Downloading with mem mode seems to work for me; what specific error are you seeing?
@jpwchang Downloading works, but the call in Cell 16 for corpus.vectors still gives me set() instead of ["bow"].
More precisely, setting the environment variable CONVOKIT_STORAGE_MODE (via export) to either 'db' or 'mem' doesn't help.
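For context, a storage-mode switch driven by an environment variable typically looks something like the sketch below (the helper name and validation are assumptions, not ConvoKit's actual code):

```python
import os

def default_storage_mode():
    """Hypothetical helper: pick a default storage mode from the
    CONVOKIT_STORAGE_MODE environment variable discussed above."""
    mode = os.environ.get("CONVOKIT_STORAGE_MODE", "mem")
    if mode not in ("mem", "db"):
        raise ValueError(f"unknown storage mode: {mode!r}")
    return mode
```

Note that if such a value is read once at import time, exporting the variable after the package is already imported would have no effect, which is one possible reason the export appears not to help here.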
@oscarso2000 @calebchiam Ok, this one was a bit of a doozy, but I think I have it figured out?
The reason Oscar is seeing corpus.vectors not work even in mem mode is a weirdness in how corpus.vectors is defined. corpus.vectors is a property which, to external-facing clients, appears to be of type set; as such, the public function add_vector works by calling the set's add method. However, in reality the corpus.vectors property is backed by the "vectors" entry in the index, which is not a set but a list, and there is a type conversion in the getter and setter to switch between the two. This type conversion doesn't play well with the implicit mutability offered by collection-type properties, so changes do not actually get recorded.
I'm not sure why this hidden type conversion exists, but we can solve the issue by deciding whether we want vectors to be a set or a list, and making it consistent throughout. Through a toy example I've confirmed that implicit mutability works with properties that are simply a list or simply a set.
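The pitfall described above can be reproduced with a tiny hypothetical class (not ConvoKit's code): a property whose getter converts a backing list to a fresh set silently discards mutations made through the property:

```python
class Index:
    """Toy illustration of a type-converting property getter."""
    def __init__(self):
        self._vectors = []  # backing store is a list

    @property
    def vectors(self):
        # Conversion: each access builds a *new* set from the list
        return set(self._vectors)

idx = Index()
idx.vectors.add("bow")  # mutates the temporary set, not the backing list
print(idx.vectors)      # set() -- the addition was silently lost
```

With a plain list (or plain set) attribute, idx.vectors.append(...) would mutate the real backing store, which is why making the type consistent fixes the problem.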
@jpwchang We can make vectors a list. IIRC, I made it a set to optimize the check for whether a vector is present, but it's not a big deal to lose that optimization. The conversion to a list in the index is presumably because sets aren't a concept in JSON, so we have to convert to a list before dumping the index to JSON format.
@calebchiam The vectors field is now a list, and I can confirm the notebook now works as expected in mem mode.
Fixed a small bug with vectors (duplicate entries in the list), but corpus.dump at the end of the notebook is still failing.
Implementing abstract storage for ConvoKit: all Corpus and CorpusComponent objects now have a self.storage instance variable referring to a convokit.StorageManager object, which manages the storage for that object. Within the StorageManager class, two distinct storage options are abstracted away from the end user: in-memory storage (what ConvoKit has always provided) and database storage. I implement the DBCollectionMapping, DBDocumentMapping, MemCollectionMapping, and MemDocumentMapping classes, extending the MutableMapping interface, to store collections of items, and the data for a single item, in the two storage modes.
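As a rough illustration of the MutableMapping approach described above, here is a minimal in-memory mapping in the spirit of MemDocumentMapping (a simplified sketch, not the actual ConvoKit class):

```python
from collections.abc import MutableMapping

class MemDocumentMapping(MutableMapping):
    """Stores the fields of a single item in a plain dict.
    Simplified sketch; the real class also ties into a StorageManager."""
    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)

doc = MemDocumentMapping()
doc["text"] = "hello"
print(dict(doc))  # {'text': 'hello'}
```

Because MutableMapping supplies mixin methods (get, update, pop, etc.) on top of these five abstract methods, a DB-backed sibling class can expose the identical dict-like interface while reading and writing MongoDB documents instead.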