CornellNLP / ConvoKit

ConvoKit is a toolkit for extracting conversational features and analyzing social phenomena in conversations. It includes several large conversational datasets along with scripts exemplifying the use of the toolkit on these datasets.
https://convokit.cornell.edu/documentation/
MIT License

Memory consumption of convokit.model.corpus.Corpus() #74

Closed: santoshbs closed this issue 4 years ago

santoshbs commented 4 years ago

I am trying to read a large subreddit into a Corpus object with the following command:

    Corpus(filename=f,
           exclude_utterance_meta=True,
           exclude_conversation_meta=True,
           exclude_overall_meta=True,
           exclude_speaker_meta=True)

The utterances.jsonl file (for the nba subreddit) is 22 GB. The moment I run this, all of my 90+ GB of RAM is consumed. I was wondering if I am doing anything wrong.

cristiandnm commented 4 years ago

That sounds like the correct ratio... so you are not doing anything wrong; the Python convenience just comes at a pretty heavy RAM cost. If you have suggestions for making things more memory efficient, don't hesitate to let us know.

calebchiam commented 4 years ago

Also note that exclude_[...]_meta takes in a list of metadata keys to exclude, not a boolean.
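
For concreteness, a minimal sketch of the intended usage, assuming the constructor accepts lists of key names as described; the specific metadata keys below are hypothetical examples, not verified nba keys:

    # Each exclude_*_meta parameter takes a list of metadata key names
    # to skip during loading, not a boolean.
    from convokit import Corpus

    corpus_path = "nba"  # placeholder: path to the unzipped corpus directory
    corpus = Corpus(filename=corpus_path,
                    exclude_utterance_meta=["score"],
                    exclude_conversation_meta=["title"],
                    exclude_overall_meta=["num_posts"],
                    exclude_speaker_meta=["num_comments"])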

santoshbs commented 4 years ago

> That sounds like the correct ratio... so you are not doing anything wrong; the Python convenience just comes at a pretty heavy RAM cost. If you have suggestions for making things more memory efficient, don't hesitate to let us know.

Since I was interested only in analyzing the utterance text of the subreddit, I am now reading the utterances.jsonl file directly in Python rather than using Corpus(). Comparatively, this hardly consumes any RAM and is quite fast.
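
For reference, a minimal sketch of that direct-read approach, assuming the standard one-JSON-object-per-line layout of utterances.jsonl (the "text" field name follows ConvoKit's corpus format; the file path is a placeholder):

    import json

    # Stream utterances.jsonl one line at a time; only a single
    # utterance is ever held in memory.
    with open("nba/utterances.jsonl") as f:
        for line in f:
            utt = json.loads(line)
            text = utt["text"]  # analyze the utterance text here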

AnnaWegmann commented 4 years ago

Thank you for this cool project!

I am having the same problem as santoshbs. I would like to work with one of the bigger subreddits (probably politics), which is huge. I am guessing this is impossible then if I have a maximum of 100 GB of RAM? Alternatively, is it possible to use only parts of the subreddit (e.g., a given year) with selectors? (Maybe I missed this in the docs.)

An attempt at lowering RAM cost: Is indexing files (and using Cython) an option?

calebchiam commented 4 years ago

This is technically possible because your machine will just store the Corpus partly in RAM and partly on disk, after which you could filter it down to a given year and work with that instead.

Another option is to use the utterance_start_index and utterance_end_index fields to selectively load only the utterances at specific line indices. The utterances are stored in order of timestamp IIRC, so you could try a few selective loads to figure out where the utterances of a given year begin, for example.
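
A hedged sketch of that probing idea, assuming utterance_start_index and utterance_end_index work as described above and that the corpus's utterances carry Unix timestamps (the path and index values below are arbitrary examples):

    from datetime import datetime, timezone
    from convokit import Corpus

    # Load only utterances at line indices 1,000,000-1,010,000 (arbitrary probe).
    probe = Corpus(filename="politics",  # placeholder path
                   utterance_start_index=1_000_000,
                   utterance_end_index=1_010_000)
    stamps = [utt.timestamp for utt in probe.iter_utterances()]

    # Inspect the probe's time range; shift the indices (binary-search
    # style) until the slice lands on the year you want.
    print(datetime.fromtimestamp(min(stamps), tz=timezone.utc),
          datetime.fromtimestamp(max(stamps), tz=timezone.utc))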

re: lowering RAM cost, could you elaborate on what you mean by indexing files / using Cython? This is a pain point for us, so we're happy to hear your suggestions.

AnnaWegmann commented 4 years ago

Cool, thanks.

Ah, okay. So if I call Corpus(download("subreddit-politics")), will this only keep part of the Corpus in RAM? Do you have an approximation of what portion it stores in RAM?

I think C and C++ have rather fast library implementations of ftell and fseek (http://www.cplusplus.com/reference/cstdio/ftell/, http://www.cplusplus.com/reference/cstdio/fseek/), which let you jump almost directly to a given line (or a given number of characters from a given position) in a file. It then becomes a whole other issue to decide which things to keep in RAM (probably mostly lookup tables, and possibly lookup tables of lookup tables) and how to make use of them. This might decrease the RAM cost significantly while only increasing runtime a little. But I am not an expert here; it might be that this idea is too simplistic or that the runtime increase is too big.

Maybe you do this already? Especially what you said about utterance_start_index and utterance_end_index sounds like it could be implemented via a Cython call to a C/C++ file stream? -- I also found this helpful: http://www.code-corner.de/?p=183
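
To illustrate the offset-index idea in the library's own language: Python's f.seek() and f.tell() wrap the same C-level fseek/ftell, so a pure-Python sketch of the lookup-table approach (the helper names here are made up for illustration) might look like:

    import json

    def build_line_index(path):
        """Return a list where offsets[i] is the byte offset of line i."""
        offsets = []
        with open(path, "rb") as f:
            while True:
                pos = f.tell()
                if not f.readline():
                    break
                offsets.append(pos)
        return offsets

    def read_utterance(path, offsets, i):
        """Seek straight to line i and parse it as a single utterance."""
        with open(path, "rb") as f:
            f.seek(offsets[i])
            return json.loads(f.readline())

The offsets list is small relative to the data (one integer per utterance), which is the lookup-table trade-off described above: the full file stays on disk, and each access costs one seek plus one line read.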

calebchiam commented 4 years ago

Ah, to clarify, I mean that if you've hit your RAM limit, your machine would likely resort to loading the rest of the Corpus via virtual memory (i.e., swapping to disk). Working with the Corpus in this state is possible but not ideal because of thrashing from swapping data between RAM and disk -- leading to very slow data processing. (Drawing on my intro computer systems knowledge here, so take this with a pinch of salt.)

Thanks for the suggestions!

utterance_start_index and utterance_end_index use vanilla Python file operations -- we could speed up the I/O here, but the main bottleneck is the Corpus object initialization, so the lookup tables you mentioned for deciding what to load into RAM in the first place are what's crucial here.

@jpwchang might have some comments since he made a similar proposal in the past.

AnnaWegmann commented 4 years ago

Ah. Makes sense. Thanks!