finalfusion / finalfrontier

Context-sensitive word embeddings with subwords. In Rust.
https://finalfusion.github.io/finalfrontier

Some general vocab questions #70

Closed jguhlin closed 3 years ago

jguhlin commented 5 years ago

Hello. I'm a genomics researcher interested in using finalfrontier to create embeddings based on DNA and protein sequences. Unfortunately, I'm a bit new to Rust and very new to the finalfrontier codebase. I've got a few issues right away that I need some help with (and I'm likely to have many more).

For DNA we use kmers (like ngrams, as DNA is essentially one very large continuous string) with a sliding window approach, and I have code to count a large corpus in about 3 hours (300 million unique kmers, 50 GB compressed -- almost as fast as FastText, without the extra processing on the front end or the extra few hundred GB of data). I'd like to get this into a SubwordVocab.
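
As an aside, here is a minimal, std-only sketch of that sliding-window kmer extraction; kmers is a hypothetical helper for illustration, not anything from finalfrontier:

fn kmers(seq: &str, k: usize) -> impl Iterator<Item = &str> {
    // DNA is ASCII, so byte windows map cleanly back to &str slices.
    // Note: slice::windows panics if k == 0.
    seq.as_bytes()
        .windows(k)
        .map(|w| std::str::from_utf8(w).expect("ASCII DNA is valid UTF-8"))
}

fn main() {
    // Sliding window of length 4 over a toy sequence.
    for kmer in kmers("ACGTACGT", 4) {
        println!("{}", kmer); // ACGT, CGTA, GTAC, TACG, ACGT
    }
}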

1) It would be great to get a function that supplements count() (vocab/mod.rs) and lets me pass known counts directly. As the corpus is already processed, this would speed things up compared to calling count() multiple times. I can create a PR if that would help.

2) Is it possible to create a way to skip bracketing for ngram creation? Happy to create a PR as well.

3) Is it possible to create a set of specific ngram lengths instead of a range: e.g. 9 and 11, rather than 9, 10, and 11?

4) I am storing everything as Vec<u8>, but it seems like everything is String in finalfusion. This is more of a performance question: will it hurt anything if I switch all of my kmers over to String?

Or -- should I focus on creating a different vocab implementation instead, so as not to mess up anything you have already?

Any and all help is greatly appreciated!

Cheers, --Joseph

sebpuetz commented 5 years ago

Hi,

thanks for your interest!

1. It would be great to get a function that supplements count() (vocab/mod.rs) and lets me pass known counts directly. As the corpus is already processed, this would speed things up compared to calling count() multiple times. I can create a PR if that would help.

I think the straightforward way would then be to not use our VocabBuilder at all and construct a SubwordVocab through its new() fn. It takes a config struct, a sorted Vec<CountedType> holding the vocab entries (indices are assigned in the given order; no sorting happens in the constructor), the total number of tokens in the source (including those that didn't make the min-count cut), and a finalfusion::Indexer.
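
As a rough, self-contained sketch of preparing those arguments (CountedType here is a local stand-in for the finalfrontier type, and the final hand-off to SubwordVocab::new is only described in a comment, since the exact signature should be taken from finalfrontier's vocab module):

// Local stand-in for finalfrontier's counted vocab entry type, so this
// sketch compiles on its own.
struct CountedType {
    word: String,
    count: usize,
}

fn main() {
    // Counts coming out of your own kmer-counting pass.
    let mut entries = vec![
        CountedType { word: "ACGTACGTA".to_string(), count: 120 },
        CountedType { word: "CGTACGTAC".to_string(), count: 340 },
    ];
    // Indices are assigned in the given order, so sort by descending count.
    entries.sort_by(|a, b| b.count.cmp(&a.count));
    // Total number of tokens in the corpus, *including* those below min_count.
    let n_tokens: usize = 460;
    // These, together with a config and a finalfusion::Indexer, are what
    // SubwordVocab::new expects according to the description above.
    println!("{} entries, {} tokens", entries.len(), n_tokens);
}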

2. Is it possible to create a way to skip bracketing for ngram creation? Happy to create a PR as well.

We don't currently have such an option, and the bracketing happens not only during vocab creation in finalfrontier but also during lookup in finalfusion. Making it optional could be complicated because we'd want consistency across our different tools.

There is the work-around of manually creating a finalfusion::NGramIndexer: it just takes a Vec<String> and assigns indices in the given order. If you rely on finalfusion to do your lookups, the items would still be bracketed, but since NGramIndexer explicitly stores the ngrams, it simply wouldn't retrieve any bracketed items. Note that this implementation is quite different from fastText-style bucket hashing, where many ngrams' indices collide.

3. Is it possible to create a set of specific ngram lengths instead of a range: e.g. 9 and 11, rather than 9, 10, and 11?

Pretty much the same as above: if you construct the NGramIndexer yourself, you can provide only the ngrams you'd like. During lookup, all ngram lengths in the specified range would get extracted, but only those present in the indexer would get an embedding.

Again, adding a setting to skip certain lengths would be complicated because we'd have to break compatibility with our other tools which just expect two integers indicating a range.
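
As a rough, std-only sketch of that work-around: extract only the ngram lengths you care about yourself and hand the result to the indexer (the NGramIndexer call itself is left as a comment, since its exact constructor should be checked against the finalfusion version in use):

fn ngrams_of_lengths(token: &str, lengths: &[usize]) -> Vec<String> {
    // Collect only the requested ngram lengths, e.g. 9 and 11 but not 10.
    let bytes = token.as_bytes();
    lengths
        .iter()
        .flat_map(|&n| bytes.windows(n))
        .map(|w| String::from_utf8(w.to_vec()).expect("ASCII input"))
        .collect()
}

fn main() {
    let ngrams = ngrams_of_lengths("ACGTACGTACGTA", &[9, 11]);
    // `ngrams` is the Vec<String> that would be handed to the NGramIndexer,
    // which assigns indices in the given order.
    println!("{:?}", ngrams);
}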

4. I am storing everything as Vec<u8>, but it seems like everything is String in finalfusion. This is more of a performance question: will it hurt anything if I switch all of my kmers over to String?

A String is really just a Vec<u8> under the hood, but guaranteed to be valid UTF-8. If your data is actually just bytes, then you'll probably incur a slight performance hit, but one that's most likely negligible, as the runtime is dominated by the computation of embeddings and the dot products during training.
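
A tiny sketch of that conversion (std only); DNA and protein kmers are ASCII, so the UTF-8 check is trivially satisfied:

fn kmer_to_string(kmer: Vec<u8>) -> String {
    // ASCII bytes are always valid UTF-8, so this never fails for DNA kmers.
    String::from_utf8(kmer).expect("kmers are ASCII")
}

fn main() {
    let s = kmer_to_string(b"ACGTACGTA".to_vec());
    assert_eq!(s, "ACGTACGTA");
}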

Or -- should I focus on creating a different vocab implementation instead, so as not to mess up anything you have already?

I believe you'd be writing your own binary to run the training, similar to what we have in finalfrontier-utils. In such a crate you could mostly rely on the finalfusion and finalfrontier APIs, although some of what I described above is not released yet, so you'd have to depend on the versions from GitHub. E.g. the explicit ngram-lookup models based on NGramIndexer are not released yet.

If you have additional questions or run into any problems along the way, feel free to comment here. Would be interesting to see what kind of edges our API has and whether we can smooth them out.

Hope that helps for now!

Best, Sebastian

jguhlin commented 5 years ago

Thanks, it does. I'm going to get it working then backtrack and build the customized NGramIndexer. I have been using finalfrontier-utils as my reference and replacing functionality as needed.

Cheers

jguhlin commented 5 years ago

Hey, so I bet this is something stupid I'm missing, but I've been banging my head against it all day.

error[E0277]: the trait bound `finalfusion::chunks::vocab::VocabWrap: std::convert::From<finalfrontier::vocab::subword::SubwordVocab<finalfrontier::config::NGramConfig, finalfusion::subword::NGramIndexer>>` is not satisfied
  --> oracular/src/main.rs:78:5
   |
78 |     train(vocab, filename, kmer_size)
   |     ^^^^^ the trait `std::convert::From<finalfrontier::vocab::subword::SubwordVocab<finalfrontier::config::NGramConfig, finalfusion::subword::NGramIndexer>>` is not implemented for `finalfusion::chunks::vocab::VocabWrap`
   |
   = help: the following implementations were found:
             <finalfusion::chunks::vocab::VocabWrap as std::convert::From<finalfusion::chunks::vocab::SimpleVocab>>
             <finalfusion::chunks::vocab::VocabWrap as std::convert::From<finalfusion::chunks::vocab::SubwordVocab<finalfusion::compat::fasttext::indexer::FastTextIndexer>>>
             <finalfusion::chunks::vocab::VocabWrap as std::convert::From<finalfusion::chunks::vocab::SubwordVocab<finalfusion::subword::HashIndexer<fnv::FnvHasher>>>>
             <finalfusion::chunks::vocab::VocabWrap as std::convert::From<finalfusion::chunks::vocab::SubwordVocab<finalfusion::subword::NGramIndexer>>>
   = note: required because of the requirements on the impl of `std::convert::Into<finalfusion::chunks::vocab::VocabWrap>` for `finalfrontier::vocab::subword::SubwordVocab<finalfrontier::config::NGramConfig, finalfusion::subword::NGramIndexer>`
   = note: required by `liboracular::embeddings::train`

To convert from my dict to the vocab the fn def looks like

pub fn build_vocab_from_finaldict<V, C>(config: C, dict: FinalDict) -> V
where
    V: Vocab<VocabType = String> + From<VocabBuilder<SubwordVocabConfig<NGramConfig>, String>>,
    VocabBuilder<SubwordVocabConfig<NGramConfig>, String>: Into<V>,
{
    let config = SubwordVocabConfig {
        discard_threshold: 1e-4,
        min_count: 5,
        max_n: 11,
        min_n: 9,
        indexer: NGramConfig { min_ngram_count: 50 },
    };

    let mut builder = VocabBuilder::new(config);

And the train fn looks like:

pub fn train<V>(vocab: V, filename: &str, kmer_size: usize)
where
    V: Vocab<VocabType = String> + Into<VocabWrap> + Clone + Send + Sync + 'static,
    V::Config: Serialize,
    for<'a> &'a V::IdxType: IntoIterator<Item = u64>,

And the fn call looks like:

    let vocab: SubwordVocab<_, _> = build_vocab_from_finaldict(config, final_dict);
    train(vocab, filename, kmer_size)

Is it something obvious I'm missing? I'm using the current master branches of finalfrontier and finalfusion (finalfusion-rust), pulled earlier today, so it could be some unimplemented code, but it's probably something I've done. Any ideas? Cheers

sebpuetz commented 5 years ago

I'm not really sure about what's going on here, the conversion should be implemented via https://github.com/finalfusion/finalfrontier/blob/9216dfdf0c8941895990fd495961fe99f4e7ceb4/finalfrontier/src/vocab/subword.rs#L211-L228

Have you tried replacing the type parameters with specific types? I.e.:

pub fn build_vocab_from_finaldict(dict: FinalDict) -> SubwordVocab<NGramConfig> {
    let config = SubwordVocabConfig {
        discard_threshold: 1e-4,
        min_count: 5,
        max_n: 11,
        min_n: 9,
        indexer: NGramConfig { min_ngram_count: 50 },
    };
    let mut builder = VocabBuilder::new(config);
    // [...]
    builder.into()
}
jguhlin commented 5 years ago

I switched it to:

pub fn build_vocab_from_finaldict(dict: FinalDict) -> SubwordVocab<NGramConfig, String>

but getting the same error (SubwordVocab needs 2 type parameters, so had to add String).

Thanks for the link to where the conversion is done, I'll dig into it a bit more.

jguhlin commented 5 years ago

Yeah, removing the specific finalfusion rev="c64254a9e738c14e70560db5013ec169f463a057" causes finalfrontier-utils to hit the same error, so the problem is somewhere around there.

sebpuetz commented 5 years ago

I switched it to:

pub fn build_vocab_from_finaldict(dict: FinalDict) -> SubwordVocab<NGramConfig, String>

but getting the same error (SubwordVocab needs 2 type parameters, so had to add String).

Thanks for the link to where the conversion is done, I'll dig into it a bit more.

SubwordVocab<NGramConfig, String> should be SubwordVocab<NGramConfig, NGramIndexer> iirc. It's hard for me to tell what's going on without seeing more of the code.

jguhlin commented 5 years ago

I left it as String, switched to the same versions as finalfusion-util, and it's training now.

I can definitely share the repository once I'm in front of the computer again.

jguhlin commented 5 years ago

Hey again. I'm trying to build the vocab directly instead of calling builder.count() and it seems Word::new (impl CountedType::new) is pub(crate). Is there any rusty way around that? Or would it be possible to make it pub?

I've got embeddings for single genomes (woo!), and it's decently fast and the loss goes down quickly. For the large dataset I'm using, the estimate is about ~12 days (FastText somehow manages this in ~8 hrs). I'd like to see if I can get that down by changing params; otherwise, the large-dataset use case may just become a separate project, and single genomes etc. will use this one.

sebpuetz commented 5 years ago

Hey again. I'm trying to build the vocab directly instead of calling builder.count() and it seems Word::new (impl CountedType::new) is pub(crate). Is there any rusty way around that? Or would it be possible to make it pub?

I think we should switch from Vec<Word> to just Vec<String>. The counts are never used after vocab construction, so I'd be fine with changing that implementation. @danieldk, what's your take? Otherwise, I also don't see a reason to seal the constructor. Without either of these changes, it's not possible to construct a vocab without going through VocabBuilder.

I've got embeddings for single genomes (woo!), and it's decently fast and the loss goes down quickly. For the large dataset I'm using, the estimate is about ~12 days (FastText somehow manages this in ~8 hrs). I'd like to see if I can get that down by changing params; otherwise, the large-dataset use case may just become a separate project, and single genomes etc. will use this one.

Nice! First of all, going up to 12 days from 8 hours sounds really extreme. A bit of a stupid question, but is the program running in release mode?

I remember that you mentioned ngram ranges between 9 and 11, that's fairly long and depending on how the data is structured you'll get loads of unique ngrams. In fastText embeddings, these ngrams are simply hashed and ngram-embedding is retrieved based on the hash. The number of buckets is set ahead of time as a hyperparameter and is not influenced by your training data; depending on what your data looks like, you get more or fewer hashing collisions. You could get that behaviour in finalfrontier through the BucketIndexer. With the NGramIndexer you get an explicit mapping for each unique ngram that appears more frequently than min_ngram_count times.

I.e., if you stick to the default finalfrontier settings for bucket vocabs, you'll have 2^21 available buckets (iirc, fastText reserves ~2 million buckets); whatever number of ngrams you encounter, they will be squashed into this space. With the NGramIndexer - essentially a hash table - the size of your ngram lookup grows with the number of ngrams that make the cut.
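
To make the difference concrete, a small illustration of the two strategies (this is just the idea, not the finalfusion implementation):

use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Bucket indexing: the ngram is hashed into a fixed number of buckets
// (2^21 by default in finalfrontier), so distinct ngrams can collide.
fn bucket_index(ngram: &str, buckets_exp: u32) -> u64 {
    let mut hasher = DefaultHasher::new();
    ngram.hash(&mut hasher);
    hasher.finish() % (1u64 << buckets_exp)
}

// Explicit indexing: one entry per ngram that made the min_ngram_count cut,
// so the lookup table grows with the number of retained ngrams.
fn explicit_index(ngram: &str, table: &HashMap<String, usize>) -> Option<usize> {
    table.get(ngram).copied()
}

fn main() {
    let mut table = HashMap::new();
    table.insert("ACGTACGTA".to_string(), 0usize);
    println!("bucket:   {}", bucket_index("ACGTACGTA", 21));
    println!("explicit: {:?}", explicit_index("ACGTACGTA", &table));
}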

danieldk commented 5 years ago

I think we should switch from Vec<Word> to just Vec<String>. The counts are never used after vocab construction, so I'd be fine with changing that implementation.

But the vocab constructors need the counts for computing the discard probabilities. I am fine with making the constructor public though.

Nice! First of all, going up to 12 days from 8 hours sounds really extreme. A bit of a stupid question, but is the program running in release mode?

fastText has always been a bit faster. It might be that the compiler misses some optimizations in our case (the hot loops should be more or less the same -- dot products and (scaled) additions -- and are vectorized). I did notice on one machine that there was little or no loop unrolling, but I didn't look any further. I also want to check whether relying on ndarray's methods is still slower. I hand-rolled the SIMD intrinsics because it was faster at the time, but I recently saw good compiler optimization of ndarray methods in another project.

sebpuetz commented 5 years ago

But the vocab constructors need the counts for computing the discard probabilities. I am fine with making the constructor public though.

Missed that!

As another general suggestion, it could be worth implementing a Vocab of your own (say KMerVocab) and impl From<KMerVocab> for VocabWrap (probably via finalfusion::SubwordVocab<NGramIndexer>) on it. Additionally, your vocab would need to return some KMerVocabConfig that implements serde::Serialize. Then you could simply swap out the SubwordVocab with your own type in SkipGramTrainer::new.

Some methods would have to be copied over from finalfrontier::vocab::mod.rs, since e.g. create_discards and create_indices are pub(crate).

You'd get full control over e.g. the bracketing of the vocab items, and you could extract only the ngram lengths that you want.
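
A very rough skeleton of that idea; the names and fields below are only a sketch of the shape, not finalfrontier/finalfusion API, and the actual Vocab trait methods, the Serialize requirement on the config, and the From<KMerVocab> for VocabWrap conversion would all have to be filled in against the versions you build with:

use std::collections::HashMap;

// Config that the trainer serializes into the output; in the real setup this
// would need to implement serde::Serialize.
pub struct KMerVocabConfig {
    pub k: usize,
    pub min_count: usize,
    pub discard_threshold: f32,
}

// Custom vocab over kmers; the finalfrontier Vocab trait (index lookup,
// discard probabilities, etc.) would be implemented on top of these fields.
pub struct KMerVocab {
    pub config: KMerVocabConfig,
    pub words: Vec<String>,              // kmers; indices follow this order
    pub indices: HashMap<String, usize>, // kmer -> index
    pub n_tokens: usize,                 // total token count, for discards
}

// The conversion needed for writing models would then go roughly:
//   impl From<KMerVocab> for VocabWrap { ... }
// probably by building a finalfusion::SubwordVocab<NGramIndexer> from `words`.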

jguhlin commented 5 years ago

Nice! First of all, going up to 12 days from 8 hours sounds really extreme. A bit of a stupid question, but is the program running in release mode?

Release mode with native optimizations. I'm convinced it's likely something on my end; I'm able to do a much better job with the training data vs. what FastText can give me, so I'm probably sending too much data. But I need to get vocab generation faster and serializable before I can start tuning parameters. It could be that a single epoch is enough for such a large dataset, and only a single sliding window over the forward and reverse complement, too.

But the vocab constructors need the counts for computing the discard probabilities. I am fine with making the constructor public though.

I'm also hoping to use the counts to reduce the impact of average weights (since kmers can repeat multiple times in a single sequence), so I'd like to keep them.

I remember that you mentioned ngram ranges between 9 and 11, that's fairly long and depending on how the data is structured you'll get loads of unique ngrams. In fastText embeddings, these ngrams are simply hashed and ngram-embedding is retrieved based on the hash.

Thanks, good to know. I actually think I could set it to a single length and still get most of the benefit. At 9 there should only be 1_953_125 possible kmers (4 letters plus an N). The number of words is 1_220_703_125, but in practice it is less, as I remove more than 3 N's in a row.

As another general suggestion, it could be worth implementing a Vocab of your own (say KMerVocab) and impl From<KMerVocab> for VocabWrap (probably via finalfusion::SubwordVocab<NGramIndexer>) on it.

Going to give this a quick try, but I'll probably need some more assistance with it.

jguhlin commented 5 years ago

What's the story with the multiple threads? I'm using 48 - 64 threads.

jguhlin commented 5 years ago

Sorry for the spam.

Trying to implement my own KmerVocab and KmerVocabConfig, but I'm having an issue with WordWithSubwordsIdx being private, and the WordIdx trait being private as well (making it public would let me re-implement it on my own, although it'd be pretty much the same).

sebpuetz commented 5 years ago

What's the story with the multiple threads? I'm using 48 - 64 threads.

There is no lock on the parameter matrices, so if multiple threads update embeddings at the same time, some updates can be discarded. Depending on the structure of the corpus (i.e. how likely it is that different threads update the same embeddings at the same time), these collisions are more or less frequent. For our corpus it turned out that the performance on a (toy-ish) analogy induction task drops quite a lot if we go up to 30 threads from 20. That is in line with what other researchers found (@danieldk linked that over at https://github.com/finalfusion/finalfrontier/issues/61#issuecomment-540997030).

We use that analogy task to verify that the structure of the embedding spaces stays similar across different releases. Since we got a large drop, we assumed we had introduced a bug in a recent release, but it turned out to be down to the number of threads making updates in parallel. If you're using 48-64 threads you might be running into the same problem. I can't say how it affects your representations, but it might be worth checking results with a lower thread count, too. We haven't evaluated the "bad" models on a downstream task, so I can't tell whether the embeddings trained with 30+ threads are bad in general or just bad for analogies.

Trying to implement my own KmerVocab and KmerVocabConfig, but I'm having an issue with WordWithSubwordsIdx being private, and the WordIdx trait being private as well (making it public would let me re-implement it on my own, although it'd be pretty much the same).

I guess there's nothing bad about making the idx module pub? At least the trait should be implementable from the outside. It surprises me that Rust isn't complaining about leaking a crate-private type through Vocab::IdxType: WordIdx.

If @danieldk is fine with that module being pub that should be a quick fix.

jguhlin commented 5 years ago

Ah thanks, that makes sense. I think I'm fine with the large vocab and the good loss, but it could be worth checking out.

Thanks. I'll wait for the pub fix before I continue on too much.

Here's something really minor I found: the capitalization of SkipGram differs for the Directional variant (Skipgram). Not sure if it's a typo or on purpose, and I don't think it matters enough to be its own issue.

ModelType::SkipGram, ModelType::DirectionalSkipgram, ModelType::StructuredSkipGram,

danieldk commented 5 years ago

Not sure if it's a typo or on purpose, and I don't think it matters enough to be its own issue.

ModelType::SkipGram, ModelType::DirectionalSkipgram, ModelType::StructuredSkipGram,

That's an oversight, should definitely be fixed.

sebpuetz commented 5 years ago

Sorry for the spam.

Trying to implement my own KmerVocab and KmerVocabConfig, but I'm having an issue with WordWithSubwordsIdx being private, and the WordIdx trait being private as well (making it public would let me re-implement it on my own, although it'd be pretty much the same).

#83 exposes the module

jguhlin commented 5 years ago

Having an issue with the From conversion: the fields of SubwordVocab are private, so I can't do the conversion without going through a function (which defeats the point of doing it myself).

error[E0451]: field `indexer` of struct `finalfusion::prelude::SubwordVocab` is private
  --> liboracular/src/kmervocab.rs:64:13
   |
64 |             indexer: v.indexer,
   |             ^^^^^^^^^^^^^^^^^^ field `indexer` is private
...etc

I changed these lines in finalfusion-rust/src/chunks/vocab.rs to pub:

pub struct SubwordVocab<I> {
    pub indexer: I,
    pub indices: HashMap<String, usize>,
    pub words: Vec<String>,
    pub min_n: u32,
    pub max_n: u32,
}

And it's compiling (on to the next issue). There may be a way around this that I missed as well.

But this may not be the approach you guys want, and I'd like to be compatible with whatever you prefer. I've also made the repo public, and the files are here (if you are at all curious):

https://github.com/jguhlin/oracular/blob/master/liboracular/src/kmervocab.rs
https://github.com/jguhlin/oracular/blob/master/liboracular/src/vocab.rs
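
For reference, the conversion those pub fields enable looks roughly like this (a sketch against the locally patched finalfusion; KmerVocab and its fields are placeholders for whatever the custom vocab stores, and the target's field names match the struct shown above):

impl From<KmerVocab> for SubwordVocab<NGramIndexer> {
    fn from(v: KmerVocab) -> Self {
        // Direct field-by-field construction, only possible with pub fields.
        SubwordVocab {
            indexer: v.indexer,
            indices: v.indices,
            words: v.words,
            min_n: v.min_n,
            max_n: v.max_n,
        }
    }
}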

jguhlin commented 5 years ago

Also having this issue, which I've been banging my head against for two days:

error[E0599]: no method named `write_model_binary` found for type `finalfrontier::TrainModel<finalfrontier::SkipgramTrainer<finalfrontier::util::ReseedOnCloneRng<rand_xorshift::XorShiftRng>, V>>` in the current scope
   --> liboracular/src/embeddings.rs:161:10
    |
161 |         .write_model_binary(&mut output_writer, train_info)
    |          ^^^^^^^^^^^^^^^^^^ method not found in `finalfrontier::TrainModel<finalfrontier::SkipgramTrainer<finalfrontier::util::ReseedOnCloneRng<rand_xorshift::XorShiftRng>, V>>`
    |
    = note: the method `write_model_binary` exists but the following trait bounds were not satisfied:
            `finalfrontier::TrainModel<finalfrontier::SkipgramTrainer<finalfrontier::util::ReseedOnCloneRng<rand_xorshift::XorShiftRng>, V>> : finalfrontier::WriteModelBinary<_>`

Training works absolutely fine, and I had this error with SubwordVocab and now with KmerVocab. I think I had it working before using NGramIndexer, but I can't reproduce that now...

Edit: I've done some more playing around and it appears there is something with importing traits that I'm not understanding / implementing properly. You can see it in the test here: https://github.com/jguhlin/oracular/blob/master/liboracular/src/kmervocab.rs#L295

(Which is nearly an exact copy & paste from one of your tests, just removing some other code). Results in this error:

error[E0277]: the trait bound `finalfusion::subword::HashIndexer<fnv::FnvHasher>: finalfusion::subword::BucketIndexer` is not satisfied
   --> liboracular/src/kmervocab.rs:304:70
    |
304 |         let vocab: SubwordVocab<_, FinalfusionHashIndexer> = builder.into();
    |                                                                      ^^^^ the trait `finalfusion::subword::BucketIndexer` is not implemented for `finalfusion::subword::HashIndexer<fnv::FnvHasher>`
jguhlin commented 5 years ago

Looks like the previous issues were a version mismatch. I switched finalfusion-rust back to rev c64254a9e738c14e70560db5013ec169f463a057, made the edits locally, and everything seems to be working again (and saving the models!).

Now I'm onto getting everything into python (for theory-testing and working with a collaborator). Cheers

sebpuetz commented 5 years ago

Sorry, I got lost with some other work, glad to hear that everything is working now.

Looks like the previous issues were a version mismatch. I switched finalfusion-rust back to rev c64254a9e738c14e70560db5013ec169f463a057, made the edits locally, and everything seems to be working again (and saving the models!).

The compiler errors are pretty confusing with version mismatches, but these particular errors can also happen when the trait is not in scope, which at times is no less confusing.

Did you find out what the reason was for the errors?

jguhlin commented 5 years ago

Version mismatch. I tried to get finalfrontier to work with finalfusion-rust, but ndarray errors popped up due to different versions. Switching to the GitHub revision and just making the edits (mostly pubs) by hand worked.

I think my only outstanding request now is to make the finalfusion SubwordVocab fields public, or to create a constructor for them. I'll just track the master branches for a while until things settle into a release (no hurry here). This is exactly where I want to be, and the different models are something fastText doesn't have, so I want to compare them.

jguhlin commented 5 years ago

Hey, hope all is well.

Switching to be in sync with the updated finalfusion-rust and using ExplicitSubwords instead of NGram..., but I'm having an issue with rand and rand_xorshift being two different versions between the modules. Any suggestions, or is it easy to fix? I'm happy to try to do it and set up a PR if that'd help.

Edit: write_model_binary was a version mismatch between various things, fixed now. I manually changed the rand/rng/zipf dependencies and everything is working; I'm in sync with finalfusion 0.11.

If you could make SubwordVocab have pub fields, that'd be great. I believe there is a new fn, but I think it brackets somewhere down the line, so I can't use that. If that's not the case, let me know and I'll dig through again.

pub struct SubwordVocab<I> {
    pub indexer: I,
    pub indices: HashMap<String, usize>,
    pub words: Vec<String>,
    pub min_n: u32,
    pub max_n: u32,
}
danieldk commented 5 years ago

Edit: write_model_binary was a version mismatch between various things, fixed now. I manually changed the rand/rng/zipf dependencies and everything is working; I'm in sync with finalfusion 0.11.

Should be resolved permanently once #90 is merged.

jguhlin commented 5 years ago

Awesome, I saw that. Thanks!

danieldk commented 3 years ago

Closing this issue. Feel free to open another one if necessary!