clulab / processors

Natural Language Processors
https://clulab.github.io/processors/

Added code to generate embeddings #765

Closed RazvanDu closed 7 months ago

RazvanDu commented 7 months ago

The code can generate custom sub-word embeddings based on a model type and a data set.

RazvanDu commented 7 months ago

Alright, I fixed a small typo, removed the useless 'crr', and added seeds. Let me know if there's anything else to implement.

kwalcock commented 7 months ago

Thanks for adding the seed. I did look up numpy.random.seed and you might want to read it in case it is ever important. If, for example, any of the dependencies sets the seed itself or for some reason generates an extra random number because it is Tuesday, your reproducibility might not work. It might be worth documenting that -1 is a special value that means not to set the seed at all.
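To make the concern concrete, here is a minimal sketch of one way to guard against it, assuming the script can pass its own generator around instead of relying on NumPy's global state. The names `NO_SEED` and `make_rng` are hypothetical and are not taken from this PR; only `numpy.random.default_rng` and `numpy.random.seed` are real NumPy calls.

```python
import numpy as np

# Hypothetical sentinel: -1 means "do not seed", so runs are not reproducible.
NO_SEED = -1


def make_rng(seed: int = NO_SEED) -> np.random.Generator:
    """Return a private generator that is independent of NumPy's global state."""
    if seed == NO_SEED:
        # No seed given: NumPy pulls fresh entropy from the operating system.
        return np.random.default_rng()
    return np.random.default_rng(seed)


rng_a = make_rng(42)
rng_b = make_rng(42)

# Simulate a dependency that reseeds the global generator and draws from it.
np.random.seed(0)
np.random.random()

# The private generators are unaffected, so both produce the same sequence.
assert rng_a.integers(0, 100, size=5).tolist() == rng_b.integers(0, 100, size=5).tolist()
```

With a private `Generator`, a reseed or an extra draw on the global state elsewhere does not shift the sequence. Code that calls `np.random.*` functions directly would still depend on the global seed.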

For crr I was just wondering what the letters mean. Google didn't help me with it.

I'll ask again about the sampling in person and in the meantime check this in. Change anything if/when you like, of course.

RazvanDu commented 7 months ago

Oh that's interesting, I'll look into it soon.

Crr is short for "current"; I use it when I write code in a rush/very fast.

Alright, thanks!