gagneurlab / SpeciesLM

MIT License

Fungal Species Token #3

Closed aashutoshb97 closed 9 months ago

aashutoshb97 commented 9 months ago

Hi, first of all, great work! I am keen to use your model to explore fungal species such as Pichia. Is there a documented list of the proxy_species or organism labels used in the model? It appears to work with 'Pichia kudriavzevii str. CBS573', but I'm not sure whether it falls back to a default proxy_species or performs a closest match. I would greatly appreciate it if you could share the list of proxy_species or organism labels used in the model. Thanks.

Karollus commented 9 months ago

Hi,

The species tokens are stored in the tokenizer, under the variable additional_special_tokens. E.g. you can run [x for x in tokenizer.additional_special_tokens if x.startswith("pichia")]

Which should return: ['pichia_kudriavzevii_gca_000764455', 'pichia_kudriavzevii_gca_001983325', 'pichia_kudriavzevii_gca_002166775', 'pichia_kudriavzevii_gca_003054445', 'pichia_membranifaciens_gca_001950575', 'pichia_membranifaciens_nrrl_y_2026_gca_001661235']
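As a small sketch of the same lookup wrapped in a reusable helper: the function name find_species_tokens and the example_tokens list are hypothetical, and the list is just a stand-in copied from the output above. With a loaded tokenizer you would pass tokenizer.additional_special_tokens instead (loading the tokenizer is not shown here).

```python
def find_species_tokens(special_tokens, prefix):
    """Return the species tokens that start with `prefix` (case-insensitive).

    `special_tokens` is any iterable of strings, e.g.
    tokenizer.additional_special_tokens from a loaded tokenizer.
    """
    prefix = prefix.lower()
    return [t for t in special_tokens if t.lower().startswith(prefix)]


# Stand-in for tokenizer.additional_special_tokens (assumption: only the
# Pichia entries quoted in this thread; the real list is much longer).
example_tokens = [
    "pichia_kudriavzevii_gca_000764455",
    "pichia_kudriavzevii_gca_001983325",
    "pichia_kudriavzevii_gca_002166775",
    "pichia_kudriavzevii_gca_003054445",
    "pichia_membranifaciens_gca_001950575",
    "pichia_membranifaciens_nrrl_y_2026_gca_001661235",
]

# Narrow the search from the genus to a single species.
kudriavzevii_tokens = find_species_tokens(example_tokens, "pichia_kudriavzevii")
for token in kudriavzevii_tokens:
    print(token)
```

Searching by genus first and then by species makes it easy to see which assemblies are available before picking one as proxy.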

I should warn that if you use a token that is not in there, it does not throw an error; instead it is converted to the [UNK] token (for "unknown"), which is probably not what you want. I have no idea what happens to the predictions in that case. That being said, if you have a different strain of one of the species listed above, you can probably safely use one of these as a proxy: I am guessing the regulatory mechanisms should not change much between strains of the same species, and we get good results using quite diverged species as proxies for S. cerevisiae.
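Since an unknown token is silently mapped to [UNK], it may be worth checking the token yourself before running the model. Below is a minimal sketch of such a guard, not something shipped with the model: require_known_species_token and known_tokens are hypothetical names, and known_tokens stands in for tokenizer.additional_special_tokens.

```python
def require_known_species_token(token, special_tokens):
    """Return `token` unchanged if it is a known species token.

    Raises ValueError otherwise, so a typo or an unsupported strain fails
    loudly instead of silently becoming [UNK].
    """
    if token not in set(special_tokens):
        raise ValueError(
            f"{token!r} is not a known species token; "
            "pick a proxy from the tokenizer's additional_special_tokens"
        )
    return token


# Stand-in for tokenizer.additional_special_tokens (assumption: a subset
# taken from the Pichia tokens listed in this thread).
known_tokens = [
    "pichia_kudriavzevii_gca_000764455",
    "pichia_kudriavzevii_gca_001983325",
    "pichia_membranifaciens_gca_001950575",
]

# A token that exists passes through unchanged.
proxy = require_known_species_token("pichia_kudriavzevii_gca_000764455", known_tokens)
print(proxy)

# A made-up strain label (hypothetical, for illustration) is rejected.
try:
    require_known_species_token("pichia_kudriavzevii_cbs573", known_tokens)
except ValueError as err:
    print(err)
```

If your strain is missing, the advice above applies: pick another token of the same species as proxy and pass that one through the check.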

Hope this helps

aashutoshb97 commented 9 months ago

Hi, Thank you for the quick response. That certainly helped and I also observe improvements in model results (R2 score went from 0.21 to 0.32).