bio-tools / biotoolsRegistry

biotoolsregistry : discovery portal for bioinformatics
GNU General Public License v3.0

Request to include MultiMolecule in bio.tools #598

Closed ZhiyuanChen closed 1 month ago

ZhiyuanChen commented 2 months ago

Hi there,

Thank you for this wonderful registry.

Recently I have been developing MultiMolecule.

MultiMolecule is designed to be a deep learning toolkit for molecular biology.

Our goal is to make deep learning methods accessible to everyone in the community.

Currently, our library includes many pre-trained deep learning models for RNA, along with the datasets used to train them. The pre-trained weights and datasets are also available on our 🤗 hub in a unified format for easier access.

We are working on adding the training scripts so that everyone can train / fine-tune their own machine learning models in a few clicks.

magnuspalmblad commented 2 months ago

I made a draft (incomplete) entry, but I am not sure this is a "tool" as much as it is a training resource for applying machine learning to nucleotide and protein sequence data, (quite) analogous to ProteomicsML for mass spectrometry-based proteomics data. Perhaps this could be registered as training material in TeSS rather than as a tool in bio.tools? If so, the bio.tools entry can be removed.

The Apache Parquet format should be added to EDAM, if it hasn't been already.

ZhiyuanChen commented 2 months ago

Thank you for your quick response!

We do provide many resources (models and datasets), but the core of MultiMolecule is to be a framework (a tool) for users who want to run machine learning models on their own data.

As it's still at an early stage, we are currently focusing on pre-training: we provide pre-training datasets for those who design their own networks, and existing pre-trained models so they can compare their methods with the current SOTA.

We are working on the fine-tuning part. In this stage, most people (who have a GPU) can fine-tune a model (from the Hugging Face community or from the pre-trained models we provide) on their own data with a single command.
We have done a lot of work to allow the framework to recognise the user's dataset automatically (so that users do not need to specially prepare a data file). This part is almost complete, and we are still waiting for user feedback for improvements.

I hope that, in the next stage, we can provide pre-defined pipelines and fine-tuned models, so that everyone can apply machine learning algorithms and run inference on their own data (without needing a GPU) in one line of code.

veitveit commented 1 month ago

I assume this can be considered solved.