apcamargo / genomad

geNomad: Identification of mobile genetic elements
https://portal.nersc.gov/genomad/

The xgboost model in the code and how to train the xgboost model #63

Closed ZoeLct closed 5 months ago

ZoeLct commented 6 months ago

Hello, as a beginner in this field, I have a few questions that I hope you can help with. First, is the XGBoost model you are using pre-trained? If so, could you share the model file? I would also like to understand how the XGBoost model in this code is trained. What dataset is used, and how is it applied? Do you have any training scripts you could share? I hope my questions are not an inconvenience. Thank you in advance for your time and guidance.

apcamargo commented 6 months ago

Hi @ZoeLct,

Yes, the model is pre-trained. You can download it here.

I don't have the code to train the model anymore, but it is very simple to replicate (see here). You can find all the training data in Zenodo: https://zenodo.org/records/8049246

The code to generate features from prodigal-gv's and MMseqs2's outputs can be found here.